@zhixuan92/multi-model-agent 3.8.0 → 3.8.1

package/README.md CHANGED
@@ -82,7 +82,7 @@ Two ways — pick one:
82
82
 
83
83
  ```bash
84
84
  mmagent serve # 127.0.0.1:7337 by default
85
- curl -s http://localhost:7337/health # → {"ok":true,"version":"3.8.0",...}
85
+ curl -s http://localhost:7337/health # → {"ok":true,"version":"3.8.1",...}
86
86
  ```
87
87
 
88
88
  For an always-on background install (survives reboots): [launchd / systemd templates](./scripts/README.md).
@@ -98,6 +98,72 @@ mmagent update-skills # refresh installed skills
98
98
 
99
99
  A drift warning prints on `mmagent serve` if installed skills are older than the daemon. To rotate the auth token: `rm ~/.multi-model/auth-token && mmagent serve`.
100
100
 
101
+ ## Skills
102
+
103
+ Skills are the surface your AI client sees. `mmagent install-skill` writes them to the client's skill directory; the client then picks the right one based on what you ask. You don't call them by hand — you describe the work, the client routes it to the matching skill, the skill calls the matching REST endpoint.
104
+
105
+ ### Work-delegation skills
106
+
107
+ | Skill | Target endpoint | Use when |
108
+ |---|---|---|
109
+ | `mma-delegate` | `POST /delegate` | Ad-hoc implementation or research tasks **without** a plan file — run them in parallel on cheap workers. |
110
+ | `mma-execute-plan` | `POST /execute-plan` | A plan / spec markdown exists on disk with numbered task headings; implement one or more tasks from it. |
111
+ | `mma-investigate` | `POST /investigate` | Answer a question about *this* codebase ("how does X work", "where is Y called") without burning main-context tokens on grep + reads. |
112
+ | `mma-debug` | `POST /debug` | A test fails, a build breaks, or behavior is unexpected — delegate the reproduce/trace, keep the hypothesis on the main agent. |
113
+ | `mma-review` | `POST /review` | Source-code review (pre-merge, post-implementation, security-focused). One worker per file, in parallel. |
114
+ | `mma-audit` | `POST /audit` | Audit a prose document — spec, config, PR description — for correctness, security, or style. |
115
+ | `mma-verify` | `POST /verify` | Check acceptance criteria against finished work *before* claiming done. One worker per checklist item. |
116
+
117
+ ### Plumbing skills
118
+
119
+ | Skill | Target endpoint | Use when |
120
+ |---|---|---|
121
+ | `mma-context-blocks` | `POST/DELETE /context-blocks` | The same large doc (>~2 KB) will be referenced by 2+ subsequent mma-* calls — register once, pass the ID instead of re-uploading. |
122
+ | `mma-clarifications` | `POST /clarifications/confirm` | A previous batch's terminal envelope returned a `proposedInterpretation` string — the service is paused waiting for you to confirm or correct its read. |
123
+ | `mma-retry` | `POST /retry` | A previous batch came back partial — re-run only the failed indices without re-dispatching the whole batch. |
124
+
125
+ The `multi-model-agent` skill (no `mma-` prefix) is a top-level overview your client reads first to pick which `mma-*` skill applies.
126
+
127
+ ### Two generic usage samples
128
+
129
+ **Sample 1 — implement a feature from a plan**
130
+
131
+ ```
132
+ You: "Execute tasks 3, 4, and 5 from docs/plans/auth-rewrite.md"
133
+
134
+ Client picks mma-execute-plan (plan file on disk, multiple independent tasks)
135
+
136
+ mmagent dispatches 3 workers in parallel on the standard agent (e.g. MiniMax-M2.7),
137
+ each runs cross-agent review on the complex agent, returns a structured report.
138
+
139
+ You see one consolidated headline: "$0.04 actual / $1.20 saved vs claude-opus-4-7 (30× ROI)"
140
+ ```
141
+
142
+ **Sample 2 — debug a failing test (multiple skills chained)**
143
+
144
+ ```
145
+ You: "tests/auth/session.test.ts is failing intermittently after the token-refresh refactor — figure it out and fix it"
146
+
147
+ Step 1 — mma-context-blocks
148
+ The failing test output + the refactor diff are ~8 KB and will be referenced by every
149
+ downstream call. Register once, get a contextBlockId, reuse it.
150
+
151
+ Step 2 — mma-debug
152
+ Worker reproduces the failure, traces across session.ts + token-refresh.ts, returns a
153
+ root-cause hypothesis: "race between refresh-in-flight and session.invalidate()".
154
+ Main agent stays on the hypothesis, decides the fix shape.
155
+
156
+ Step 3 — mma-delegate
157
+ Dispatch the actual code change as an ad-hoc task (no plan file). Worker writes the
158
+ fix, runs the failing test 20× to confirm the race is gone.
159
+
160
+ Step 4 — mma-verify
161
+ One worker per acceptance criterion: (a) failing test now passes, (b) no other
162
+ auth tests regressed, (c) refresh path still emits the expected telemetry.
163
+
164
+ Total cost: ~$0.08. Main-context tokens consumed: just the hypotheses and the verdicts.
165
+ ```
166
+
101
167
  ## Configuration reference
102
168
 
103
169
  ### Lookup order
@@ -199,24 +265,6 @@ mmagent telemetry reset-id # rotate the local Ed25519 iden
199
265
  mmagent telemetry dump-queue # print the locally-queued events as JSON
200
266
  ```
201
267
 
202
- ## Shipped skills
203
-
204
- Skills are Markdown prompts that tell your AI client when and how to call each endpoint. `mmagent install-skill` inlines the shared auth/polling patterns at install time.
205
-
206
- | Skill | Target endpoint |
207
- |---|---|
208
- | `multi-model-agent` | Overview + skill map (read first to pick the right `mma-*` skill) |
209
- | `mma-delegate` | `POST /delegate` |
210
- | `mma-audit` | `POST /audit` |
211
- | `mma-review` | `POST /review` |
212
- | `mma-verify` | `POST /verify` |
213
- | `mma-debug` | `POST /debug` |
214
- | `mma-execute-plan` | `POST /execute-plan` |
215
- | `mma-retry` | `POST /retry` |
216
- | `mma-investigate` | `POST /investigate` |
217
- | `mma-context-blocks` | `POST/DELETE /context-blocks` |
218
- | `mma-clarifications` | `POST /clarifications/confirm` |
219
-
220
268
  ## Architecture
221
269
 
222
270
  `mmagent serve` runs a loopback HTTP server. Each tool call dispatches to a labor agent (standard or complex), runs a cross-agent review cycle, and returns a structured report. Tasks run in parallel; each has a cost ceiling and wall-clock timeout.
@@ -237,7 +285,7 @@ Full design rationale: [DIRECTION.md](https://github.com/zhixuan312/multi-model-
237
285
 
238
286
  ## What's new
239
287
 
240
- Latest: **3.8.0** — read-only reviewed lifecycle: all 5 read-only routes (audit, review, verify, investigate, debug) now run a single `quality_only` review with bounded rework, structured `findings[]` worker output, and forced cross-tier review (worker complex, reviewer standard). Verify worker tier upgraded to complex. `MMAGENT_READ_ONLY_REVIEW` kill switch for rollback. Full history: [CHANGELOG](https://github.com/zhixuan312/multi-model-agent/blob/master/CHANGELOG.md).
288
+ Latest: **3.8.1** — read-only review becomes annotation, not gating. The 5 read-only routes (audit, review, verify, investigate, debug) now run a single reviewer pass that annotates each worker finding with `reviewerConfidence` (0-100) and an optional `reviewerSeverity` correction — no rework loop, restoring 3.7.0-comparable wall-clock. `Finding` schema simplified (drop `file`/`line`/`sourceQuote`; required `evidence`; rename `suggestedFix` to `suggestion`). Full history: [CHANGELOG](https://github.com/zhixuan312/multi-model-agent/blob/master/CHANGELOG.md).
241
289
 
242
290
  ## Full documentation
243
291
 
@@ -3,34 +3,46 @@
3
3
  ### POST /<tool>?cwd=<abs> — dispatch response (202)
4
4
 
5
5
  ```json
6
- {
7
- "batchId": "<uuid>",
8
- "state": "pending"
9
- }
6
+ { "batchId": "<uuid>", "statusUrl": "/batch/<uuid>" }
10
7
  ```
11
8
 
9
+ Use `batchId` to poll. `statusUrl` is a convenience pointer.
10
+
12
11
  ### GET /batch/:id — polling response
13
12
 
13
+ The HTTP status is the state discriminator:
14
+
15
+ | Status | Meaning |
16
+ |---|---|
17
+ | `202 text/plain` | Still pending — body is the running headline string (e.g. `"1/2 running, 47s elapsed"`) |
18
+ | `200 application/json` | Terminal — body is the uniform 7-field envelope below |
19
+ | `404` / `401` / `5xx` | Error — see Error response below; stop polling |
20
+
21
+ The terminal JSON envelope always has these 7 fields. Each may be a real value or a `not_applicable` sentinel:
22
+
14
23
  ```json
15
24
  {
16
- "batchId": "<uuid>",
17
- "state": "pending | running | awaiting_clarification | complete | failed | expired",
18
- "proposedInterpretation": "<string>",
19
- "results": [ ... ],
20
25
  "headline": "<string>",
21
- "batchTimings": { ... },
22
- "costSummary": { ... }
26
+ "results": [ /* per-task result objects */ ],
27
+ "batchTimings": { /* timings */ },
28
+ "costSummary": { /* cost roll-up */ },
29
+ "structuredReport": { /* parsed sections */ },
30
+ "error": { "kind": "not_applicable", "reason": "batch succeeded" },
31
+ "proposedInterpretation": { "kind": "not_applicable", "reason": "batch not awaiting clarification" }
23
32
  }
24
33
  ```
25
34
 
26
- `proposedInterpretation` is only present when `state` is `awaiting_clarification`.
35
+ Read the envelope by the shape of `error` and `proposedInterpretation`:
27
36
 
28
- `results`, `headline`, `batchTimings`, and `costSummary` are only present
29
- when `state` is `complete` or `failed`.
37
+ | Shape | Meaning |
38
+ |---|---|
39
+ | `error` is a real object (with `code` / `message`) | Batch failed — read `error.code` + `error.message` |
40
+ | `proposedInterpretation` is a string | Batch is awaiting clarification — invoke `mma-clarifications` |
41
+ | Both are `{kind: "not_applicable", ...}` sentinels | Batch succeeded — read `results` |
30
42
 
31
43
  ### GET /batch/:id?taskIndex=N — single task slice
32
44
 
33
- Returns the same shape but `results` contains only the task at index `N`.
45
+ Same 7-field envelope. `results` contains exactly the task at index `N`. Returns `404 unknown_task_index` if `N` is out of range.
34
46
 
35
47
  ### Error response (4xx / 5xx)
36
48
 
@@ -38,9 +50,8 @@ Returns the same shape but `results` contains only the task at index `N`.
38
50
  {
39
51
  "error": "<code>",
40
52
  "message": "<human-readable>",
41
- "details": { ... }
53
+ "details": { /* optional structured context, e.g. fieldErrors for 400 */ }
42
54
  }
43
55
  ```
44
56
 
45
- `details` is optional and present only when the server has structured
46
- additional context (e.g. `fieldErrors` for validation failures).
57
+ `details` is optional and present only when the server has structured additional context.
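As a reading aid for the polling contract in this file, here is a minimal TypeScript sketch of how a client might interpret `GET /batch/:id`. The names `pollState`, `classifyEnvelope`, `TerminalEnvelope`, and `NotApplicable` are hypothetical and not part of the package. Treating every status other than 200 and 202 as an error is also an assumption; the text only lists 404, 401, and 5xx.

```typescript
// Sentinel the envelope uses for fields that do not apply.
type NotApplicable = { kind: "not_applicable"; reason: string };

// Only the two discriminating fields are typed; the other five stay opaque here.
interface TerminalEnvelope {
  headline: unknown;
  results: unknown;
  batchTimings: unknown;
  costSummary: unknown;
  structuredReport: unknown;
  error: NotApplicable | { code: string; message: string };
  proposedInterpretation: NotApplicable | string;
}

type PollState = "pending" | "terminal" | "error";

// The HTTP status is the state discriminator for GET /batch/:id.
function pollState(status: number): PollState {
  if (status === 202) return "pending"; // body is the running headline (text/plain)
  if (status === 200) return "terminal"; // body is the 7-field JSON envelope
  return "error"; // assumption: anything else means stop polling
}

function isNotApplicable(v: unknown): v is NotApplicable {
  return (
    typeof v === "object" &&
    v !== null &&
    (v as { kind?: unknown }).kind === "not_applicable"
  );
}

type Outcome = "failed" | "awaiting_clarification" | "succeeded";

// Read the envelope by the shape of `error` and `proposedInterpretation`.
function classifyEnvelope(env: TerminalEnvelope): Outcome {
  if (!isNotApplicable(env.error)) return "failed";
  if (typeof env.proposedInterpretation === "string") return "awaiting_clarification";
  return "succeeded";
}
```

The sketch checks `error` first so that a failed batch is never read as a pending clarification. The documentation does not say whether both fields can be set at once, so that ordering is a judgment call.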
@@ -8,7 +8,7 @@ when_to_use: >-
8
8
  User asks for a doc/spec/config audit OR a methodology skill
9
9
  (superpowers:dispatching-parallel-agents, /security-review) points at one AND
10
10
  mmagent is running. Audit on PROSE/SPEC docs — use mma-review for source code.
11
- version: 3.8.0
11
+ version: 3.8.1
12
12
  ---
13
13
 
14
14
  # mma-audit
@@ -72,19 +72,26 @@ BATCH_ID=$(echo "$BATCH" | jq -r '.batchId')
72
72
 
73
73
  @include _shared/response-shape.md
74
74
 
75
- ## Reading the review verdicts
75
+ ## Reading the review verdicts (annotation model — 3.8.1+)
76
76
 
77
- The terminal envelope now includes:
77
+ The terminal envelope includes:
78
78
  - `specReviewVerdict: 'not_applicable'` — read-only routes have no spec review stage.
79
- - `qualityReviewVerdict` — verdict from the cross-agent quality review.
80
- - `roundsUsed` — number of worker attempts (`1` = approved on first try; `2`+ = rework rounds; `0` = review topology disabled via env var).
79
+ - `qualityReviewVerdict` — outcome of the single annotation pass.
80
+ - `roundsUsed` — `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
81
+
82
+ There is no rework loop. The reviewer annotates each finding in place and exits — never gates, never causes the worker to re-run.
81
83
 
82
84
  Action per `qualityReviewVerdict`:
83
- - `'approved'` — findings are grounded; act on them.
84
- - `'changes_required'` — the worker reworked but couldn't fully satisfy the reviewer at the rework cap. Drill into individually flagged findings before acting.
85
- - `'concerns'` — non-blocking issues raised; proceed but read the per-finding feedback.
86
- - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW`) disabled review for this route. Treat output as today.
87
- - `'error'` — reviewer call failed (transport, rate-limit). No attestation; fall back to caution.
85
+ - `'annotated'` — every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
86
+ - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_AUDIT=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
87
+ - `'error'` — reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
88
+
89
+ ### Per-finding reviewer fields
90
+
91
+ Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
92
+
93
+ - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
94
+ - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
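The triage advice above can be sketched in TypeScript as follows. This is an illustration only: the `Finding` interface is inferred from the field list in this section, the string types for `id`, `claim`, and `evidence` and the three-value `severity` on the worker side are assumptions, and `effectiveSeverity` and `triage` are hypothetical helper names.

```typescript
type Severity = "high" | "medium" | "low";

// Shape inferred from the documented field list; field types are assumed.
interface Finding {
  id: string;
  severity: Severity;
  claim: string;
  evidence: string;
  suggestion?: string;
  reviewerConfidence?: number; // integer 0-100, set by a successful annotation pass
  reviewerSeverity?: Severity; // set only when the reviewer disagrees with `severity`
}

// Trust `reviewerSeverity` over the worker's `severity` when it is present.
function effectiveSeverity(f: Finding): Severity {
  return f.reviewerSeverity ?? f.severity;
}

// Drop annotated findings below the threshold, then sort by confidence, highest first.
// Findings without reviewer fields (verdict 'skipped' or 'error') are kept and sorted last.
function triage(findings: Finding[], minConfidence = 70): Finding[] {
  return findings
    .filter((f) => f.reviewerConfidence === undefined || f.reviewerConfidence >= minConfidence)
    .sort((a, b) => (b.reviewerConfidence ?? -1) - (a.reviewerConfidence ?? -1));
}
```

Keeping unannotated findings rather than discarding them is a design choice of this sketch: when the reviewer was skipped or errored, the findings are raw worker output and still need a human read.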
88
95
 
89
96
  ## Best practices
90
97
 
@@ -12,7 +12,7 @@ when_to_use: >-
12
12
  `proposedInterpretation` is a hard gate — the batch is paused, not
13
13
  informational. The batch will not complete until the caller responds. Treating
14
14
  it as advisory is the clarification-as-info anti-pattern (AP5).
15
- version: 3.8.0
15
+ version: 3.8.1
16
16
  ---
17
17
 
18
18
  # mma-clarifications
@@ -12,7 +12,7 @@ when_to_use: >-
12
12
  Register once here, then pass the ID via `contextBlockIds` on mma-delegate /
13
13
  mma-execute-plan / mma-audit / mma-review / mma-verify / mma-debug /
14
14
  mma-investigate. Cheaper and faster than inlining the same content N times.
15
- version: 3.8.0
15
+ version: 3.8.1
16
16
  ---
17
17
 
18
18
  # mma-context-blocks
@@ -10,7 +10,7 @@ when_to_use: >-
10
10
  read files, reproduce, trace — OR a methodology skill
11
11
  (superpowers:systematic-debugging) points at the investigation step. Delegate
12
12
  the read/reproduce/trace; the main agent stays on the hypothesis and the fix.
13
- version: 3.8.0
13
+ version: 3.8.1
14
14
  ---
15
15
 
16
16
  # mma-debug
@@ -78,19 +78,26 @@ BATCH_ID=$(echo "$BATCH" | jq -r '.batchId')
78
78
 
79
79
  @include _shared/response-shape.md
80
80
 
81
- ## Reading the review verdicts
81
+ ## Reading the review verdicts (annotation model — 3.8.1+)
82
82
 
83
- The terminal envelope now includes:
83
+ The terminal envelope includes:
84
84
  - `specReviewVerdict: 'not_applicable'` — read-only routes have no spec review stage.
85
- - `qualityReviewVerdict` — verdict from the cross-agent quality review.
86
- - `roundsUsed` — number of worker attempts (`1` = approved on first try; `2`+ = rework rounds; `0` = review topology disabled via env var).
85
+ - `qualityReviewVerdict` — outcome of the single annotation pass.
86
+ - `roundsUsed` — `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
87
+
88
+ There is no rework loop. The reviewer annotates each finding in place and exits — never gates, never causes the worker to re-run.
87
89
 
88
90
  Action per `qualityReviewVerdict`:
89
- - `'approved'` — findings are grounded; act on them.
90
- - `'changes_required'` — the worker reworked but couldn't fully satisfy the reviewer at the rework cap. Drill into individually flagged findings before acting.
91
- - `'concerns'` — non-blocking issues raised; proceed but read the per-finding feedback.
92
- - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW`) disabled review for this route. Treat output as today.
93
- - `'error'` — reviewer call failed (transport, rate-limit). No attestation; fall back to caution.
91
+ - `'annotated'` — every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
92
+ - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_DEBUG=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
93
+ - `'error'` — reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
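To make the kill-switch rule concrete, here is a small TypeScript sketch of when a caller could expect a `'skipped'` verdict. It assumes the environment of the `mmagent serve` process is passed in as a plain record. The function name `isReviewerBypassed` is hypothetical, and the sketch only recognizes the documented value `disabled`; how the daemon treats any other value is not stated here.

```typescript
type ReadOnlyRoute = "audit" | "review" | "verify" | "investigate" | "debug";

// True when either the global switch or the per-route switch is set to "disabled".
function isReviewerBypassed(
  route: ReadOnlyRoute,
  env: Record<string, string | undefined>,
): boolean {
  const globalSwitch = env["MMAGENT_READ_ONLY_REVIEW"];
  const routeSwitch = env[`MMAGENT_READ_ONLY_REVIEW_${route.toUpperCase()}`];
  return globalSwitch === "disabled" || routeSwitch === "disabled";
}
```

For example, `isReviewerBypassed("debug", { MMAGENT_READ_ONLY_REVIEW_DEBUG: "disabled" })` returns `true`, while the same environment leaves the `audit` route reviewed.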
94
+
95
+ ### Per-finding reviewer fields
96
+
97
+ Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
98
+
99
+ - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
100
+ - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
94
101
 
95
102
  ## Best practices
96
103
 
@@ -11,7 +11,7 @@ when_to_use: >-
11
11
  and keep main context free. If a plan file exists → use mma-execute-plan. If
12
12
  the task is audit / review / verify / debug / investigate → use the matching
13
13
  specialized skill.
14
- version: 3.8.0
14
+ version: 3.8.1
15
15
  ---
16
16
 
17
17
  # mma-delegate
@@ -65,6 +65,7 @@ Dispatch one or more ad-hoc tasks to workers concurrently. Each task is an indep
65
65
  | `tasks[].filePaths` | string[] | no | Files the worker focuses on |
66
66
  | `tasks[].done` | string | no | Acceptance criteria |
67
67
  | `tasks[].contextBlockIds` | string[] | no | IDs from `mma-context-blocks` |
68
+ | `tasks[].maxCostUSD` | number | no | Per-task cost cap in USD (positive finite). Default 10 when omitted. |
68
69
  | `tasks[].verifyCommand` | string[] | no | See verify-and-review snippet below |
69
70
  | `tasks[].reviewPolicy` | `"full"` / `"spec_only"` / `"diff_only"` / `"off"` | no | See verify-and-review snippet below. Default `"full"` |
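The `maxCostUSD` rule in the table (positive finite, default 10 when omitted) can be checked on the client before dispatch. The TypeScript sketch below is an assumption about how a caller might do that; `resolveMaxCostUSD` and `DEFAULT_MAX_COST_USD` are hypothetical names, and the server's own validation may differ.

```typescript
// Documented default when `maxCostUSD` is omitted.
const DEFAULT_MAX_COST_USD = 10;

// Return the cap that would apply, or throw if the value is not positive and finite.
function resolveMaxCostUSD(value?: number): number {
  if (value === undefined) return DEFAULT_MAX_COST_USD;
  if (!Number.isFinite(value) || value <= 0) {
    throw new RangeError(`maxCostUSD must be a positive finite number, got ${value}`);
  }
  return value;
}
```

Rejecting bad values locally avoids a round trip that would presumably end in a validation error from the server.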
70
71
 
@@ -10,7 +10,7 @@ when_to_use: >-
10
10
  superpowers:subagent-driven-development / superpowers:executing-plans —
11
11
  workers are cheaper and don't pollute main context. Task descriptors must
12
12
  match plan headings verbatim.
13
- version: 3.8.0
13
+ version: 3.8.1
14
14
  ---
15
15
 
16
16
  # mma-execute-plan
@@ -52,8 +52,7 @@ Dispatch named tasks from a plan file to workers. Each `tasks` string must match
52
52
  "/project/docs/plan.md",
53
53
  "/project/src/auth/login.ts"
54
54
  ],
55
- "contextBlockIds": [],
56
- "agentType": "standard"
55
+ "contextBlockIds": []
57
56
  }
58
57
  ```
59
58
 
@@ -63,12 +62,14 @@ Dispatch named tasks from a plan file to workers. Each `tasks` string must match
63
62
  | `context` | string | no | Short additional context not in the plan |
64
63
  | `filePaths` | string[] | no | Plan file + relevant source files. Required: the plan file itself. |
65
64
  | `contextBlockIds` | string[] | no | IDs from `mma-context-blocks` |
66
- | `agentType` | `"standard"` / `"complex"` | no | Default `"standard"`. Use `"complex"` for tasks too large for the standard tier — reads many files, produces many edits, or the last run came back with `filesWritten: 0`. |
65
+ | `maxCostUSD` | number | no | Per-task cost cap in USD (positive finite). Default 10 when omitted. |
67
66
  | `verifyCommand` | string[] | no | See verify-and-review snippet below |
68
67
  | `tasks[].reviewPolicy` | `"full"` / `"spec_only"` / `"diff_only"` / `"off"` | no | See verify-and-review snippet below. Default `"full"`. |
69
68
 
70
69
  @include _shared/verify-and-review.md
71
70
 
71
+ > **No `agentType` here.** Worker tier is set by the plan and per-route defaults. For ad-hoc work where you need direct tier control, use `mma-delegate`.
72
+
72
73
  If the batch reaches `awaiting_clarification`, use `mma-clarifications` to confirm or correct the proposed interpretation.
73
74
 
74
75
  ## Full example
@@ -12,7 +12,7 @@ when_to_use: >-
12
12
  git-history queries. OR you are about to read 3+ files / run any grep in main
13
13
  context — that's the inline-labor-leakage anti-pattern (AP2); delegate to this
14
14
  skill instead.
15
- version: 3.8.0
15
+ version: 3.8.1
16
16
  ---
17
17
 
18
18
  # mma-investigate
@@ -76,9 +76,10 @@ digraph when_to_use {
76
76
  | `question` | string | yes | Natural-language investigation question |
77
77
  | `filePaths` | string[] | no | Anchor paths the worker starts from. Worker may grep beyond. |
78
78
  | `contextBlockIds` | string[] | no | IDs from `mma-context-blocks` — enables follow-up / delta investigation |
79
- | `agentType` | `'standard' \| 'complex'` | no | Caller override of the route default (`'complex'`) |
80
79
  | `tools` | `'none' \| 'readonly'` | no | Default `'readonly'`. `'no-shell'` and `'full'` are rejected — investigation is read-only |
81
80
 
81
+ > Worker tier for `mma-investigate` is hardcoded to `complex` and is not caller-configurable. Sending `agentType` is rejected with HTTP 400.
82
+
82
83
  **Anchor narrow questions with `filePaths`:**
83
84
 
84
85
  ❌ `{ "question": "Where is parseConfig called?" }` — searches the whole repo
@@ -123,19 +124,26 @@ Each task carries an `investigation` field on its per-task report:
123
124
 
124
125
  `workerStatus` is one of `done`, `done_with_concerns`, `needs_context`, `blocked`. When `done_with_concerns`, the per-task report carries `incompleteReason` (`turn_cap`, `cost_cap`, `timeout`, or `missing_sections`). When `needs_context`, the worker flagged a `[needs_context]` bullet under `## Unresolved` — re-dispatch with extra context (anchor paths, a context block, or a clarification turn).
125
126
 
126
- ## Reading the review verdicts
127
+ ## Reading the review verdicts (annotation model — 3.8.1+)
127
128
 
128
- The terminal envelope now includes:
129
+ The terminal envelope includes:
129
130
  - `specReviewVerdict: 'not_applicable'` — read-only routes have no spec review stage.
130
- - `qualityReviewVerdict` — verdict from the cross-agent quality review.
131
- - `roundsUsed` — number of worker attempts (`1` = approved on first try; `2`+ = rework rounds; `0` = review topology disabled via env var).
131
+ - `qualityReviewVerdict` — outcome of the single annotation pass.
132
+ - `roundsUsed` — `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
133
+
134
+ There is no rework loop. The reviewer annotates each finding in place and exits — never gates, never causes the worker to re-run.
132
135
 
133
136
  Action per `qualityReviewVerdict`:
134
- - `'approved'` — findings are grounded; act on them.
135
- - `'changes_required'` — the worker reworked but couldn't fully satisfy the reviewer at the rework cap. Drill into individually flagged findings before acting.
136
- - `'concerns'` — non-blocking issues raised; proceed but read the per-finding feedback.
137
- - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW`) disabled review for this route. Treat output as today.
138
- - `'error'` — reviewer call failed (transport, rate-limit). No attestation; fall back to caution.
137
+ - `'annotated'` — every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
138
+ - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_INVESTIGATE=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
139
+ - `'error'` — reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
140
+
141
+ ### Per-finding reviewer fields
142
+
143
+ Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
144
+
145
+ - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
146
+ - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
139
147
 
140
148
  ## Best practices
141
149
 
@@ -10,7 +10,7 @@ when_to_use: >-
10
10
  you want to re-try the failed indices only. Prefer this over re-dispatching
11
11
  the whole batch or inline-retrying — it's idempotent and preserves the
12
12
  original batch's diagnostics.
13
- version: 3.8.0
13
+ version: 3.8.1
14
14
  ---
15
15
 
16
16
  # mma-retry
@@ -10,7 +10,7 @@ when_to_use: >-
10
10
  AND mmagent is running. Delegate so each file reviews on its own worker; the
11
11
  main agent only decides what to merge. Review on SOURCE CODE — use mma-audit
12
12
  for prose specs / configs.
13
- version: 3.8.0
13
+ version: 3.8.1
14
14
  ---
15
15
 
16
16
  # mma-review
@@ -75,19 +75,26 @@ BATCH_ID=$(echo "$BATCH" | jq -r '.batchId')
75
75
 
76
76
  @include _shared/response-shape.md
77
77
 
78
- ## Reading the review verdicts
78
+ ## Reading the review verdicts (annotation model — 3.8.1+)
79
79
 
80
- The terminal envelope now includes:
80
+ The terminal envelope includes:
81
81
  - `specReviewVerdict: 'not_applicable'` — read-only routes have no spec review stage.
82
- - `qualityReviewVerdict` — verdict from the cross-agent quality review.
83
- - `roundsUsed` — number of worker attempts (`1` = approved on first try; `2`+ = rework rounds; `0` = review topology disabled via env var).
82
+ - `qualityReviewVerdict` — outcome of the single annotation pass.
83
+ - `roundsUsed` — `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
84
+
85
+ There is no rework loop. The reviewer annotates each finding in place and exits — never gates, never causes the worker to re-run.
84
86
 
85
87
  Action per `qualityReviewVerdict`:
86
- - `'approved'` — findings are grounded; act on them.
87
- - `'changes_required'` — the worker reworked but couldn't fully satisfy the reviewer at the rework cap. Drill into individually flagged findings before acting.
88
- - `'concerns'` — non-blocking issues raised; proceed but read the per-finding feedback.
89
- - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW`) disabled review for this route. Treat output as today.
90
- - `'error'` — reviewer call failed (transport, rate-limit). No attestation; fall back to caution.
88
+ - `'annotated'` — every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
89
+ - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_REVIEW=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
90
+ - `'error'` — reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
91
+
92
+ ### Per-finding reviewer fields
93
+
94
+ Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
95
+
96
+ - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
97
+ - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
91
98
 
92
99
  ## Best practices
93
100
 
@@ -10,7 +10,7 @@ when_to_use: >-
10
10
  against implemented work BEFORE claiming success. Delegate so each checklist
11
11
  item gets independent evidence-gathering on a worker. Use this BEFORE saying
12
12
  "done" — never after.
13
- version: 3.8.0
13
+ version: 3.8.1
14
14
  ---
15
15
 
16
16
  # mma-verify
@@ -76,19 +76,26 @@ BATCH_ID=$(echo "$BATCH" | jq -r '.batchId')
76
76
 
77
77
  @include _shared/response-shape.md
78
78
 
79
- ## Reading the review verdicts
79
+ ## Reading the review verdicts (annotation model — 3.8.1+)
80
80
 
81
- The terminal envelope now includes:
81
+ The terminal envelope includes:
82
82
  - `specReviewVerdict: 'not_applicable'` — read-only routes have no spec review stage.
83
- - `qualityReviewVerdict` — verdict from the cross-agent quality review.
84
- - `roundsUsed` — number of worker attempts (`1` = approved on first try; `2`+ = rework rounds; `0` = review topology disabled via env var).
83
+ - `qualityReviewVerdict` — outcome of the single annotation pass.
84
+ - `roundsUsed` — `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
85
+
86
+ There is no rework loop. The reviewer annotates each finding in place and exits — never gates, never causes the worker to re-run.
85
87
 
86
88
  Action per `qualityReviewVerdict`:
87
- - `'approved'` — findings are grounded; act on them.
88
- - `'changes_required'` — the worker reworked but couldn't fully satisfy the reviewer at the rework cap. Drill into individually flagged findings before acting.
89
- - `'concerns'` — non-blocking issues raised; proceed but read the per-finding feedback.
90
- - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW`) disabled review for this route. Treat output as today.
91
- - `'error'` — reviewer call failed (transport, rate-limit). No attestation; fall back to caution.
89
+ - `'annotated'` — every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
90
+ - `'skipped'` — kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_VERIFY=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
91
+ - `'error'` — reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
92
+
93
+ ### Per-finding reviewer fields
94
+
95
+ Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
96
+
97
+ - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
98
+ - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
92
99
 
93
100
  ## Best practices
94
101
 
@@ -11,7 +11,7 @@ when_to_use: >-
11
11
  tasks — AND mmagent is running. Read this once, pick the matching mma-* skill,
12
12
  and delegate there. Applies equally whether the user invoked a superpowers
13
13
  methodology skill or asked directly.
14
- version: 3.8.0
14
+ version: 3.8.1
15
15
  ---
16
16
 
17
17
  # multi-model-agent (router)
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@zhixuan92/multi-model-agent",
3
- "version": "3.8.0",
3
+ "version": "3.8.1",
4
4
  "type": "module",
5
5
  "license": "MIT",
6
6
  "description": "Standalone HTTP server for multi-model-agent. Routes tool-invocation work to Claude, Codex, or OpenAI-compatible sub-agents with async-polling REST dispatch and installable skills for Claude Code, Gemini CLI, Codex CLI, and Cursor.",
@@ -52,7 +52,7 @@
52
52
  },
53
53
  "dependencies": {
54
54
  "@asteasolutions/zod-to-openapi": "^8.5.0",
55
- "@zhixuan92/multi-model-agent-core": "^3.8.0",
55
+ "@zhixuan92/multi-model-agent-core": "^3.8.1",
56
56
  "gray-matter": "^4.0.3",
57
57
  "minimist": "^1.2.8",
58
58
  "proper-lockfile": "^4.1.2",