@zhixuan92/multi-model-agent 3.10.3 → 3.10.5

package/README.md CHANGED
@@ -84,7 +84,7 @@ Two ways — pick one:
 
  ```bash
  mmagent serve # 127.0.0.1:7337 by default
- curl -s http://localhost:7337/health # → {"ok":true,"version":"3.10.3",...}
+ curl -s http://localhost:7337/health # → {"ok":true,"version":"3.10.4",...}
  ```

  For an always-on background install (survives reboots): [launchd / systemd templates](./scripts/README.md).
@@ -287,7 +287,7 @@ Full design rationale: [DIRECTION.md](https://github.com/zhixuan312/multi-model-
 
  ## What's new

- Latest: **3.10.3** (critical telemetry hotfix + UX polish). **3.10.2 silently dropped every emitted event** because the new emit-time R4 validation tripped on a 1ms clock-skew between `runResult.durationMs` and per-stage `durationMs`. Fixed by enforcing R4 by construction in the event builder. Also: skill manifest backfill on upgrade (resolves the "mma-investigate not registered" bug for users who upgraded across v3.4.0); live-elapsed polling headlines (no more stale repeated elapsed between heartbeats); tiered server stdout (default-quiet, `--verbose` for full firehose with throttled heartbeats). Full history: [CHANGELOG](https://github.com/zhixuan312/multi-model-agent/blob/master/CHANGELOG.md).
+ Latest: **3.10.4**. Review stages were recording the implementer's model (V3 R3 violation root cause); they now record the actual reviewer's resolved tier + model. Plus: telemetry validation is fully warn-only (events never drop), and cross-field warnings now include the actual offending values (model, tokens, totals) so config issues and lifecycle bugs are distinguishable at a glance. Full history: [CHANGELOG](https://github.com/zhixuan312/multi-model-agent/blob/master/CHANGELOG.md).

  ## Full documentation
 
@@ -8,7 +8,7 @@ when_to_use: >-
  User asks for a doc/spec/config audit OR a methodology skill
  (superpowers:dispatching-parallel-agents, /security-review) points at one AND
  mmagent is running. Audit on PROSE/SPEC docs; use mma-review for source code.
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-audit
@@ -72,26 +72,42 @@ BATCH_ID=$(echo "$BATCH" | jq -r '.batchId')
 
  @include _shared/response-shape.md

- ## Reading the review verdicts (annotation model, 3.8.1+)
-
- The terminal envelope includes:
- - `specReviewVerdict: 'not_applicable'`: read-only routes have no spec review stage.
- - `qualityReviewVerdict`: outcome of the single annotation pass.
- - `roundsUsed`: `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
-
- There is no rework loop. The reviewer annotates each finding in place and exits; it never gates and never causes the worker to re-run.
-
- Action per `qualityReviewVerdict`:
- - `'annotated'`: every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
- - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_AUDIT=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
- - `'error'`: reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
-
- ### Per-finding reviewer fields
-
- Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
-
- - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
- - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
+ ## Reading the findings (3.10.5+)
+
+ The terminal envelope's `results[N].annotatedFindings` is a list of structured
+ findings the reviewer extracted and scored from the implementer's narrative.
+ Every finding has the same shape:
+
+ | Field | Type | Notes |
+ |---|---|---|
+ | `id` | string | Reviewer-assigned, e.g. `F1`, `F2`. |
+ | `severity` | `'critical' \| 'high' \| 'medium' \| 'low'` | 4-tier. |
+ | `claim` | string | One-sentence summary. |
+ | `evidence` | string ≥20 chars | Quoted from worker output when grounded. |
+ | `suggestion?` | string | Optional fix recommendation. |
+ | `reviewerConfidence` | `number \| null` | 0–100 from the reviewer; `null` when emitted via deterministic fallback. |
+ | `evidenceGrounded` | boolean | True when `evidence` is a verbatim substring of worker output. |
+
+ ### Verdict states (`qualityReviewVerdict`)
+
+ - `'annotated'`: every finding is structured. May be reviewer-emitted (with
+   numeric `reviewerConfidence`) or deterministic-fallback (with
+   `reviewerConfidence: null`). The route ALWAYS reaches `'annotated'` unless
+   the reviewer call itself fails transport.
+ - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled`).
+ - `'error'`: only when the reviewer call fails transport (network / 5xx).
+
+ ### Recommended rendering by the main agent
+
+ 1. Show ALL findings; never silently drop. Confidence and grounding are
+    soft signals, not gates.
+ 2. Default sort: severity (critical → low) then `reviewerConfidence` desc
+    (nulls last).
+ 3. `severity` is the reviewer's authoritative final value; use it directly.
+ 4. Mark findings with `evidenceGrounded: false` or
+    `reviewerConfidence < 70` as "lower-trust" (collapsed section, lighter
+    color, or `(low confidence)` annotation). User decides what to do.
+ 5. Severity-tier counts feed the dashboard via V3 `findingsBySeverity`.
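
Reviewer note: rendering rules 2 and 4 in the added section could be sketched roughly as follows. The names `SEVERITY_RANK`, `sortFindings`, and `isLowerTrust` are hypothetical and do not come from the package; only the field names are taken from the findings table. The sketch also assumes that a `null` confidence does not by itself mark a finding as lower-trust, which the diff does not specify.

```javascript
// Hypothetical helpers; only the finding field names come from the diff above.
const SEVERITY_RANK = { critical: 0, high: 1, medium: 2, low: 3 };

// Rule 2: sort by severity (critical → low), then reviewerConfidence descending, nulls last.
function sortFindings(findings) {
  return [...findings].sort((a, b) => {
    const bySeverity = SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity];
    if (bySeverity !== 0) return bySeverity;
    if (a.reviewerConfidence === null && b.reviewerConfidence === null) return 0;
    if (a.reviewerConfidence === null) return 1;
    if (b.reviewerConfidence === null) return -1;
    return b.reviewerConfidence - a.reviewerConfidence;
  });
}

// Rule 4: ungrounded evidence or a numeric confidence below 70 marks a finding as lower-trust.
// Assumption: a null confidence alone does not trigger the mark.
function isLowerTrust(finding) {
  return (
    finding.evidenceGrounded === false ||
    (finding.reviewerConfidence !== null && finding.reviewerConfidence < 70)
  );
}
```

The explicit `null` check matters in JavaScript because `null < 70` evaluates to `true`, so a bare comparison would flag every deterministic-fallback finding. Nothing is filtered out here, in keeping with rule 1.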
 
  ## Best practices
 
@@ -12,7 +12,7 @@ when_to_use: >-
  `proposedInterpretation` is a hard gate: the batch is paused, not
  informational. The batch will not complete until the caller responds. Treating
  it as advisory is the clarification-as-info anti-pattern (AP5).
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-clarifications
@@ -12,7 +12,7 @@ when_to_use: >-
  Register once here, then pass the ID via `contextBlockIds` on mma-delegate /
  mma-execute-plan / mma-audit / mma-review / mma-verify / mma-debug /
  mma-investigate. Cheaper and faster than inlining the same content N times.
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-context-blocks
@@ -10,7 +10,7 @@ when_to_use: >-
  read files, reproduce, trace, OR a methodology skill
  (superpowers:systematic-debugging) points at the investigation step. Delegate
  the read/reproduce/trace; the main agent stays on the hypothesis and the fix.
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-debug
@@ -78,26 +78,42 @@ BATCH_ID=$(echo "$BATCH" | jq -r '.batchId')
 
  @include _shared/response-shape.md

- ## Reading the review verdicts (annotation model, 3.8.1+)
-
- The terminal envelope includes:
- - `specReviewVerdict: 'not_applicable'`: read-only routes have no spec review stage.
- - `qualityReviewVerdict`: outcome of the single annotation pass.
- - `roundsUsed`: `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
-
- There is no rework loop. The reviewer annotates each finding in place and exits; it never gates and never causes the worker to re-run.
-
- Action per `qualityReviewVerdict`:
- - `'annotated'`: every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
- - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_DEBUG=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
- - `'error'`: reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
-
- ### Per-finding reviewer fields
-
- Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
-
- - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
- - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
+ ## Reading the findings (3.10.5+)
+
+ The terminal envelope's `results[N].annotatedFindings` is a list of structured
+ findings the reviewer extracted and scored from the implementer's narrative.
+ Every finding has the same shape:
+
+ | Field | Type | Notes |
+ |---|---|---|
+ | `id` | string | Reviewer-assigned, e.g. `F1`, `F2`. |
+ | `severity` | `'critical' \| 'high' \| 'medium' \| 'low'` | 4-tier. |
+ | `claim` | string | One-sentence summary. |
+ | `evidence` | string ≥20 chars | Quoted from worker output when grounded. |
+ | `suggestion?` | string | Optional fix recommendation. |
+ | `reviewerConfidence` | `number \| null` | 0–100 from the reviewer; `null` when emitted via deterministic fallback. |
+ | `evidenceGrounded` | boolean | True when `evidence` is a verbatim substring of worker output. |
+
+ ### Verdict states (`qualityReviewVerdict`)
+
+ - `'annotated'`: every finding is structured. May be reviewer-emitted (with
+   numeric `reviewerConfidence`) or deterministic-fallback (with
+   `reviewerConfidence: null`). The route ALWAYS reaches `'annotated'` unless
+   the reviewer call itself fails transport.
+ - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled`).
+ - `'error'`: only when the reviewer call fails transport (network / 5xx).
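
Reviewer note: a caller might branch on the three verdict states along these lines. This is a sketch under assumptions: `describeVerdict` and `splitByOrigin` are hypothetical names, and the returned messages merely paraphrase the bullets above rather than quoting anything the package emits.

```javascript
// Hypothetical helper: maps a qualityReviewVerdict value to a short note for the reader.
function describeVerdict(verdict) {
  switch (verdict) {
    case 'annotated':
      return 'Findings are structured; a null reviewerConfidence means deterministic fallback.';
    case 'skipped':
      return 'Reviewer bypassed by the kill switch (MMAGENT_READ_ONLY_REVIEW=disabled).';
    case 'error':
      return 'Reviewer call failed at the transport level (network / 5xx).';
    default:
      // Defensive branch for values this diff does not list.
      return `Unrecognized verdict: ${String(verdict)}`;
  }
}

// Hypothetical helper: separates reviewer-scored findings from deterministic-fallback ones.
function splitByOrigin(findings) {
  return {
    reviewerScored: findings.filter((f) => f.reviewerConfidence !== null),
    fallback: findings.filter((f) => f.reviewerConfidence === null),
  };
}
```

Because `'annotated'` now covers both reviewer-emitted and fallback findings, the verdict alone no longer says which path produced a finding; the per-finding `reviewerConfidence` does.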
+
+ ### Recommended rendering by the main agent
+
+ 1. Show ALL findings; never silently drop. Confidence and grounding are
+    soft signals, not gates.
+ 2. Default sort: severity (critical → low) then `reviewerConfidence` desc
+    (nulls last).
+ 3. `severity` is the reviewer's authoritative final value; use it directly.
+ 4. Mark findings with `evidenceGrounded: false` or
+    `reviewerConfidence < 70` as "lower-trust" (collapsed section, lighter
+    color, or `(low confidence)` annotation). User decides what to do.
+ 5. Severity-tier counts feed the dashboard via V3 `findingsBySeverity`.
 
  ## Best practices
 
@@ -11,7 +11,7 @@ when_to_use: >-
  and keep main context free. If a plan file exists → use mma-execute-plan. If
  the task is audit / review / verify / debug / investigate → use the matching
  specialized skill.
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-delegate
@@ -10,7 +10,7 @@ when_to_use: >-
  superpowers:subagent-driven-development / superpowers:executing-plans;
  workers are cheaper and don't pollute main context. Task descriptors must
  match plan headings verbatim.
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-execute-plan
@@ -12,7 +12,7 @@ when_to_use: >-
  git-history queries. OR you are about to read 3+ files / run any grep in main
  context: that's the inline-labor-leakage anti-pattern (AP2); delegate to this
  skill instead.
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-investigate
@@ -124,26 +124,42 @@ Each task carries an `investigation` field on its per-task report:
 
  `workerStatus` is one of `done`, `done_with_concerns`, `needs_context`, `blocked`. When `done_with_concerns`, the per-task report carries `incompleteReason` (`turn_cap`, `cost_cap`, `timeout`, or `missing_sections`). When `needs_context`, the worker flagged a `[needs_context]` bullet under `## Unresolved`; re-dispatch with extra context (anchor paths, a context block, or a clarification turn).

- ## Reading the review verdicts (annotation model, 3.8.1+)
-
- The terminal envelope includes:
- - `specReviewVerdict: 'not_applicable'`: read-only routes have no spec review stage.
- - `qualityReviewVerdict`: outcome of the single annotation pass.
- - `roundsUsed`: `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
-
- There is no rework loop. The reviewer annotates each finding in place and exits; it never gates and never causes the worker to re-run.
-
- Action per `qualityReviewVerdict`:
- - `'annotated'`: every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
- - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_INVESTIGATE=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
- - `'error'`: reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
-
- ### Per-finding reviewer fields
-
- Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
-
- - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
- - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
+ ## Reading the findings (3.10.5+)
+
+ The terminal envelope's `results[N].annotatedFindings` is a list of structured
+ findings the reviewer extracted and scored from the implementer's narrative.
+ Every finding has the same shape:
+
+ | Field | Type | Notes |
+ |---|---|---|
+ | `id` | string | Reviewer-assigned, e.g. `F1`, `F2`. |
+ | `severity` | `'critical' \| 'high' \| 'medium' \| 'low'` | 4-tier. |
+ | `claim` | string | One-sentence summary. |
+ | `evidence` | string ≥20 chars | Quoted from worker output when grounded. |
+ | `suggestion?` | string | Optional fix recommendation. |
+ | `reviewerConfidence` | `number \| null` | 0–100 from the reviewer; `null` when emitted via deterministic fallback. |
+ | `evidenceGrounded` | boolean | True when `evidence` is a verbatim substring of worker output. |
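
Reviewer note: the table above implies a set of per-field constraints that a consumer could check before rendering. The sketch below is one way to do that; `SEVERITIES`, `findingShapeProblems`, and `isEvidenceGrounded` are hypothetical names, the rules are read off the table rather than taken from the package's own schema, and "≥20 chars" is assumed to mean `String.prototype.length`.

```javascript
// Hypothetical checker; the rules are read off the findings table, not the package's schema.
const SEVERITIES = ['critical', 'high', 'medium', 'low'];

// Returns a list of human-readable problems; an empty list means the shape matches the table.
function findingShapeProblems(f) {
  const problems = [];
  if (typeof f.id !== 'string') problems.push('id must be a string');
  if (!SEVERITIES.includes(f.severity)) problems.push('severity must be one of the four tiers');
  if (typeof f.claim !== 'string') problems.push('claim must be a string');
  // Assumption: "≥20 chars" is measured with String.prototype.length.
  if (typeof f.evidence !== 'string' || f.evidence.length < 20) {
    problems.push('evidence must be a string of at least 20 chars');
  }
  if (f.suggestion !== undefined && typeof f.suggestion !== 'string') {
    problems.push('suggestion, when present, must be a string');
  }
  const c = f.reviewerConfidence;
  const validConfidence = c === null || (typeof c === 'number' && c >= 0 && c <= 100);
  if (!validConfidence) problems.push('reviewerConfidence must be null or a number in 0..100');
  if (typeof f.evidenceGrounded !== 'boolean') problems.push('evidenceGrounded must be a boolean');
  return problems;
}

// Groundedness as the table describes it: evidence is a verbatim substring of the worker output.
function isEvidenceGrounded(evidence, workerOutput) {
  return workerOutput.includes(evidence);
}
```

In line with the rendering guidance in this diff, a failed check is better treated as a display hint than as a reason to drop the finding.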
+
+ ### Verdict states (`qualityReviewVerdict`)
+
+ - `'annotated'`: every finding is structured. May be reviewer-emitted (with
+   numeric `reviewerConfidence`) or deterministic-fallback (with
+   `reviewerConfidence: null`). The route ALWAYS reaches `'annotated'` unless
+   the reviewer call itself fails transport.
+ - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled`).
+ - `'error'`: only when the reviewer call fails transport (network / 5xx).
+
+ ### Recommended rendering by the main agent
+
+ 1. Show ALL findings; never silently drop. Confidence and grounding are
+    soft signals, not gates.
+ 2. Default sort: severity (critical → low) then `reviewerConfidence` desc
+    (nulls last).
+ 3. `severity` is the reviewer's authoritative final value; use it directly.
+ 4. Mark findings with `evidenceGrounded: false` or
+    `reviewerConfidence < 70` as "lower-trust" (collapsed section, lighter
+    color, or `(low confidence)` annotation). User decides what to do.
+ 5. Severity-tier counts feed the dashboard via V3 `findingsBySeverity`.
 
  ## Best practices
 
@@ -10,7 +10,7 @@ when_to_use: >-
  you want to re-try the failed indices only. Prefer this over re-dispatching
  the whole batch or inline-retrying; it's idempotent and preserves the
  original batch's diagnostics.
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-retry
@@ -10,7 +10,7 @@ when_to_use: >-
  AND mmagent is running. Delegate so each file reviews on its own worker; the
  main agent only decides what to merge. Review on SOURCE CODE; use mma-audit
  for prose specs / configs.
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-review
@@ -75,26 +75,42 @@ BATCH_ID=$(echo "$BATCH" | jq -r '.batchId')
 
  @include _shared/response-shape.md

- ## Reading the review verdicts (annotation model, 3.8.1+)
-
- The terminal envelope includes:
- - `specReviewVerdict: 'not_applicable'`: read-only routes have no spec review stage.
- - `qualityReviewVerdict`: outcome of the single annotation pass.
- - `roundsUsed`: `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
-
- There is no rework loop. The reviewer annotates each finding in place and exits; it never gates and never causes the worker to re-run.
-
- Action per `qualityReviewVerdict`:
- - `'annotated'`: every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
- - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_REVIEW=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
- - `'error'`: reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
-
- ### Per-finding reviewer fields
-
- Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
-
- - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
- - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
+ ## Reading the findings (3.10.5+)
+
+ The terminal envelope's `results[N].annotatedFindings` is a list of structured
+ findings the reviewer extracted and scored from the implementer's narrative.
+ Every finding has the same shape:
+
+ | Field | Type | Notes |
+ |---|---|---|
+ | `id` | string | Reviewer-assigned, e.g. `F1`, `F2`. |
+ | `severity` | `'critical' \| 'high' \| 'medium' \| 'low'` | 4-tier. |
+ | `claim` | string | One-sentence summary. |
+ | `evidence` | string ≥20 chars | Quoted from worker output when grounded. |
+ | `suggestion?` | string | Optional fix recommendation. |
+ | `reviewerConfidence` | `number \| null` | 0–100 from the reviewer; `null` when emitted via deterministic fallback. |
+ | `evidenceGrounded` | boolean | True when `evidence` is a verbatim substring of worker output. |
+
+ ### Verdict states (`qualityReviewVerdict`)
+
+ - `'annotated'`: every finding is structured. May be reviewer-emitted (with
+   numeric `reviewerConfidence`) or deterministic-fallback (with
+   `reviewerConfidence: null`). The route ALWAYS reaches `'annotated'` unless
+   the reviewer call itself fails transport.
+ - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled`).
+ - `'error'`: only when the reviewer call fails transport (network / 5xx).
+
+ ### Recommended rendering by the main agent
+
+ 1. Show ALL findings; never silently drop. Confidence and grounding are
+    soft signals, not gates.
+ 2. Default sort: severity (critical → low) then `reviewerConfidence` desc
+    (nulls last).
+ 3. `severity` is the reviewer's authoritative final value; use it directly.
+ 4. Mark findings with `evidenceGrounded: false` or
+    `reviewerConfidence < 70` as "lower-trust" (collapsed section, lighter
+    color, or `(low confidence)` annotation). User decides what to do.
+ 5. Severity-tier counts feed the dashboard via V3 `findingsBySeverity`.
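
Reviewer note: rule 5 mentions severity-tier counts. The diff does not show the actual shape of the V3 `findingsBySeverity` field, so the sketch below simply assumes a plain object keyed by the four tiers; `countBySeverity` is a hypothetical name.

```javascript
// Hypothetical helper. Assumption: the counts are a plain object keyed by the four tiers;
// the real shape of the V3 `findingsBySeverity` field is not shown in this diff.
function countBySeverity(findings) {
  const counts = { critical: 0, high: 0, medium: 0, low: 0 };
  for (const finding of findings) {
    // Own-property check so unexpected values such as "toString" are ignored.
    if (Object.prototype.hasOwnProperty.call(counts, finding.severity)) {
      counts[finding.severity] += 1;
    }
  }
  return counts;
}
```

Every tier is initialized to zero so that a dashboard consuming the result would see all four keys even for an empty findings list.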
 
  ## Best practices
 
@@ -10,7 +10,7 @@ when_to_use: >-
  against implemented work BEFORE claiming success. Delegate so each checklist
  item gets independent evidence-gathering on a worker. Use this BEFORE saying
  "done", never after.
- version: 3.10.3
+ version: 3.10.5
  ---

  # mma-verify
@@ -76,26 +76,42 @@ BATCH_ID=$(echo "$BATCH" | jq -r '.batchId')
 
  @include _shared/response-shape.md

- ## Reading the review verdicts (annotation model, 3.8.1+)
-
- The terminal envelope includes:
- - `specReviewVerdict: 'not_applicable'`: read-only routes have no spec review stage.
- - `qualityReviewVerdict`: outcome of the single annotation pass.
- - `roundsUsed`: `1` when reviewer ran (annotated or errored), `0` when reviewer was skipped.
-
- There is no rework loop. The reviewer annotates each finding in place and exits; it never gates and never causes the worker to re-run.
-
- Action per `qualityReviewVerdict`:
- - `'annotated'`: every finding in `findings[]` has `reviewerConfidence` (integer 0-100) and possibly `reviewerSeverity`. Sort or filter by confidence; treat low-confidence findings with skepticism.
- - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled` or per-route `MMAGENT_READ_ONLY_REVIEW_VERIFY=disabled`) bypassed the reviewer. Findings carry no reviewer fields; treat as raw worker output.
- - `'error'`: reviewer call or response parsing failed. Findings have no reviewer fields; fall back to caution.
-
- ### Per-finding reviewer fields
-
- Every finding the worker emits has the standard fields (`id`, `severity`, `claim`, `evidence`, `suggestion?`). After a successful annotation pass, two more fields are added:
-
- - `reviewerConfidence` (integer 0-100): how confident the reviewer is that the finding is correct, on-brief, and grounded. Use as a filter (`>=70`) or a sort key for triage.
- - `reviewerSeverity?` (`'high' | 'medium' | 'low'`): only present when the reviewer disagrees with the worker's `severity`. Workers tend to inflate severity; use this to dial down. Trust `reviewerSeverity` over `severity` when present.
+ ## Reading the findings (3.10.5+)
+
+ The terminal envelope's `results[N].annotatedFindings` is a list of structured
+ findings the reviewer extracted and scored from the implementer's narrative.
+ Every finding has the same shape:
+
+ | Field | Type | Notes |
+ |---|---|---|
+ | `id` | string | Reviewer-assigned, e.g. `F1`, `F2`. |
+ | `severity` | `'critical' \| 'high' \| 'medium' \| 'low'` | 4-tier. |
+ | `claim` | string | One-sentence summary. |
+ | `evidence` | string ≥20 chars | Quoted from worker output when grounded. |
+ | `suggestion?` | string | Optional fix recommendation. |
+ | `reviewerConfidence` | `number \| null` | 0–100 from the reviewer; `null` when emitted via deterministic fallback. |
+ | `evidenceGrounded` | boolean | True when `evidence` is a verbatim substring of worker output. |
+
+ ### Verdict states (`qualityReviewVerdict`)
+
+ - `'annotated'`: every finding is structured. May be reviewer-emitted (with
+   numeric `reviewerConfidence`) or deterministic-fallback (with
+   `reviewerConfidence: null`). The route ALWAYS reaches `'annotated'` unless
+   the reviewer call itself fails transport.
+ - `'skipped'`: kill switch (`MMAGENT_READ_ONLY_REVIEW=disabled`).
+ - `'error'`: only when the reviewer call fails transport (network / 5xx).
+
+ ### Recommended rendering by the main agent
+
+ 1. Show ALL findings; never silently drop. Confidence and grounding are
+    soft signals, not gates.
+ 2. Default sort: severity (critical → low) then `reviewerConfidence` desc
+    (nulls last).
+ 3. `severity` is the reviewer's authoritative final value; use it directly.
+ 4. Mark findings with `evidenceGrounded: false` or
+    `reviewerConfidence < 70` as "lower-trust" (collapsed section, lighter
+    color, or `(low confidence)` annotation). User decides what to do.
+ 5. Severity-tier counts feed the dashboard via V3 `findingsBySeverity`.
 
  ## Best practices
 
@@ -11,7 +11,7 @@ when_to_use: >-
  tasks, AND mmagent is running. Read this once, pick the matching mma-* skill,
  and delegate there. Applies equally whether the user invoked a superpowers
  methodology skill or asked directly.
- version: 3.10.3
+ version: 3.10.5
  ---

  # multi-model-agent (router)
@@ -67,29 +67,44 @@ function _buildRecorder(opts) {
      if (!d.enabled)
          return;
      const event = buildTaskCompletedEvent(ctx);
-     // Validate against the BASE schema (caps, types, enums); these are
-     // non-negotiable wire-format constraints. Drop only on those failures.
-     // SuperRefine cross-field rules (R1–R11) are checked separately and
-     // logged as warnings, but do NOT drop the event: backend uses
-     // passthrough so degenerate rows can be filtered at query time, and
-     // 3.10.2's drop-on-superRefine behaviour silently lost real telemetry
-     // on 1ms clock-skew (R4) and other measurement-precision edge cases.
+     // Validation is INFORMATIONAL ONLY: never block emit. Backend uses
+     // passthrough so it stores everything; if mma drops events here, that
+     // data is gone forever and the user has no visibility into what was
+     // suppressed. 3.10.2's drop-on-fail design hid real telemetry from
+     // both operator and dashboard.
      const baseParsed = TaskCompletedEventSchema.safeParse(event);
      if (!baseParsed.success) {
-         console.warn('mma-telemetry: dropping schema-invalid event', {
+         console.warn('mma-telemetry: schema warning (event still emitted)', {
              eventId: event.eventId,
              issues: baseParsed.error.issues.map((e) => ({ path: e.path.join('.'), message: e.message })),
          });
-         return;
      }
-     const refined = ValidatedTaskCompletedEventSchema.safeParse(baseParsed.data);
+     const refined = ValidatedTaskCompletedEventSchema.safeParse(event);
      if (!refined.success) {
-         console.warn('mma-telemetry: cross-field validation warning (event still emitted)', {
+         // Surface the actual offending values alongside the rule name so the
+         // operator can tell at a glance whether the cause is config (wrong
+         // values) or code (lifecycle bug). Tag the most informative
+         // top-level fields plus per-stage models for the common R3/R5/R6
+         // cross-field cases.
+         const stageModelsByName = (event.stages ?? []).reduce((acc, s) => {
+             if (s.name && s.model)
+                 acc[s.name] = s.model;
+             return acc;
+         }, {});
+         console.warn('mma-telemetry: cross-field warning (event still emitted)', {
              eventId: event.eventId,
-             issues: refined.error.issues.map((e) => ({ path: e.path.join('.'), message: e.message })),
+             implementerModel: event.implementerModel,
+             stageModels: stageModelsByName,
+             totalDurationMs: event.totalDurationMs,
+             inputTokens: event.inputTokens,
+             outputTokens: event.outputTokens,
+             issues: refined.error.issues.map((e) => ({
+                 rule: e.message,
+                 path: e.path.join('.'),
+             })),
          });
      }
-     enqueue(baseParsed.data);
+     enqueue(event);
  }
  catch {
      dropped++;
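
Reviewer note: the behavioural change in this hunk is that validation no longer gates `enqueue`. A stripped-down sketch of that warn-only pattern follows. `exampleSchema` is a hand-rolled, hypothetical stand-in that only mimics a `safeParse`-style result; it is not the package's `TaskCompletedEventSchema`, and `recordWarnOnly` is likewise an illustrative name.

```javascript
// Hypothetical stand-in for a schema exposing a safeParse-style result; not the package's real schema.
const exampleSchema = {
  safeParse(event) {
    const issues = [];
    if (typeof event.eventId !== 'string') {
      issues.push({ path: ['eventId'], message: 'expected string' });
    }
    if (!(event.totalDurationMs >= 0)) {
      issues.push({ path: ['totalDurationMs'], message: 'expected non-negative number' });
    }
    return issues.length === 0
      ? { success: true, data: event }
      : { success: false, error: { issues } };
  },
};

// Warn-only recording: validation failures are logged, and the event is enqueued regardless.
function recordWarnOnly(event, schema, enqueue, warn = console.warn) {
  const parsed = schema.safeParse(event);
  if (!parsed.success) {
    warn('schema warning (event still emitted)', {
      eventId: event.eventId,
      issues: parsed.error.issues.map((e) => ({ path: e.path.join('.'), message: e.message })),
    });
  }
  // The original event is enqueued, not parsed.data, mirroring the change in the hunk above.
  enqueue(event);
}
```

One consequence worth checking during review: because the original `event` is enqueued instead of the parsed output, any stripping or transformation the schema performs during parsing no longer affects what is sent.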
@@ -1 +1 @@
- {"version":3,"file":"recorder.js","sourceRoot":"","sources":["../../src/telemetry/recorder.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,UAAU,EAAE,UAAU,EAAE,MAAM,SAAS,CAAC;AACjD,OAAO,EAAE,IAAI,EAAE,MAAM,WAAW,CAAC;AACjC,OAAO,EAAE,MAAM,EAAE,MAAM,cAAc,CAAC;AACtC,OAAO,EAAE,mBAAmB,EAAE,MAAM,eAAe,CAAC;AACpD,OAAO,EAAE,eAAe,EAAE,MAAM,iBAAiB,CAAC;AAClD,OAAO,EAAE,gBAAgB,EAAE,MAAM,mBAAmB,CAAC;AACrD,OAAO,EAAE,KAAK,EAAE,MAAM,YAAY,CAAC;AACnC,OAAO,EAAE,cAAc,EAAE,cAAc,EAAE,MAAM,iBAAiB,CAAC;AACjE,OAAO,EAAE,cAAc,EAAE,wBAAwB,EAAE,iCAAiC,EAAE,MAAM,mDAAmD,CAAC;AAChJ,OAAO,EACL,uBAAuB,GAExB,MAAM,2DAA2D,CAAC;AASnE,IAAI,SAAS,GAAoB,IAAI,CAAC;AAEtC,MAAM,UAAU,WAAW;IACzB,IAAI,CAAC,SAAS,EAAE,CAAC;QACf,MAAM,IAAI,KAAK,CAAC,sDAAsD,CAAC,CAAC;IAC1E,CAAC;IACD,OAAO,SAAS,CAAC;AACnB,CAAC;AAED,MAAM,UAAU,kBAAkB,CAAC,CAAW;IAC5C,SAAS,GAAG,CAAC,CAAC;AAChB,CAAC;AAED,MAAM,UAAU,cAAc,CAAC,IAAiD;IAC9E,MAAM,QAAQ,GAAG,cAAc,CAAC,IAAI,CAAC,CAAC;IACtC,SAAS,GAAG,QAAQ,CAAC;IACrB,OAAO,QAAQ,CAAC;AAClB,CAAC;AAED,SAAS,cAAc,CAAC,IAAiD;IACvE,MAAM,EAAE,OAAO,EAAE,cAAc,EAAE,GAAG,IAAI,CAAC;IACzC,MAAM,KAAK,GAAG,IAAI,KAAK,CAAC,OAAO,CAAC,CAAC;IACjC,MAAM,UAAU,GAAG,IAAI,eAAe,EAAE,CAAC;IACzC,IAAI,UAAU,GAAkB,IAAI,CAAC;IACrC,IAAI,OAAO,GAAG,CAAC,CAAC;IAEhB,MAAM,gBAAgB,GAAG,GAAW,EAAE;QACpC,IAAI,CAAC,UAAU,EAAE,CAAC;YAChB,UAAU,GAAG,mBAAmB,CAAC,OAAO,CAAC,CAAC,SAAS,CAAC;QACtD,CAAC;QACD,OAAO,UAAU,CAAC;IACpB,CAAC,CAAC;IAEF,MAAM,OAAO,GAAG,CAAC,KAA8B,EAAQ,EAAE;QACvD,IAAI,CAAC;YACH,MAAM,EAAE,GAAG,gBAAgB,EAAE,CAAC;YAC9B,MAAM,IAAI,GAAG,gBAAgB,CAAC,EAAE,SAAS,EAAE,EAAE,EAAE,cAAc,EAAE,CAAC,CAAC;YACjE,MAAM,GAAG,GAAG,cAAc,CAAC,OAAO,CAAC,CAAC;YAEpC,KAAK,CAAC,MAAM,CAAC;gBACX,aAAa,EAAE,cAAc;gBAC7B,SAAS,EAAE,IAAI,CAAC,SAAS;gBACzB,cAAc,EAAE,IAAI,CAAC,cAAc;gBACnC,EAAE,EAAE,IAAI,CAAC,EAAE;gBACX,SAAS,EAAE,IAAI,CAAC,SAAS;gBACzB,UAAU,EAAE,GAAG;gBACf,MAAM,EAAE,CAAC,KAAK,CAAC;aAChB,CAAC,CAAC,KAAK,CAAC,GAAG,EAAE;gBACZ,OAAO,EAAE,CAAC;YACZ,CAAC,CAAC,CAAC;QACL,CAAC;QAAC,MAAM,CAAC;YACP,OAAO,EAAE,CAAC;QACZ,CAAC;IACH,CAAC,CAAC;IAEF,OAAO;QACL,IAAI,MAAM;YACR,OAAO,UAAU,CAAC,MAAM
,CAAC;QAC3B,CAAC;QAED,OAAO;QAEP,mBAAmB,CAAC,GAAG;YACrB,IAAI,CAAC;gBACH,MAAM,CAAC,GAAG,MAAM,CAAC,OAAO,CAAC,CAAC;gBAC1B,IAAI,CAAC,CAAC,CAAC,OAAO;oBAAE,OAAO;gBACvB,MAAM,KAAK,GAAG,uBAAuB,CAAC,GAAG,CAAC,CAAC;gBAC3C,oEAAoE;gBACpE,uEAAuE;gBACvE,oEAAoE;gBACpE,8DAA8D;gBAC9D,oEAAoE;gBACpE,sEAAsE;gBACtE,qEAAqE;gBACrE,MAAM,UAAU,GAAG,wBAAwB,CAAC,SAAS,CAAC,KAAK,CAAC,CAAC;gBAC7D,IAAI,CAAC,UAAU,CAAC,OAAO,EAAE,CAAC;oBACxB,OAAO,CAAC,IAAI,CAAC,8CAA8C,EAAE;wBAC3D,OAAO,EAAE,KAAK,CAAC,OAAO;wBACtB,MAAM,EAAE,UAAU,CAAC,KAAK,CAAC,MAAM,CAAC,GAAG,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,EAAE,IAAI,EAAE,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,GAAG,CAAC,EAAE,OAAO,EAAE,CAAC,CAAC,OAAO,EAAE,CAAC,CAAC;qBAC7F,CAAC,CAAC;oBACH,OAAO;gBACT,CAAC;gBACD,MAAM,OAAO,GAAG,iCAAiC,CAAC,SAAS,CAAC,UAAU,CAAC,IAAI,CAAC,CAAC;gBAC7E,IAAI,CAAC,OAAO,CAAC,OAAO,EAAE,CAAC;oBACrB,OAAO,CAAC,IAAI,CAAC,qEAAqE,EAAE;wBAClF,OAAO,EAAE,KAAK,CAAC,OAAO;wBACtB,MAAM,EAAE,OAAO,CAAC,KAAK,CAAC,MAAM,CAAC,GAAG,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,EAAE,IAAI,EAAE,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,GAAG,CAAC,EAAE,OAAO,EAAE,CAAC,CAAC,OAAO,EAAE,CAAC,CAAC;qBAC1F,CAAC,CAAC;gBACL,CAAC;gBACD,OAAO,CAAC,UAAU,CAAC,IAA+B,CAAC,CAAC;YACtD,CAAC;YAAC,MAAM,CAAC;gBACP,OAAO,EAAE,CAAC;YACZ,CAAC;QACH,CAAC;QAED,KAAK,CAAC,cAAc,CAAC,OAAO;YAC1B,MAAM,cAAc,CAAC,OAAO,CAAC,CAAC;YAC9B,UAAU,CAAC,KAAK,EAAE,CAAC;YACnB,MAAM,SAAS,GAAG,IAAI,CAAC,OAAO,EAAE,wBAAwB,CAAC,CAAC;YAC1D,IAAI,UAAU,CAAC,SAAS,CAAC;gBAAE,UAAU,CAAC,SAAS,CAAC,CAAC;YACjD,UAAU,GAAG,IAAI,CAAC;YAClB,IAAI,OAAO,EAAE,eAAe,EAAE,CAAC;gBAC7B,eAAe,CAAC,OAAO,CAAC,CAAC;YAC3B,CAAC;QACH,CAAC;KACF,CAAC;AACJ,CAAC"}
+ {"version":3,"file":"recorder.js","sourceRoot":"","sources":["../../src/telemetry/recorder.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,UAAU,EAAE,UAAU,EAAE,MAAM,SAAS,CAAC;AACjD,OAAO,EAAE,IAAI,EAAE,MAAM,WAAW,CAAC;AACjC,OAAO,EAAE,MAAM,EAAE,MAAM,cAAc,CAAC;AACtC,OAAO,EAAE,mBAAmB,EAAE,MAAM,eAAe,CAAC;AACpD,OAAO,EAAE,eAAe,EAAE,MAAM,iBAAiB,CAAC;AAClD,OAAO,EAAE,gBAAgB,EAAE,MAAM,mBAAmB,CAAC;AACrD,OAAO,EAAE,KAAK,EAAE,MAAM,YAAY,CAAC;AACnC,OAAO,EAAE,cAAc,EAAE,cAAc,EAAE,MAAM,iBAAiB,CAAC;AACjE,OAAO,EAAE,cAAc,EAAE,wBAAwB,EAAE,iCAAiC,EAAE,MAAM,mDAAmD,CAAC;AAChJ,OAAO,EACL,uBAAuB,GAExB,MAAM,2DAA2D,CAAC;AASnE,IAAI,SAAS,GAAoB,IAAI,CAAC;AAEtC,MAAM,UAAU,WAAW;IACzB,IAAI,CAAC,SAAS,EAAE,CAAC;QACf,MAAM,IAAI,KAAK,CAAC,sDAAsD,CAAC,CAAC;IAC1E,CAAC;IACD,OAAO,SAAS,CAAC;AACnB,CAAC;AAED,MAAM,UAAU,kBAAkB,CAAC,CAAW;IAC5C,SAAS,GAAG,CAAC,CAAC;AAChB,CAAC;AAED,MAAM,UAAU,cAAc,CAAC,IAAiD;IAC9E,MAAM,QAAQ,GAAG,cAAc,CAAC,IAAI,CAAC,CAAC;IACtC,SAAS,GAAG,QAAQ,CAAC;IACrB,OAAO,QAAQ,CAAC;AAClB,CAAC;AAED,SAAS,cAAc,CAAC,IAAiD;IACvE,MAAM,EAAE,OAAO,EAAE,cAAc,EAAE,GAAG,IAAI,CAAC;IACzC,MAAM,KAAK,GAAG,IAAI,KAAK,CAAC,OAAO,CAAC,CAAC;IACjC,MAAM,UAAU,GAAG,IAAI,eAAe,EAAE,CAAC;IACzC,IAAI,UAAU,GAAkB,IAAI,CAAC;IACrC,IAAI,OAAO,GAAG,CAAC,CAAC;IAEhB,MAAM,gBAAgB,GAAG,GAAW,EAAE;QACpC,IAAI,CAAC,UAAU,EAAE,CAAC;YAChB,UAAU,GAAG,mBAAmB,CAAC,OAAO,CAAC,CAAC,SAAS,CAAC;QACtD,CAAC;QACD,OAAO,UAAU,CAAC;IACpB,CAAC,CAAC;IAEF,MAAM,OAAO,GAAG,CAAC,KAA8B,EAAQ,EAAE;QACvD,IAAI,CAAC;YACH,MAAM,EAAE,GAAG,gBAAgB,EAAE,CAAC;YAC9B,MAAM,IAAI,GAAG,gBAAgB,CAAC,EAAE,SAAS,EAAE,EAAE,EAAE,cAAc,EAAE,CAAC,CAAC;YACjE,MAAM,GAAG,GAAG,cAAc,CAAC,OAAO,CAAC,CAAC;YAEpC,KAAK,CAAC,MAAM,CAAC;gBACX,aAAa,EAAE,cAAc;gBAC7B,SAAS,EAAE,IAAI,CAAC,SAAS;gBACzB,cAAc,EAAE,IAAI,CAAC,cAAc;gBACnC,EAAE,EAAE,IAAI,CAAC,EAAE;gBACX,SAAS,EAAE,IAAI,CAAC,SAAS;gBACzB,UAAU,EAAE,GAAG;gBACf,MAAM,EAAE,CAAC,KAAK,CAAC;aAChB,CAAC,CAAC,KAAK,CAAC,GAAG,EAAE;gBACZ,OAAO,EAAE,CAAC;YACZ,CAAC,CAAC,CAAC;QACL,CAAC;QAAC,MAAM,CAAC;YACP,OAAO,EAAE,CAAC;QACZ,CAAC;IACH,CAAC,CAAC;IAEF,OAAO;QACL,IAAI,MAAM;YACR,OAAO,UAAU,CAAC,MAAM
,CAAC;QAC3B,CAAC;QAED,OAAO;QAEP,mBAAmB,CAAC,GAAG;YACrB,IAAI,CAAC;gBACH,MAAM,CAAC,GAAG,MAAM,CAAC,OAAO,CAAC,CAAC;gBAC1B,IAAI,CAAC,CAAC,CAAC,OAAO;oBAAE,OAAO;gBACvB,MAAM,KAAK,GAAG,uBAAuB,CAAC,GAAG,CAAC,CAAC;gBAC3C,oEAAoE;gBACpE,sEAAsE;gBACtE,oEAAoE;gBACpE,mEAAmE;gBACnE,+BAA+B;gBAC/B,MAAM,UAAU,GAAG,wBAAwB,CAAC,SAAS,CAAC,KAAK,CAAC,CAAC;gBAC7D,IAAI,CAAC,UAAU,CAAC,OAAO,EAAE,CAAC;oBACxB,OAAO,CAAC,IAAI,CAAC,qDAAqD,EAAE;wBAClE,OAAO,EAAE,KAAK,CAAC,OAAO;wBACtB,MAAM,EAAE,UAAU,CAAC,KAAK,CAAC,MAAM,CAAC,GAAG,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,EAAE,IAAI,EAAE,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,GAAG,CAAC,EAAE,OAAO,EAAE,CAAC,CAAC,OAAO,EAAE,CAAC,CAAC;qBAC7F,CAAC,CAAC;gBACL,CAAC;gBACD,MAAM,OAAO,GAAG,iCAAiC,CAAC,SAAS,CAAC,KAAK,CAAC,CAAC;gBACnE,IAAI,CAAC,OAAO,CAAC,OAAO,EAAE,CAAC;oBACrB,qEAAqE;oBACrE,mEAAmE;oBACnE,4DAA4D;oBAC5D,iEAAiE;oBACjE,qBAAqB;oBACrB,MAAM,iBAAiB,GAAG,CAAC,KAAK,CAAC,MAAM,IAAI,EAAE,CAAC,CAAC,MAAM,CACnD,CAAC,GAA2B,EAAE,CAAmC,EAAE,EAAE;wBACnE,IAAI,CAAC,CAAC,IAAI,IAAI,CAAC,CAAC,KAAK;4BAAE,GAAG,CAAC,CAAC,CAAC,IAAI,CAAC,GAAG,CAAC,CAAC,KAAK,CAAC;wBAC7C,OAAO,GAAG,CAAC;oBACb,CAAC,EACD,EAAE,CACH,CAAC;oBACF,OAAO,CAAC,IAAI,CAAC,0DAA0D,EAAE;wBACvE,OAAO,EAAE,KAAK,CAAC,OAAO;wBACtB,gBAAgB,EAAE,KAAK,CAAC,gBAAgB;wBACxC,WAAW,EAAE,iBAAiB;wBAC9B,eAAe,EAAE,KAAK,CAAC,eAAe;wBACtC,WAAW,EAAE,KAAK,CAAC,WAAW;wBAC9B,YAAY,EAAE,KAAK,CAAC,YAAY;wBAChC,MAAM,EAAE,OAAO,CAAC,KAAK,CAAC,MAAM,CAAC,GAAG,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC;4BACvC,IAAI,EAAE,CAAC,CAAC,OAAO;4BACf,IAAI,EAAE,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,GAAG,CAAC;yBACvB,CAAC,CAAC;qBACJ,CAAC,CAAC;gBACL,CAAC;gBACD,OAAO,CAAC,KAA2C,CAAC,CAAC;YACvD,CAAC;YAAC,MAAM,CAAC;gBACP,OAAO,EAAE,CAAC;YACZ,CAAC;QACH,CAAC;QAED,KAAK,CAAC,cAAc,CAAC,OAAO;YAC1B,MAAM,cAAc,CAAC,OAAO,CAAC,CAAC;YAC9B,UAAU,CAAC,KAAK,EAAE,CAAC;YACnB,MAAM,SAAS,GAAG,IAAI,CAAC,OAAO,EAAE,wBAAwB,CAAC,CAAC;YAC1D,IAAI,UAAU,CAAC,SAAS,CAAC;gBAAE,UAAU,CAAC,SAAS,CAAC,CAAC;YACjD,UAAU,GAAG,IAAI,CAAC;YAClB,IAAI,OAAO,EAAE,eAAe,EAAE,CAAC;gBAC7B,eAAe,CAAC,OAAO,CAAC,CAAC;YAC3B,CAAC;QACH,CAAC;KACF,CAAC;AACJ,CAAC"}
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@zhixuan92/multi-model-agent",
- "version": "3.10.3",
+ "version": "3.10.5",
  "type": "module",
  "license": "MIT",
  "description": "Standalone HTTP server for multi-model-agent. Routes tool-invocation work to Claude, Codex, or OpenAI-compatible sub-agents with async-polling REST dispatch and installable skills for Claude Code, Gemini CLI, Codex CLI, and Cursor.",
@@ -52,7 +52,7 @@
  },
  "dependencies": {
  "@asteasolutions/zod-to-openapi": "^8.5.0",
- "@zhixuan92/multi-model-agent-core": "^3.10.3",
+ "@zhixuan92/multi-model-agent-core": "^3.10.5",
  "gray-matter": "^4.0.3",
  "minimist": "^1.2.8",
  "proper-lockfile": "^4.1.2",