moa-cli 0.2.1__tar.gz → 0.3.0__tar.gz

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.3
2
2
  Name: moa-cli
3
- Version: 0.2.1
3
+ Version: 0.3.0
4
4
  Summary: Ask one question to multiple local AI coding CLIs in parallel and collect their answers.
5
5
  Keywords: llm,agents,cli,claude,codex,agy,opencode,peer-review
6
6
  Author: Paul-Louis Pröve
@@ -19,7 +19,7 @@ Description-Content-Type: text/markdown
19
19
 
20
20
  # MOA - Mixture of Agents
21
21
 
22
- Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds before a neutral judge gives the verdict.
22
+ Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds while a moderator checks for convergence and writes the verdict.
23
23
 
24
24
  It's a drop-in, batteries-included replacement for hand-rolling parallel `claude -p` / `codex exec` / `opencode run` calls (or a "peer review" agent skill): one command, clean attributed output, made to be called by a human **or** by another agent.
25
25
 
@@ -36,17 +36,44 @@ Or run it once without installing:
36
36
  uvx --from moa-cli moa ask "Review this plan."
37
37
  ```
38
38
 
39
+ > **Requirements.** MOA drives agent CLIs you install separately - it ships no model
40
+ > or API key of its own. You need at least two of `claude` (Claude Code), `codex`,
41
+ > `agy` (Antigravity), and `opencode` on your `PATH` and logged in. Run **`moa doctor`**
42
+ > first to see which ones MOA can find; with only one installed, the "council" collapses
43
+ > to a single answer.
44
+
39
45
  ## Why
40
46
 
41
47
  A single model gives you one perspective. Asking three frontier models the same question - and seeing where they agree, diverge, or contradict - is a fast, cheap way to pressure-test an answer. MOA makes that a one-liner using the CLIs you already pay for, with no API keys of its own.
42
48
 
49
+ ### Example
50
+
51
+ ```text
52
+ $ moa ask "Is Postgres or SQLite better for a desktop app?"
53
+ Asking claude, codex, agy (timeout 180s, read-only)
54
+
55
+ ──────────────── claude (opus) · OK · 3.2s ────────────────
56
+
57
+ For a single-user desktop app, SQLite is almost always the right call:
58
+ zero-config, serverless, the whole DB is one file you can ship... [trimmed]
59
+
60
+ ─────────────── codex (gpt-5.5) · OK · 4.1s ───────────────
61
+
62
+ Use SQLite unless you expect concurrent writers or need network access.
63
+ For a desktop app neither is likely, so SQLite wins on simplicity... [trimmed]
64
+ ```
65
+
66
+ The selection note goes to stderr; the attributed answers go to stdout. In a terminal
67
+ each answer gets the rule shown above; when piped or read by another agent, the same
68
+ blocks render as plain `## ...` headings. Add `--json` for machine-readable JSONL.
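
The terminal-versus-pipe switch described above could be sketched roughly as follows. This is only an illustration of the behaviour the README describes, not MOA's actual code: the helper name `format_header` and the 60-column width are assumptions.

```python
import sys


def format_header(label, is_tty, width=60):
    """Render an attribution header for one answer block (hypothetical helper)."""
    if is_tty:
        # Centered box-drawing rule, as in the terminal example.
        return f" {label} ".center(width, "─")
    # Plain heading for pipes and agent callers.
    return f"## {label}"


label = "claude (opus) · OK · 3.2s"
print(format_header(label, sys.stdout.isatty()))
```

Passing the TTY check in as an argument keeps the helper easy to exercise in both modes.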
69
+
43
70
  ## Usage
44
71
 
45
72
  MOA has three prompt verbs that share the same selection/output options:
46
73
 
47
74
  - **`moa ask PROMPT`** - council / peer review: N agents answer the same prompt in parallel; every answer is returned with attribution, streamed as it lands.
48
75
  - **`moa distill PROMPT`** - synthesis: run the council, then one strong aggregator merges the answers into a single unified response.
49
- - **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds, then a separate neutral judge writes the final verdict. The costliest mode; read the caveats below before reaching for it.
76
+ - **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds, with a moderator that checks for convergence between rounds and writes the final verdict. The costliest mode; read the caveats below before reaching for it.
50
77
 
51
78
  ```bash
52
79
  moa doctor # show installed CLIs and their default models
@@ -60,12 +87,12 @@ moa ask --json "..." # machine-readable JSONL (for agents
60
87
  git diff | moa ask -f - "Review this diff." # read the prompt from stdin
61
88
  moa distill "Design a rate limiter." # council, then merge into one answer
62
89
  moa distill -s codex "..." # pick who distills (auto | random | provider)
63
- moa debate "Is this race condition real?" # 2 debaters + a judge (default n=3)
90
+ moa debate "Is this race condition real?" # 2 debaters; the first also moderates (default 2 agents)
64
91
  moa debate -r 3 "..." # more rounds (default 2, hard max 4)
65
- moa debate -j claude "..." # pin who judges (must not be a debater)
92
+ moa debate --moderator agy "..." # pin a neutral moderator (a non-debater)
66
93
  ```
67
94
 
68
- The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and `-j/--judge`.
95
+ The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and `--moderator`.
69
96
 
70
97
  ### Read-only by default
71
98
 
@@ -176,17 +203,17 @@ moa config unset num # remove a key
176
203
  moa config unset model claude # remove one [models] entry
177
204
  ```
178
205
 
179
- The synthesizer default is persistable too (e.g. `moa config set synthesizer codex`); `debate`'s `-r/--rounds` and `-j/--judge` are not persisted. CLI `-m` overrides win per-provider over the config `[models]` table.
206
+ The role defaults are persistable too: the distill `synthesizer` and the debate `moderator` (e.g. `moa config set synthesizer codex`, `moa config set moderator agy`). `debate`'s `-r/--rounds` is not persisted. CLI `-m` overrides win per-provider over the config `[models]` table.
180
207
 
181
208
  ### Output
182
209
 
183
- - **stdout** carries only content: each agent's answer is fronted by a centered separator rule naming it (`──── claude (opus) · OK · 3.5s ────`) with blank lines around it for clear separation, flushed the instant that agent finishes. `moa distill` then appends the merged block (`──── synthesis · via claude · OK · ... ────`) once the aggregator finishes.
210
+ - **stdout** carries only content. In a terminal, each agent's answer is fronted by a centered box-drawing rule naming it (`──── claude (opus) · OK · 3.5s ────`) with blank lines for separation, flushed the instant that agent finishes. When stdout is **piped or read by an agent** (not a TTY), the same block renders as a plain, low-noise `## claude (opus) · OK · 3.5s` heading instead - no box-drawing. `moa distill` emits only the final merged block.
184
211
  - **stderr** carries progress and selection notes (`Asking claude, codex ...`), so piping stdout stays clean.
185
- - `--json` emits one JSON object per line (JSONL): a `{"type": "response", ...}` record per agent as it completes; `distill` then adds a `{"type": "synthesis", ...}` record. `debate` instead emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", ...}` record. Ideal when another agent calls MOA and parses the result.
212
+ - `--json` emits one JSON object per line (JSONL): `ask` writes a `{"type": "response", ...}` record per agent as it completes; `distill` writes a single `{"type": "synthesis", ...}` record (only the merged answer); `debate` writes a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", ...}` record. Ideal when another agent calls MOA and parses the result.
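
For a caller that parses the `--json` stream described above, a minimal Python sketch might group records by their `type` field. Only the `type` values and the `round` and `moderator` keys come from the text; the helper name `group_records` and the sample lines are hypothetical, and no other record keys are assumed.

```python
import json
from collections import defaultdict


def group_records(lines):
    """Group JSONL records by their "type" field, skipping blank lines."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        grouped[record.get("type", "unknown")].append(record)
    return dict(grouped)


# Hypothetical sample shaped like the debate records described above.
sample = [
    '{"type": "debate_turn", "round": 1}',
    '{"type": "debate_turn", "round": 2}',
    '{"type": "verdict", "moderator": "agy"}',
]
records = group_records(sample)
print(len(records["debate_turn"]), records["verdict"][0]["moderator"])
```

In practice `lines` could be the stdout of a `moa ... --json` subprocess read line by line, since each record arrives on its own line.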
186
213
 
187
214
  ### `moa distill` (synthesis)
188
215
 
189
- `distill` runs the same council fan-out as `ask`, then one more pass where a strong aggregator merges the collected answers into a single, unified answer. It needs at least two successful proposer answers; with fewer it streams what it has and skips the merge. The aggregator is chosen with `-s/--synthesizer`:
216
+ `distill` runs the same council fan-out as `ask`, then one more pass where a strong aggregator merges the collected answers into a single, unified answer. **It returns only that merged answer** - the individual proposer responses are intermediates and are not printed (each one's arrival is noted on stderr so the wait isn't silent). It needs at least two successful proposer answers; with fewer it skips the merge and says so on stderr. The aggregator is chosen with `-s/--synthesizer`:
190
217
 
191
218
  - `auto` (default) - the highest-priority agent that ran (deterministic)
192
219
  - `random` - pick one of the agents that ran, at random
@@ -194,23 +221,23 @@ The synthesizer default is persistable too (e.g. `moa config set synthesizer cod
194
221
 
195
222
  The aggregator prompt is adapted from the Mixture-of-Agents "Aggregate-and-Synthesize" prompt (Wang et al. 2024): it tells the aggregator to critically evaluate the inputs (some may be biased or incorrect) and not to simply replicate them but offer a refined, accurate, comprehensive reply.
196
223
 
197
- ### `moa debate` (sequential debate + neutral judge)
224
+ ### `moa debate` (sequential debate + moderator)
198
225
 
199
- `debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange and then asks a **separate neutral judge** to write the final answer.
226
+ `debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange overseen by a **moderator** that checks for convergence between rounds and writes the final answer.
200
227
 
201
- **Roles.** By default the top **2** selected agents are the debaters and the **3rd** is the judge - so the default `-n 3` maps to *2 debaters + 1 judge*. Pin a specific judge with `-j/--judge PROVIDER`; the judge must be one of the selected agents and must **not** also be a debater. Debate needs at least 2 debaters and 1 distinct judge, so it needs at least 3 agents; with fewer it exits with a clear message rather than silently degrading.
228
+ **Roles.** The top **2** selected agents are the debaters. The **moderator** runs the per-round convergence check and writes the verdict; by default it is the top-priority selected agent (so the default 2-agent debate has agent #1 also moderate). Debate only needs **2 agents**; with fewer it exits cleanly rather than silently degrading. For a **neutral** moderator that doesn't also debate, select a third agent and pin it: `moa debate -n 3 --moderator <provider>` (the moderator must be one of the selected agents). The moderator only ever sees the transcript **anonymized + shuffled**, so even when it is itself a debater it can't favour its own answer.
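
The role rules in this paragraph can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the package's implementation: `assign_roles` is a hypothetical name, `selected` is assumed to be already in priority order, and the provider order in the examples is illustrative only.

```python
def assign_roles(selected, moderator=None):
    """Return (debaters, moderator) for a priority-ordered list of agents."""
    if len(selected) < 2:
        raise ValueError("debate needs at least 2 agents")
    debaters = selected[:2]  # the top two selected agents debate
    if moderator is None:
        moderator = selected[0]  # default: the top-priority agent also moderates
    elif moderator not in selected:
        raise ValueError("the moderator must be one of the selected agents")
    return debaters, moderator


print(assign_roles(["claude", "codex"]))
print(assign_roles(["claude", "codex", "agy"], moderator="agy"))
```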
202
229
 
203
230
  **Rounds.** `-r/--rounds` defaults to **2** (gains plateau around 2-3 rounds while token cost grows multiplicatively) and is hard-capped at **4** - higher values are clamped with a warning on stderr.
204
231
 
205
- **The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit. If every debater signals it has *no substantive change* (it may open its reply with `NO SUBSTANTIVE CHANGE`), the debate stops early before the cap.
232
+ **The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit. After each non-final round the **moderator** reads the debaters' latest answers and replies `DONE` (they've converged or fully aired their disagreement) or `CONTINUE`; a `DONE` stops the debate before the cap.
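
The loop and the round cap can be sketched as below. The callables `ask_debater` and `ask_moderator` are hypothetical stand-ins for the real agent CLI calls, the warning text is invented, and an exact `DONE` string comparison is an assumption about how the moderator's reply is read.

```python
import sys

MAX_ROUNDS = 4  # hard cap stated in the README


def run_debate(debaters, ask_debater, ask_moderator, rounds=2):
    """Run the turn loop and return a list of (round, provider, answer)."""
    if rounds > MAX_ROUNDS:
        print(f"warning: rounds clamped to {MAX_ROUNDS}", file=sys.stderr)
        rounds = MAX_ROUNDS
    latest = {}      # provider -> most recent answer
    transcript = []  # (round, provider, answer)
    for round_no in range(1, rounds + 1):
        for name in debaters:
            # The opening turn sees nothing; every later turn sees the
            # opponent's latest answer.
            seen = {other: text for other, text in latest.items() if other != name}
            answer = ask_debater(name, seen)
            latest[name] = answer
            transcript.append((round_no, name, answer))
        # The moderator is consulted only after non-final rounds.
        if round_no < rounds and ask_moderator(dict(latest)) == "DONE":
            break
    return transcript


# Stub callables so the sketch runs without any agent CLI installed.
turns = run_debate(
    ["claude", "codex"],
    ask_debater=lambda name, seen: f"{name} saw {sorted(seen)}",
    ask_moderator=lambda answers: "DONE",
    rounds=3,
)
print(turns)
```

With the stub moderator answering `DONE`, the run stops after round 1 even though three rounds were requested.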
206
233
 
207
- **The judge.** A model that is **not** a debater reads the full transcript - presented **anonymized and order-shuffled** (a model is judging, so brand/position bias is killed, per item 002) - and writes the final answer. Its prompt instructs it to weigh correctness and evidence **above** confidence and fluency. The judge's verdict is the final block (`──── verdict · judge <name> · ... ────`).
234
+ **The verdict.** The moderator reads the full transcript - presented **anonymized and order-shuffled** (so brand/position bias is killed, even when the moderator was a debater) - and writes the final answer. Its prompt instructs it to weigh correctness and evidence **above** confidence and fluency. The verdict is the final block (`──── verdict · moderator <name> · ... ────`).
208
235
 
209
- **Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the judge's verdict last. `--json` emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", ...}` record.
236
+ **Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the moderator's verdict last. `--json` emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", "moderator": "<name>", ...}` record.
210
237
 
211
- **Safety.** Debaters and the judge run in the same read-only (or `--yolo`) mode as the other verbs - there is no permission bypass. agy's partial-sandbox caveat (shell only; it can still edit files) applies here too.
238
+ **Safety.** Debaters and the moderator run in the same read-only (or `--yolo`) mode as the other verbs - there is no permission bypass. agy's partial-sandbox caveat (shell only; it can still edit files) applies here too.
212
239
 
213
- > **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters x rounds + 1` model calls) **and the least reliably beneficial.** The research is mixed-to-negative: multi-agent debate can converge on a *wrong* answer through conformity, a confident-but-incorrect debater can win on persuasiveness over correctness, and more rounds can entrench an error rather than fix it. The separate neutral judge and the adversarial-stance prompt are there to fight these failure modes, but they do not eliminate them. For most questions, `ask` or `distill` is the better default; reach for `debate` when you specifically want to surface and stress-test disagreement. (See *Can LLM Agents Really Debate?* arXiv:2511.07784, *Talk Isn't Always Cheap* arXiv:2509.05396, and the conformity/position-bias work cited in the design notes.)
240
+ > **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters × rounds` calls, plus a moderator check per round and the verdict) **and the least reliably beneficial.** The research is mixed-to-negative: multi-agent debate can converge on a *wrong* answer through conformity, a confident-but-incorrect debater can win on persuasiveness over correctness, and more rounds can entrench an error rather than fix it. The moderator and the adversarial-stance prompt are there to fight these failure modes, but they do not eliminate them. For most questions, `ask` or `distill` is the better default; reach for `debate` when you specifically want to surface and stress-test disagreement. (See *Can LLM Agents Really Debate?* arXiv:2511.07784, *Talk Isn't Always Cheap* arXiv:2509.05396, and the conformity/position-bias work cited in the design notes.)
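
As a rough worked example of that cost estimate, the sketch below counts calls for a debate that never stops early. It assumes one moderator check after each non-final round (as the loop description states) plus one verdict call; `estimate_debate_calls` is a hypothetical helper, not part of MOA.

```python
def estimate_debate_calls(debaters=2, rounds=2):
    """Upper-bound estimate of model calls for one debate with no early stop."""
    turns = debaters * rounds
    checks = max(rounds - 1, 0)  # assumption: one check after each non-final round
    verdict = 1
    return turns + checks + verdict


for rounds in (1, 2, 3, 4):
    print(rounds, estimate_debate_calls(rounds=rounds))
```

Under these assumptions the default 2-debater, 2-round debate costs up to 6 calls, and the 4-round maximum costs up to 12.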
214
241
 
215
242
  ### Attribution policy
216
243
 
@@ -231,6 +258,13 @@ Invocations below show the default (read-only) flags; `--yolo` swaps in each too
231
258
 
232
259
  Adding a new agent is a single entry in the `PROVIDERS` table in `src/moa_cli/cli.py` (executable, default model, command builder, permission flags); it then participates in detection, `-n` selection, and `distill` automatically.
233
260
 
261
+ ## Agent skill
262
+
263
+ If you drive MOA from an agent (e.g. Claude Code), there's a ready-made skill at
264
+ [`skills/moa/SKILL.md`](skills/moa/SKILL.md): it tells an agent when to reach for MOA and
265
+ how to use it (verb choice, self-exclusion via `-x <self>`, parsing the JSONL output). It
266
+ supersedes hand-rolling a "peer review" skill.
267
+
234
268
  ## Development
235
269
 
236
270
  ```bash
@@ -8,7 +8,7 @@
8
8
 
9
9
  # MOA - Mixture of Agents
10
10
 
11
- Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds before a neutral judge gives the verdict.
11
+ Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds while a moderator checks for convergence and writes the verdict.
12
12
 
13
13
  It's a drop-in, batteries-included replacement for hand-rolling parallel `claude -p` / `codex exec` / `opencode run` calls (or a "peer review" agent skill): one command, clean attributed output, made to be called by a human **or** by another agent.
14
14
 
@@ -25,17 +25,44 @@ Or run it once without installing:
25
25
  uvx --from moa-cli moa ask "Review this plan."
26
26
  ```
27
27
 
28
+ > **Requirements.** MOA drives agent CLIs you install separately - it ships no model
29
+ > or API key of its own. You need at least two of `claude` (Claude Code), `codex`,
30
+ > `agy` (Antigravity), and `opencode` on your `PATH` and logged in. Run **`moa doctor`**
31
+ > first to see which ones MOA can find; with only one installed, the "council" collapses
32
+ > to a single answer.
33
+
28
34
  ## Why
29
35
 
30
36
  A single model gives you one perspective. Asking three frontier models the same question - and seeing where they agree, diverge, or contradict - is a fast, cheap way to pressure-test an answer. MOA makes that a one-liner using the CLIs you already pay for, with no API keys of its own.
31
37
 
38
+ ### Example
39
+
40
+ ```text
41
+ $ moa ask "Is Postgres or SQLite better for a desktop app?"
42
+ Asking claude, codex, agy (timeout 180s, read-only)
43
+
44
+ ──────────────── claude (opus) · OK · 3.2s ────────────────
45
+
46
+ For a single-user desktop app, SQLite is almost always the right call:
47
+ zero-config, serverless, the whole DB is one file you can ship... [trimmed]
48
+
49
+ ─────────────── codex (gpt-5.5) · OK · 4.1s ───────────────
50
+
51
+ Use SQLite unless you expect concurrent writers or need network access.
52
+ For a desktop app neither is likely, so SQLite wins on simplicity... [trimmed]
53
+ ```
54
+
55
+ The selection note goes to stderr; the attributed answers go to stdout. In a terminal
56
+ each answer gets the rule shown above; when piped or read by another agent, the same
57
+ blocks render as plain `## ...` headings. Add `--json` for machine-readable JSONL.
58
+
32
59
  ## Usage
33
60
 
34
61
  MOA has three prompt verbs that share the same selection/output options:
35
62
 
36
63
  - **`moa ask PROMPT`** - council / peer review: N agents answer the same prompt in parallel; every answer is returned with attribution, streamed as it lands.
37
64
  - **`moa distill PROMPT`** - synthesis: run the council, then one strong aggregator merges the answers into a single unified response.
38
- - **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds, then a separate neutral judge writes the final verdict. The costliest mode; read the caveats below before reaching for it.
65
+ - **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds, with a moderator that checks for convergence between rounds and writes the final verdict. The costliest mode; read the caveats below before reaching for it.
39
66
 
40
67
  ```bash
41
68
  moa doctor # show installed CLIs and their default models
@@ -49,12 +76,12 @@ moa ask --json "..." # machine-readable JSONL (for agents
49
76
  git diff | moa ask -f - "Review this diff." # read the prompt from stdin
50
77
  moa distill "Design a rate limiter." # council, then merge into one answer
51
78
  moa distill -s codex "..." # pick who distills (auto | random | provider)
52
- moa debate "Is this race condition real?" # 2 debaters + a judge (default n=3)
79
+ moa debate "Is this race condition real?" # 2 debaters; the first also moderates (default 2 agents)
53
80
  moa debate -r 3 "..." # more rounds (default 2, hard max 4)
54
- moa debate -j claude "..." # pin who judges (must not be a debater)
81
+ moa debate --moderator agy "..." # pin a neutral moderator (a non-debater)
55
82
  ```
56
83
 
57
- The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and `-j/--judge`.
84
+ The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and `--moderator`.
58
85
 
59
86
  ### Read-only by default
60
87
 
@@ -165,17 +192,17 @@ moa config unset num # remove a key
165
192
  moa config unset model claude # remove one [models] entry
166
193
  ```
167
194
 
168
- The synthesizer default is persistable too (e.g. `moa config set synthesizer codex`); `debate`'s `-r/--rounds` and `-j/--judge` are not persisted. CLI `-m` overrides win per-provider over the config `[models]` table.
195
+ The role defaults are persistable too: the distill `synthesizer` and the debate `moderator` (e.g. `moa config set synthesizer codex`, `moa config set moderator agy`). `debate`'s `-r/--rounds` is not persisted. CLI `-m` overrides win per-provider over the config `[models]` table.
169
196
 
170
197
  ### Output
171
198
 
172
- - **stdout** carries only content: each agent's answer is fronted by a centered separator rule naming it (`──── claude (opus) · OK · 3.5s ────`) with blank lines around it for clear separation, flushed the instant that agent finishes. `moa distill` then appends the merged block (`──── synthesis · via claude · OK · ... ────`) once the aggregator finishes.
199
+ - **stdout** carries only content. In a terminal, each agent's answer is fronted by a centered box-drawing rule naming it (`──── claude (opus) · OK · 3.5s ────`) with blank lines for separation, flushed the instant that agent finishes. When stdout is **piped or read by an agent** (not a TTY), the same block renders as a plain, low-noise `## claude (opus) · OK · 3.5s` heading instead - no box-drawing. `moa distill` emits only the final merged block.
173
200
  - **stderr** carries progress and selection notes (`Asking claude, codex ...`), so piping stdout stays clean.
174
- - `--json` emits one JSON object per line (JSONL): a `{"type": "response", ...}` record per agent as it completes; `distill` then adds a `{"type": "synthesis", ...}` record. `debate` instead emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", ...}` record. Ideal when another agent calls MOA and parses the result.
201
+ - `--json` emits one JSON object per line (JSONL): `ask` writes a `{"type": "response", ...}` record per agent as it completes; `distill` writes a single `{"type": "synthesis", ...}` record (only the merged answer); `debate` writes a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", ...}` record. Ideal when another agent calls MOA and parses the result.
175
202
 
176
203
  ### `moa distill` (synthesis)
177
204
 
178
- `distill` runs the same council fan-out as `ask`, then one more pass where a strong aggregator merges the collected answers into a single, unified answer. It needs at least two successful proposer answers; with fewer it streams what it has and skips the merge. The aggregator is chosen with `-s/--synthesizer`:
205
+ `distill` runs the same council fan-out as `ask`, then one more pass where a strong aggregator merges the collected answers into a single, unified answer. **It returns only that merged answer** - the individual proposer responses are intermediates and are not printed (each one's arrival is noted on stderr so the wait isn't silent). It needs at least two successful proposer answers; with fewer it skips the merge and says so on stderr. The aggregator is chosen with `-s/--synthesizer`:
179
206
 
180
207
  - `auto` (default) - the highest-priority agent that ran (deterministic)
181
208
  - `random` - pick one of the agents that ran, at random
@@ -183,23 +210,23 @@ The synthesizer default is persistable too (e.g. `moa config set synthesizer cod
183
210
 
184
211
  The aggregator prompt is adapted from the Mixture-of-Agents "Aggregate-and-Synthesize" prompt (Wang et al. 2024): it tells the aggregator to critically evaluate the inputs (some may be biased or incorrect) and not to simply replicate them but offer a refined, accurate, comprehensive reply.
185
212
 
186
- ### `moa debate` (sequential debate + neutral judge)
213
+ ### `moa debate` (sequential debate + moderator)
187
214
 
188
- `debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange and then asks a **separate neutral judge** to write the final answer.
215
+ `debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange overseen by a **moderator** that checks for convergence between rounds and writes the final answer.
189
216
 
190
- **Roles.** By default the top **2** selected agents are the debaters and the **3rd** is the judge - so the default `-n 3` maps to *2 debaters + 1 judge*. Pin a specific judge with `-j/--judge PROVIDER`; the judge must be one of the selected agents and must **not** also be a debater. Debate needs at least 2 debaters and 1 distinct judge, so it needs at least 3 agents; with fewer it exits with a clear message rather than silently degrading.
217
+ **Roles.** The top **2** selected agents are the debaters. The **moderator** runs the per-round convergence check and writes the verdict; by default it is the top-priority selected agent (so the default 2-agent debate has agent #1 also moderate). Debate only needs **2 agents**; with fewer it exits cleanly rather than silently degrading. For a **neutral** moderator that doesn't also debate, select a third agent and pin it: `moa debate -n 3 --moderator <provider>` (the moderator must be one of the selected agents). The moderator only ever sees the transcript **anonymized + shuffled**, so even when it is itself a debater it can't favour its own answer.
191
218
 
192
219
  **Rounds.** `-r/--rounds` defaults to **2** (gains plateau around 2-3 rounds while token cost grows multiplicatively) and is hard-capped at **4** - higher values are clamped with a warning on stderr.
193
220
 
194
- **The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit. If every debater signals it has *no substantive change* (it may open its reply with `NO SUBSTANTIVE CHANGE`), the debate stops early before the cap.
221
+ **The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit. After each non-final round the **moderator** reads the debaters' latest answers and replies `DONE` (they've converged or fully aired their disagreement) or `CONTINUE`; a `DONE` stops the debate before the cap.
195
222
 
196
- **The judge.** A model that is **not** a debater reads the full transcript - presented **anonymized and order-shuffled** (a model is judging, so brand/position bias is killed, per item 002) - and writes the final answer. Its prompt instructs it to weigh correctness and evidence **above** confidence and fluency. The judge's verdict is the final block (`──── verdict · judge <name> · ... ────`).
223
+ **The verdict.** The moderator reads the full transcript - presented **anonymized and order-shuffled** (so brand/position bias is killed, even when the moderator was a debater) - and writes the final answer. Its prompt instructs it to weigh correctness and evidence **above** confidence and fluency. The verdict is the final block (`──── verdict · moderator <name> · ... ────`).
197
224
 
198
- **Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the judge's verdict last. `--json` emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", ...}` record.
225
+ **Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the moderator's verdict last. `--json` emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", "moderator": "<name>", ...}` record.
199
226
 
200
- **Safety.** Debaters and the judge run in the same read-only (or `--yolo`) mode as the other verbs - there is no permission bypass. agy's partial-sandbox caveat (shell only; it can still edit files) applies here too.
227
+ **Safety.** Debaters and the moderator run in the same read-only (or `--yolo`) mode as the other verbs - there is no permission bypass. agy's partial-sandbox caveat (shell only; it can still edit files) applies here too.
201
228
 
202
- > **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters x rounds + 1` model calls) **and the least reliably beneficial.** The research is mixed-to-negative: multi-agent debate can converge on a *wrong* answer through conformity, a confident-but-incorrect debater can win on persuasiveness over correctness, and more rounds can entrench an error rather than fix it. The separate neutral judge and the adversarial-stance prompt are there to fight these failure modes, but they do not eliminate them. For most questions, `ask` or `distill` is the better default; reach for `debate` when you specifically want to surface and stress-test disagreement. (See *Can LLM Agents Really Debate?* arXiv:2511.07784, *Talk Isn't Always Cheap* arXiv:2509.05396, and the conformity/position-bias work cited in the design notes.)
229
+ > **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters × rounds` calls, plus a moderator convergence check after each non-final round and one verdict call) **and the least reliably beneficial.** The research is mixed-to-negative: multi-agent debate can converge on a *wrong* answer through conformity, a confident-but-incorrect debater can win on persuasiveness over correctness, and more rounds can entrench an error rather than fix it. The moderator and the adversarial-stance prompt are there to fight these failure modes, but they do not eliminate them. For most questions, `ask` or `distill` is the better default; reach for `debate` when you specifically want to surface and stress-test disagreement. (See *Can LLM Agents Really Debate?* arXiv:2511.07784, *Talk Isn't Always Cheap* arXiv:2509.05396, and the conformity/position-bias work cited in the design notes.)
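The cost estimate in the caveat can be made concrete with a small upper-bound calculation. This is a sketch based on the `_run_debate` loop in this diff, which runs the moderator's convergence check only after non-final rounds; the function name `max_debate_calls` is hypothetical, and the real count is lower whenever the moderator stops the debate early or a check is skipped because a debater failed.

```python
def max_debate_calls(debaters: int, rounds: int) -> int:
    """Upper bound on model calls for one debate run."""
    turn_calls = debaters * rounds           # every debater answers every round
    convergence_checks = max(rounds - 1, 0)  # one check after each non-final round
    verdict_calls = 1                        # the moderator's final answer
    return turn_calls + convergence_checks + verdict_calls
```

With 2 debaters and the default 2 rounds this gives at most 6 calls; at the hard cap of 4 rounds it gives at most 12.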
203
230
 
204
231
  ### Attribution policy
205
232
 
@@ -220,6 +247,13 @@ Invocations below show the default (read-only) flags; `--yolo` swaps in each too
220
247
 
221
248
  Adding a new agent is a single entry in the `PROVIDERS` table in `src/moa_cli/cli.py` (executable, default model, command builder, permission flags); it then participates in detection, `-n` selection, and `distill` automatically.
222
249
 
250
+ ## Agent skill
251
+
252
+ If you drive MOA from an agent (e.g. Claude Code), there's a ready-made skill at
253
+ [`skills/moa/SKILL.md`](skills/moa/SKILL.md): it tells an agent when to reach for MOA and
254
+ how to use it (verb choice, self-exclusion via `-x <self>`, parsing the JSONL output). It
255
+ supersedes hand-rolling a "peer review" skill.
256
+
223
257
  ## Development
224
258
 
225
259
  ```bash
@@ -1,6 +1,6 @@
1
1
  [project]
2
2
  name = "moa-cli"
3
- version = "0.2.1"
3
+ version = "0.3.0"
4
4
  description = "Ask one question to multiple local AI coding CLIs in parallel and collect their answers."
5
5
  readme = "README.md"
6
6
  authors = [
@@ -1,3 +1,3 @@
1
1
  """MOA CLI package."""
2
2
 
3
- __version__ = "0.2.1"
3
+ __version__ = "0.3.0"
@@ -377,11 +377,12 @@ def build_synthesis_prompt(
377
377
 
378
378
 
379
379
  # --------------------------------------------------------------------------- #
380
- # Debate: sequential adversarial rounds, then one neutral judge writes the
381
- # verdict from the (anonymized + shuffled) full transcript. The literature is
382
- # clear that debate is the costliest and least reliably-beneficial mode: it can
383
- # converge on a wrong answer (conformity), so the judge is a separate model and
384
- # its prompt weighs correctness/evidence over confidence and fluency.
380
+ # Debate: sequential adversarial rounds, with a moderator that checks for
381
+ # convergence after each round and then writes the verdict from the (anonymized
382
+ # + shuffled) full transcript. The literature is clear that debate is the
383
+ # costliest and least reliably-beneficial mode: it can converge on a wrong answer
384
+ # (conformity), so the verdict prompt weighs correctness/evidence over confidence
385
+ # and fluency, and the anonymization holds even when the moderator also debated.
385
386
  # --------------------------------------------------------------------------- #
386
387
 
387
388
  ROUNDS_MAX = 4
@@ -393,23 +394,16 @@ ADVERSARIAL_INSTRUCTION = """Before giving your own answer, critically examine t
393
394
  other participant's answer above: identify any errors, weaknesses, unsupported claims, or \
394
395
  gaps in reasoning. Do NOT agree merely to reach consensus - only concede a point if it is \
395
396
  genuinely correct. Then give your own best, complete answer to the original question, \
396
- incorporating any valid corrections.
397
+ incorporating any valid corrections."""
397
398
 
398
- If, after this scrutiny, you have no substantive change to your previous answer and you agree \
399
- with the other participant, say so explicitly by starting your reply with the line \
400
- "NO SUBSTANTIVE CHANGE" - this lets the debate stop early."""
401
-
402
- # Phrase a debater emits when it has nothing substantive to add. When all active
403
- # debaters in a round signal this, the debate stops before the round cap.
404
- CONVERGENCE_MARKER = "NO SUBSTANTIVE CHANGE"
405
-
406
- # The neutral judge reads the full transcript (anonymized + shuffled) and writes
407
- # the final answer. It must weigh correctness/evidence over confidence/fluency -
408
- # this is where conformity-to-a-wrong-answer is most dangerous, so the judge
409
- # never just echoes the most fluent or most confident debater.
410
- JUDGE_PROMPT = """You are a neutral judge. Below is a transcript of a debate between AI coding \
411
- assistants who answered the user's question and then critiqued each other's answers across \
412
- several rounds. The participants are anonymized and presented in arbitrary order.
399
+ # The moderator reads the full transcript (anonymized + shuffled) and writes the
400
+ # final answer. It must weigh correctness/evidence over confidence/fluency - this
401
+ # is where conformity-to-a-wrong-answer is most dangerous, so it never just echoes
402
+ # the most fluent or most confident debater.
403
+ MODERATOR_VERDICT_PROMPT = """You are the moderator of this debate. Below is a transcript of a \
404
+ debate between AI coding assistants who answered the user's question and then critiqued each \
405
+ other's answers across several rounds. The participants are anonymized and presented in \
406
+ arbitrary order.
413
407
 
414
408
  Your task is to read the full debate and write the single best, final answer to the user's \
415
409
  question. Weigh correctness and the strength of evidence and reasoning ABOVE confidence, \
@@ -424,43 +418,50 @@ best possible answer.
424
418
  asserted.
425
419
  - Do not invent information that the debate does not support."""
426
420
 
421
+ # After each non-final round the moderator decides whether another round would
422
+ # materially help. It replies with a single leading word the caller branches on.
423
+ CONVERGENCE_DONE = "DONE"
424
+ MODERATOR_CONVERGENCE_PROMPT = """You are the moderator of this debate. Below are the debaters' \
425
+ latest answers to the user's question, anonymized. Decide whether they have converged on an \
426
+ answer, or at least fully aired and clarified their disagreement, so that another round would \
427
+ add nothing material.
428
+
429
+ Reply with EXACTLY one word on the first line: DONE if the debate should stop now, or CONTINUE \
430
+ if another round would materially improve the final answer. Add nothing else."""
431
+
427
432
 
428
433
  def assign_debate_roles(
429
- selected: list[Provider], judge: str | None
434
+ selected: list[Provider], moderator: str | None
430
435
  ) -> tuple[list[Provider], Provider]:
431
- """Split the selected providers into (debaters, judge).
432
-
433
- Default: the top 2 selected providers debate and the next one judges (so the
434
- default n=3 maps to 2 debaters + 1 judge). `judge` (from -j/--judge) pins the
435
- judge to a named provider, which must be one of the selected providers and
436
- must NOT also be a debater. Requires at least 2 debaters and 1 distinct judge;
437
- raises ValueError otherwise (the caller turns this into a clean exit - debate
438
- never silently degrades to fewer participants).
436
+ """Split the selected providers into (debaters, moderator).
437
+
438
+ The top 2 selected providers debate. The moderator runs the per-round
439
+ convergence check and writes the final verdict; it MAY be one of the debaters.
440
+ `moderator` is "auto" (or None) -> the top-priority selected provider (so the
441
+ default 2-agent debate has agent #1 also moderate), or a provider name that
442
+ must be among the selected providers (pin a non-debating 3rd for a neutral
443
+ moderator). Requires at least 2 selected providers; raises ValueError
444
+ otherwise (the caller turns this into a clean exit - debate never silently
445
+ degrades).
439
446
  """
440
- if judge is not None:
441
- names = [p.name for p in selected]
442
- if judge not in PROVIDERS:
443
- raise ValueError(f"Unknown judge: {judge}")
444
- if judge not in names:
445
- raise ValueError(
446
- f"Judge {judge!r} is not among the selected providers ({', '.join(names)}). "
447
- f"Pin it with -p {judge} or widen the selection."
448
- )
449
- judge_provider = next(p for p in selected if p.name == judge)
450
- debaters = [p for p in selected if p.name != judge]
451
- if len(debaters) < 2:
452
- raise ValueError(
453
- f"debate needs at least 2 debaters plus the judge ({judge}); only "
454
- f"{len(debaters)} non-judge provider(s) available. Increase -n or -p."
455
- )
456
- return debaters, judge_provider
457
-
458
- if len(selected) < 3:
447
+ if len(selected) < 2:
459
448
  raise ValueError(
460
- f"debate needs at least 3 providers (2 debaters + 1 neutral judge); "
461
- f"only {len(selected)} available. Increase -n, pin more with -p, or install more agents."
449
+ f"debate needs at least 2 providers (2 debaters); only {len(selected)} available. "
450
+ f"Increase -n, pin more with -p, or install more agents."
462
451
  )
463
- return selected[:2], selected[2]
452
+ debaters = selected[:2]
453
+ if moderator in (None, "auto"):
454
+ return debaters, selected[0]
455
+
456
+ names = [p.name for p in selected]
457
+ if moderator not in PROVIDERS:
458
+ raise ValueError(f"Unknown moderator: {moderator}")
459
+ if moderator not in names:
460
+ raise ValueError(
461
+ f"Moderator {moderator!r} is not among the selected providers ({', '.join(names)}). "
462
+ f"Pin it with -p {moderator} or widen the selection."
463
+ )
464
+ return debaters, next(p for p in selected if p.name == moderator)
464
465
 
465
466
 
466
467
  def clamp_rounds(rounds: int) -> tuple[int, str | None]:
@@ -496,18 +497,19 @@ def build_debate_turn_prompt(
496
497
  )
497
498
 
498
499
 
499
- def build_judge_prompt(
500
+ def build_verdict_prompt(
500
501
  question: str,
501
502
  transcript: list[RunResult],
502
503
  rng: random.Random | None = None,
503
504
  ) -> tuple[str, dict[str, str]]:
504
- """Build the judge prompt from the debate transcript, anonymized + shuffled.
505
-
506
- The transcript is the per-turn RunResults; the judge sees only the final
507
- answer text of each turn, relabelled "Participant 1/2/.." in shuffled order
508
- (a model is judging, so brand/position bias is killed per the research). The
509
- label_map maps each label back to the real provider for the caller, though
510
- debate does not reveal it in the verdict.
505
+ """Build the moderator's final-verdict prompt from the transcript, anonymized
506
+ + shuffled.
507
+
508
+ The transcript is the per-turn RunResults; the moderator sees only the final
509
+ answer text of each turn, relabelled "Participant 1/2/.." in shuffled order so
510
+ brand/position bias is killed - this matters even when the moderator is itself
511
+ a debater, since it can't tell which answer is its own. The label_map maps each
512
+ label back to the real provider for the caller, though debate never reveals it.
511
513
  """
512
514
  turns = [r for r in transcript if r.status == "ok"]
513
515
  shuffled = list(turns)
@@ -519,13 +521,27 @@ def build_judge_prompt(
519
521
  sections.append(f"### {label}\n\n{result.stdout.strip()}")
520
522
  label_map[label] = result.provider
521
523
  prompt = (
522
- f"{JUDGE_PROMPT}\n\n"
524
+ f"{MODERATOR_VERDICT_PROMPT}\n\n"
523
525
  f"## User question\n\n{question}\n\n"
524
526
  f"## Debate transcript\n\n" + "\n\n".join(sections) + "\n\n## Your final answer\n"
525
527
  )
526
528
  return prompt, label_map
527
529
 
528
530
 
531
+ def build_convergence_prompt(question: str, latest: list[RunResult]) -> str:
532
+ """The moderator's per-round convergence check. `latest` is the debaters' most
533
+ recent answers, anonymized so the moderator judges substance over brand. The
534
+ expected reply starts with DONE (stop) or CONTINUE (another round helps)."""
535
+ answers = "\n\n".join(
536
+ f"### Participant {i + 1}\n\n{r.stdout.strip()}" for i, r in enumerate(latest)
537
+ )
538
+ return (
539
+ f"{MODERATOR_CONVERGENCE_PROMPT}\n\n"
540
+ f"## User question\n\n{question}\n\n"
541
+ f"## The debaters' latest answers\n\n{answers}\n\n## Your decision\n"
542
+ )
543
+
544
+
529
545
  # --------------------------------------------------------------------------- #
530
546
  # Render: stdout carries content (Markdown or JSONL); stderr carries progress.
531
547
  # --------------------------------------------------------------------------- #
@@ -560,21 +576,36 @@ def _body(result: RunResult) -> list[str]:
560
576
  return ["```text", detail[-1200:], "```", ""]
561
577
 
562
578
 
563
- def _render(label: str, result: RunResult) -> str:
564
- """A block: two leading blank lines, the named rule, a blank line, the body.
565
- The leading blanks give each answer clear breathing room as blocks stream."""
579
+ def _plain_output() -> bool:
580
+ """True when stdout is not an interactive terminal - piped, redirected, or
581
+ read by another agent (the common "an agent shells out to moa" case). There
582
+ we drop the decorative box-drawing rule and extra blank lines for a plain,
583
+ low-noise `## label` heading that is cheaper for a model to consume."""
584
+ return not sys.stdout.isatty()
585
+
586
+
587
+ def _render(label: str, result: RunResult, plain: bool) -> str:
588
+ """One answer block. In a terminal: two leading blank lines and a centered
589
+ box-drawing rule, for clear visual separation as blocks stream in. When
590
+ piped: a plain `## label` heading with a single blank line, no box-drawing."""
591
+ if plain:
592
+ return "\n".join(["", f"## {label}", "", *_body(result)])
566
593
  return "\n".join(["", "", _rule(label), "", *_body(result)])
567
594
 
568
595
 
569
- def render_block(result: RunResult) -> str:
596
+ def render_block(result: RunResult, plain: bool | None = None) -> str:
597
+ if plain is None:
598
+ plain = _plain_output()
570
599
  model = f" ({result.model})" if result.model else ""
571
600
  label = f"{result.provider}{model} · {_status_label(result.status)} · {result.elapsed:.1f}s"
572
- return _render(label, result)
601
+ return _render(label, result, plain)
573
602
 
574
603
 
575
- def render_synthesis_block(result: RunResult, synthesizer: str) -> str:
604
+ def render_synthesis_block(result: RunResult, synthesizer: str, plain: bool | None = None) -> str:
605
+ if plain is None:
606
+ plain = _plain_output()
576
607
  label = f"synthesis · via {synthesizer} · {_status_label(result.status)} · {result.elapsed:.1f}s"
577
- return _render(label, result)
608
+ return _render(label, result, plain)
578
609
 
579
610
 
580
611
  def result_record(result: RunResult) -> dict:
@@ -601,18 +632,22 @@ def synthesis_record(result: RunResult, synthesizer: str) -> dict:
601
632
  }
602
633
 
603
634
 
604
- def render_debate_turn_block(result: RunResult, round_num: int) -> str:
635
+ def render_debate_turn_block(result: RunResult, round_num: int, plain: bool | None = None) -> str:
636
+ if plain is None:
637
+ plain = _plain_output()
605
638
  model = f" ({result.model})" if result.model else ""
606
639
  label = (
607
640
  f"round {round_num} · {result.provider}{model} · "
608
641
  f"{_status_label(result.status)} · {result.elapsed:.1f}s"
609
642
  )
610
- return _render(label, result)
643
+ return _render(label, result, plain)
611
644
 
612
645
 
613
- def render_judge_block(result: RunResult, judge: str) -> str:
614
- label = f"verdict · judge {judge} · {_status_label(result.status)} · {result.elapsed:.1f}s"
615
- return _render(label, result)
646
+ def render_verdict_block(result: RunResult, moderator: str, plain: bool | None = None) -> str:
647
+ if plain is None:
648
+ plain = _plain_output()
649
+ label = f"verdict · moderator {moderator} · {_status_label(result.status)} · {result.elapsed:.1f}s"
650
+ return _render(label, result, plain)
616
651
 
617
652
 
618
653
  def debate_turn_record(result: RunResult, round_num: int) -> dict:
@@ -629,10 +664,10 @@ def debate_turn_record(result: RunResult, round_num: int) -> dict:
629
664
  }
630
665
 
631
666
 
632
- def judge_record(result: RunResult, judge: str) -> dict:
667
+ def verdict_record(result: RunResult, moderator: str) -> dict:
633
668
  return {
634
669
  "type": "verdict",
635
- "judge": judge,
670
+ "moderator": moderator,
636
671
  "status": result.status,
637
672
  "elapsed": round(result.elapsed, 3),
638
673
  "text": result.stdout,
@@ -653,12 +688,21 @@ def judge_record(result: RunResult, judge: str) -> dict:
653
688
 
654
689
  # Scalar config keys and the type each maps to. `exclude` (list[str]) and the
655
690
  # `[models]` table are handled separately because they aren't plain scalars.
656
- _CONFIG_SCALARS: dict[str, type] = {"num": int, "timeout": float, "synthesizer": str}
691
+ _CONFIG_SCALARS: dict[str, type] = {"num": int, "timeout": float, "synthesizer": str, "moderator": str}
657
692
  _CONFIG_KEYS: tuple[str, ...] = (*_CONFIG_SCALARS, "exclude", "models")
658
693
  # Synthesizer accepts the special modes plus any known provider name.
659
694
  _SYNTHESIZER_MODES: tuple[str, ...] = ("auto", "first", "random")
695
+ # Moderator accepts "auto" (the top-priority selected agent) or a provider name.
696
+ _MODERATOR_MODES: tuple[str, ...] = ("auto",)
660
697
  # The built-in defaults, shown by `config show` when a key isn't in the file.
661
- _CONFIG_DEFAULTS: dict = {"num": 3, "timeout": 180.0, "synthesizer": "auto", "exclude": [], "models": {}}
698
+ _CONFIG_DEFAULTS: dict = {
699
+ "num": 3,
700
+ "timeout": 180.0,
701
+ "synthesizer": "auto",
702
+ "moderator": "auto",
703
+ "exclude": [],
704
+ "models": {},
705
+ }
662
706
 
663
707
 
664
708
  def config_dir() -> Path:
@@ -690,6 +734,9 @@ def _validate_scalar(key: str, value) -> None:
690
734
  if key == "synthesizer" and value not in (*_SYNTHESIZER_MODES, *PROVIDERS):
691
735
  allowed = ", ".join((*_SYNTHESIZER_MODES, *PROVIDERS))
692
736
  raise ValueError(f"synthesizer must be one of: {allowed}.")
737
+ if key == "moderator" and value not in (*_MODERATOR_MODES, *PROVIDERS):
738
+ allowed = ", ".join((*_MODERATOR_MODES, *PROVIDERS))
739
+ raise ValueError(f"moderator must be one of: {allowed}.")
693
740
 
694
741
 
695
742
  def load_config() -> dict:
@@ -767,6 +814,8 @@ def serialize_config(config: dict) -> str:
767
814
  lines.append(f"timeout = {int(timeout) if timeout.is_integer() else timeout!r}")
768
815
  if "synthesizer" in config:
769
816
  lines.append(f"synthesizer = {_toml_str(config['synthesizer'])}")
817
+ if "moderator" in config:
818
+ lines.append(f"moderator = {_toml_str(config['moderator'])}")
770
819
  if "exclude" in config:
771
820
  items = ", ".join(_toml_str(v) for v in config["exclude"])
772
821
  lines.append(f"exclude = [{items}]")
@@ -877,11 +926,20 @@ async def _collect(
877
926
  json_output: bool,
878
927
  models: dict[str, str] | None = None,
879
928
  yolo: bool = False,
929
+ emit_blocks: bool = True,
880
930
  ) -> list[RunResult]:
931
+ """Gather every agent's result. With emit_blocks (ask), each complete answer
932
+ is flushed to stdout the instant it arrives. Without it (distill), the
933
+ individual answers are intermediates the user shouldn't see - only the final
934
+ distilled block is content - so we keep stdout clean and just heartbeat each
935
+ arrival to stderr so a multi-agent run doesn't look frozen while it waits."""
881
936
  results: list[RunResult] = []
882
937
  async for result in stream(providers, prompt, timeout, models, yolo):
883
938
  results.append(result)
884
- _emit(json.dumps(result_record(result)) if json_output else render_block(result))
939
+ if emit_blocks:
940
+ _emit(json.dumps(result_record(result)) if json_output else render_block(result))
941
+ else:
942
+ _note(f" {result.provider} responded ({_status_label(result.status)}, {result.elapsed:.1f}s)")
885
943
  return results
886
944
 
887
945
 
@@ -945,6 +1003,7 @@ def resolve_run(
945
1003
  timeout: float | None,
946
1004
  json_output: bool,
947
1005
  yolo: bool,
1006
+ default_num: int = 3,
948
1007
  ) -> RunConfig:
949
1008
  """Resolve the shared options into a RunConfig, emitting the selection note.
950
1009
 
@@ -953,7 +1012,9 @@ def resolve_run(
953
1012
  flag), parse model overrides, select providers, and print the stderr
954
1013
  selection note (including agy's honest partial-protection note). Every verb
955
1014
  picks up config defaults identically because the merge lives only here.
956
- Raises typer.BadParameter on bad input and typer.Exit(1) when nothing runs.
1015
+ `default_num` is the built-in fallback when neither flag nor config sets num
1016
+ (debate passes 2, since it only needs 2 agents). Raises typer.BadParameter on
1017
+ bad input and typer.Exit(1) when nothing runs.
957
1018
  """
958
1019
  prompt_text = _read_prompt(prompt, file)
959
1020
  if not prompt_text:
@@ -965,7 +1026,7 @@ def resolve_run(
965
1026
  except ValueError as exc:
966
1027
  raise typer.BadParameter(f"{config_path()}: {exc}") from exc
967
1028
 
968
- num = resolve_option(num, "num", config, 3)
1029
+ num = resolve_option(num, "num", config, default_num)
969
1030
  timeout = resolve_option(timeout, "timeout", config, 180.0)
970
1031
  # Repeatable flags are an empty list when omitted, not None, so treat empty
971
1032
  # as "fall back to config" for exclude.
@@ -1051,8 +1112,13 @@ def distill(
1051
1112
  # it merges through the same precedence: CLI flag > config file > built-in.
1052
1113
  synthesizer = resolve_option(synthesizer, "synthesizer", _read_config_or_empty(), "auto")
1053
1114
 
1115
+ # distill returns only the merged answer, so the proposer responses are
1116
+ # intermediates: collect them without printing each to stdout.
1054
1117
  results = asyncio.run(
1055
- _collect(cfg.selected, cfg.prompt, cfg.timeout, cfg.json_output, cfg.models, cfg.yolo)
1118
+ _collect(
1119
+ cfg.selected, cfg.prompt, cfg.timeout, cfg.json_output, cfg.models, cfg.yolo,
1120
+ emit_blocks=False,
1121
+ )
1056
1122
  )
1057
1123
  successes = [r for r in results if r.status == "ok"]
1058
1124
 
@@ -1098,9 +1164,12 @@ def _run_synthesis(
1098
1164
  RoundsOpt = Annotated[
1099
1165
  int, typer.Option("--rounds", "-r", help=f"Debate rounds (default 2, hard max {ROUNDS_MAX}).")
1100
1166
  ]
1101
- JudgeOpt = Annotated[
1167
+ ModeratorOpt = Annotated[
1102
1168
  str | None,
1103
- typer.Option("--judge", "-j", help="Provider that judges (must not be a debater)."),
1169
+ typer.Option(
1170
+ "--moderator", "-j",
1171
+ help="Moderator that checks convergence and writes the verdict: auto | a provider.",
1172
+ ),
1104
1173
  ]
1105
1174
 
1106
1175
 
@@ -1114,60 +1183,80 @@ def debate(
1114
1183
  file: FileOpt = None,
1115
1184
  timeout: TimeoutOpt = None,
1116
1185
  rounds: RoundsOpt = 2,
1117
- judge: JudgeOpt = None,
1186
+ moderator: ModeratorOpt = None,
1118
1187
  json_output: JsonOpt = False,
1119
1188
  yolo: YoloOpt = False,
1120
1189
  ) -> None:
1121
- """Debate: debaters answer and critique each other across rounds; a neutral judge gives the verdict."""
1122
- cfg = resolve_run(prompt, file, num, provider, exclude, model, timeout, json_output, yolo)
1190
+ """Debate: two debaters answer and critique each other across rounds; a moderator checks convergence and writes the verdict."""
1191
+ # Debate only needs 2 agents (the moderator may also be a debater), so its
1192
+ # built-in default selection is 2, not the usual 3.
1193
+ cfg = resolve_run(
1194
+ prompt, file, num, provider, exclude, model, timeout, json_output, yolo, default_num=2
1195
+ )
1196
+
1197
+ # moderator is verb-specific (like distill's synthesizer) but persistable, so
1198
+ # it merges through the same precedence: CLI flag > config file > built-in.
1199
+ moderator = resolve_option(moderator, "moderator", _read_config_or_empty(), "auto")
1123
1200
 
1124
1201
  rounds, warning = clamp_rounds(rounds)
1125
1202
  if warning:
1126
1203
  _note(warning)
1127
1204
 
1128
1205
  try:
1129
- debaters, judge_provider = assign_debate_roles(cfg.selected, judge)
1206
+ debaters, moderator_provider = assign_debate_roles(cfg.selected, moderator)
1130
1207
  except ValueError as exc:
1131
1208
  _note(f"debate: {exc}")
1132
1209
  raise typer.Exit(code=1) from exc
1133
1210
 
1134
1211
  _note(
1135
- f"Debating: {', '.join(p.name for p in debaters)} over {rounds} round(s), "
1136
- f"judge {judge_provider.name}. Debate is the costliest mode "
1137
- f"(~{len(debaters) * rounds + 1} model calls) and can converge on a wrong answer."
1212
+ f"Debating: {', '.join(p.name for p in debaters)} over up to {rounds} round(s), "
1213
+ f"moderator {moderator_provider.name}. Debate is the costliest mode and can "
1214
+ f"converge on a wrong answer."
1138
1215
  )
1139
1216
 
1140
- transcript = asyncio.run(_run_debate(cfg, debaters, judge_provider, rounds))
1217
+ transcript = asyncio.run(_run_debate(cfg, debaters, moderator_provider, rounds))
1141
1218
  if not any(r.status == "ok" for r in transcript):
1142
1219
  raise typer.Exit(code=1)
1143
1220
 
1144
1221
 
1145
- def _signals_convergence(result: RunResult) -> bool:
1146
- """A debater concedes when its answer opens with the convergence marker."""
1147
- return result.status == "ok" and result.stdout.strip().upper().startswith(CONVERGENCE_MARKER)
1222
+ async def _moderator_signals_done(
1223
+ cfg: RunConfig, moderator: Provider, latest_ok: list[RunResult], round_num: int
1224
+ ) -> bool:
1225
+ """Ask the moderator whether the debate has converged. Returns True (stop)
1226
+ only on a clean DONE reply; a failed or CONTINUE check keeps debating."""
1227
+ prompt = build_convergence_prompt(cfg.prompt, latest_ok)
1228
+ _note(f"Round {round_num}: moderator {moderator.name} checking for convergence...")
1229
+ result = await run_provider(
1230
+ moderator, prompt, cfg.timeout, cfg.models.get(moderator.name), cfg.yolo
1231
+ )
1232
+ done = result.status == "ok" and result.stdout.strip().upper().startswith(CONVERGENCE_DONE)
1233
+ if done:
1234
+ _note(f"Moderator {moderator.name}: converged; stopping after round {round_num}.")
1235
+ return done
1148
1236
 
1149
1237
 
1150
1238
  async def _run_debate(
1151
1239
  cfg: RunConfig,
1152
1240
  debaters: list[Provider],
1153
- judge: Provider,
1241
+ moderator: Provider,
1154
1242
  rounds: int,
1155
1243
  ) -> list[RunResult]:
1156
- """Run the sequential debate, then the judge. Returns the full transcript.
1157
-
1158
- Each debater keeps its latest answer in `latest`. A turn shows the debater
1159
- the OTHER debaters' latest answers (anonymized) plus the adversarial
1160
- instruction; the very first turn (no priors yet) is a cold answer. Turns
1161
- stream as they complete (stderr progress + stdout/JSON block). If every
1162
- active debater signals "no substantive change" in a round, the debate stops
1163
- before the cap. The judge then reads the blind+shuffled transcript and writes
1164
- the verdict last.
1244
+ """Run the sequential debate, then the moderator's verdict. Returns the full
1245
+ transcript.
1246
+
1247
+ Each debater keeps its latest answer in `latest`. A turn shows the debater the
1248
+ OTHER debaters' latest answers (anonymized) plus the adversarial instruction;
1249
+ the very first turn (no priors yet) is a cold answer. Turns stream as they
1250
+ complete (stderr progress + stdout/JSON block). After each non-final round the
1251
+ moderator decides whether the debate has converged and can stop early. The
1252
+ moderator then reads the blind+shuffled transcript and writes the verdict last
1253
+ (it may itself be a debater - the anonymization stops it favouring its own
1254
+ answer).
1165
1255
  """
1166
1256
  transcript: list[RunResult] = []
1167
1257
  latest: dict[str, RunResult] = {}
1168
1258
 
1169
1259
  for round_num in range(1, rounds + 1):
1170
- converged_this_round = True
1171
1260
  for debater in debaters:
1172
1261
  prior = [
1173
1262
  ("the other participant", latest[other.name].stdout)
@@ -1186,34 +1275,36 @@ async def _run_debate(
1186
1275
  if cfg.json_output
1187
1276
  else render_debate_turn_block(result, round_num)
1188
1277
  )
1189
- # A debater that errors out is not "converged"; only an explicit
1190
- # no-change signal counts toward an early stop.
1191
- if not _signals_convergence(result):
1192
- converged_this_round = False
1193
1278
 
1194
- # Round 1 always has at least one cold answer (no prior to converge on),
1195
- # so early-stop is only meaningful from round 2 onward.
1196
- if round_num >= 2 and converged_this_round:
1197
- _note(f"Debate converged after round {round_num} (no substantive changes); stopping early.")
1198
- break
1279
+ # After each non-final round, let the moderator stop early if the debaters
1280
+ # have converged. Needs both debaters' latest answers to compare.
1281
+ if round_num < rounds:
1282
+ latest_ok = [
1283
+ latest[d.name] for d in debaters
1284
+ if d.name in latest and latest[d.name].status == "ok"
1285
+ ]
1286
+ if len(latest_ok) >= 2 and await _moderator_signals_done(
1287
+ cfg, moderator, latest_ok, round_num
1288
+ ):
1289
+ break
1199
1290
 
1200
1291
  if not any(r.status == "ok" for r in transcript):
1201
- _note("Debate produced no usable answers; skipping judge.")
1292
+ _note("Debate produced no usable answers; skipping the moderator verdict.")
1202
1293
  return transcript
1203
1294
 
1204
- # The judge always sees the transcript anonymized + shuffled (a model is
1205
- # judging; per item 002 there is no toggle). It runs in the same read-only /
1206
- # --yolo mode as the debaters - no permission bypass.
1207
- judge_prompt, _label_map = build_judge_prompt(cfg.prompt, transcript)
1208
- _note(f"Judging with {judge.name}...")
1295
+ # The moderator always sees the transcript anonymized + shuffled (a model is
1296
+ # judging; no toggle). It runs in the same read-only / --yolo mode as the
1297
+ # debaters - no permission bypass.
1298
+ verdict_prompt, _label_map = build_verdict_prompt(cfg.prompt, transcript)
1299
+ _note(f"Moderator {moderator.name} writing the final answer...")
1209
1300
  verdict = await run_provider(
1210
- judge, judge_prompt, cfg.timeout, cfg.models.get(judge.name), cfg.yolo
1301
+ moderator, verdict_prompt, cfg.timeout, cfg.models.get(moderator.name), cfg.yolo
1211
1302
  )
1212
1303
  transcript.append(verdict)
1213
1304
  _emit(
1214
- json.dumps(judge_record(verdict, judge.name))
1305
+ json.dumps(verdict_record(verdict, moderator.name))
1215
1306
  if cfg.json_output
1216
- else render_judge_block(verdict, judge.name)
1307
+ else render_verdict_block(verdict, moderator.name)
1217
1308
  )
1218
1309
  return transcript
1219
1310
 
@@ -1280,7 +1371,7 @@ def config_show() -> None:
1280
1371
 
1281
1372
  @config_app.command("set")
1282
1373
  def config_set(
1283
- key: Annotated[str, typer.Argument(help="Config key: num | timeout | synthesizer | exclude | model.")],
1374
+ key: Annotated[str, typer.Argument(help="Config key: num | timeout | synthesizer | moderator | exclude | model.")],
1284
1375
  value: Annotated[str, typer.Argument(help="Value. For models: PROVIDER=MODEL. For exclude: comma-separated names.")],
1285
1376
  ) -> None:
1286
1377
  """Write a value to the config file, creating the dir/file if missing."""
@@ -1313,7 +1404,7 @@ def config_set(
1313
1404
  raise typer.BadParameter(str(exc)) from exc
1314
1405
  config[key] = coerced
1315
1406
  else:
1316
- known = "num, timeout, synthesizer, exclude, model"
1407
+ known = "num, timeout, synthesizer, moderator, exclude, model"
1317
1408
  raise typer.BadParameter(f"Unknown config key: {key!r}. Known: {known}.")
1318
1409
 
1319
1410
  write_config(config)