moa-cli 0.2.1__tar.gz → 0.3.0__tar.gz
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.3
|
|
2
2
|
Name: moa-cli
|
|
3
|
-
Version: 0.2.1
|
|
3
|
+
Version: 0.3.0
|
|
4
4
|
Summary: Ask one question to multiple local AI coding CLIs in parallel and collect their answers.
|
|
5
5
|
Keywords: llm,agents,cli,claude,codex,agy,opencode,peer-review
|
|
6
6
|
Author: Paul-Louis Pröve
|
|
@@ -19,7 +19,7 @@ Description-Content-Type: text/markdown
|
|
|
19
19
|
|
|
20
20
|
# MOA - Mixture of Agents
|
|
21
21
|
|
|
22
|
-
Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds
|
|
22
|
+
Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds while a moderator checks for convergence and writes the verdict.
|
|
23
23
|
|
|
24
24
|
It's a drop-in, batteries-included replacement for hand-rolling parallel `claude -p` / `codex exec` / `opencode run` calls (or a "peer review" agent skill): one command, clean attributed output, made to be called by a human **or** by another agent.
|
|
25
25
|
|
|
@@ -36,17 +36,44 @@ Or run it once without installing:
|
|
|
36
36
|
uvx --from moa-cli moa ask "Review this plan."
|
|
37
37
|
```
|
|
38
38
|
|
|
39
|
+
> **Requirements.** MOA drives agent CLIs you install separately - it ships no model
|
|
40
|
+
> or API key of its own. You need at least two of `claude` (Claude Code), `codex`,
|
|
41
|
+
> `agy` (Antigravity), and `opencode` on your `PATH` and logged in. Run **`moa doctor`**
|
|
42
|
+
> first to see which ones MOA can find; with only one installed, the "council" collapses
|
|
43
|
+
> to a single answer.
|
|
44
|
+
|
|
39
45
|
## Why
|
|
40
46
|
|
|
41
47
|
A single model gives you one perspective. Asking three frontier models the same question - and seeing where they agree, diverge, or contradict - is a fast, cheap way to pressure-test an answer. MOA makes that a one-liner using the CLIs you already pay for, with no API keys of its own.
|
|
42
48
|
|
|
49
|
+
### Example
|
|
50
|
+
|
|
51
|
+
```text
|
|
52
|
+
$ moa ask "Is Postgres or SQLite better for a desktop app?"
|
|
53
|
+
Asking claude, codex, agy (timeout 180s, read-only)
|
|
54
|
+
|
|
55
|
+
──────────────── claude (opus) · OK · 3.2s ────────────────
|
|
56
|
+
|
|
57
|
+
For a single-user desktop app, SQLite is almost always the right call:
|
|
58
|
+
zero-config, serverless, the whole DB is one file you can ship... [trimmed]
|
|
59
|
+
|
|
60
|
+
─────────────── codex (gpt-5.5) · OK · 4.1s ───────────────
|
|
61
|
+
|
|
62
|
+
Use SQLite unless you expect concurrent writers or need network access.
|
|
63
|
+
For a desktop app neither is likely, so SQLite wins on simplicity... [trimmed]
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
The selection note goes to stderr; the attributed answers go to stdout. In a terminal
|
|
67
|
+
each answer gets the rule shown above; when piped or read by another agent, the same
|
|
68
|
+
blocks render as plain `## ...` headings. Add `--json` for machine-readable JSONL.
|
|
69
|
+
|
|
43
70
|
## Usage
|
|
44
71
|
|
|
45
72
|
MOA has three prompt verbs that share the same selection/output options:
|
|
46
73
|
|
|
47
74
|
- **`moa ask PROMPT`** - council / peer review: N agents answer the same prompt in parallel; every answer is returned with attribution, streamed as it lands.
|
|
48
75
|
- **`moa distill PROMPT`** - synthesis: run the council, then one strong aggregator merges the answers into a single unified response.
|
|
49
|
-
- **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds,
|
|
76
|
+
- **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds, with a moderator that checks for convergence between rounds and writes the final verdict. The costliest mode; read the caveats below before reaching for it.
|
|
50
77
|
|
|
51
78
|
```bash
|
|
52
79
|
moa doctor # show installed CLIs and their default models
|
|
@@ -60,12 +87,12 @@ moa ask --json "..." # machine-readable JSONL (for agents
|
|
|
60
87
|
git diff | moa ask -f - "Review this diff." # read the prompt from stdin
|
|
61
88
|
moa distill "Design a rate limiter." # council, then merge into one answer
|
|
62
89
|
moa distill -s codex "..." # pick who distills (auto | random | provider)
|
|
63
|
-
moa debate "Is this race condition real?" # 2 debaters
|
|
90
|
+
moa debate "Is this race condition real?" # 2 debaters; the first also moderates (default 2 agents)
|
|
64
91
|
moa debate -r 3 "..." # more rounds (default 2, hard max 4)
|
|
65
|
-
moa debate
|
|
92
|
+
moa debate --moderator agy "..." # pin a neutral moderator (a non-debater)
|
|
66
93
|
```
|
|
67
94
|
|
|
68
|
-
The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and
|
|
95
|
+
The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and `--moderator`.
|
|
69
96
|
|
|
70
97
|
### Read-only by default
|
|
71
98
|
|
|
@@ -176,17 +203,17 @@ moa config unset num # remove a key
|
|
|
176
203
|
moa config unset model claude # remove one [models] entry
|
|
177
204
|
```
|
|
178
205
|
|
|
179
|
-
The
|
|
206
|
+
The role defaults are persistable too: the distill `synthesizer` and the debate `moderator` (e.g. `moa config set synthesizer codex`, `moa config set moderator agy`). `debate`'s `-r/--rounds` is not persisted. CLI `-m` overrides win per-provider over the config `[models]` table.
|
|
180
207
|
|
|
181
208
|
### Output
|
|
182
209
|
|
|
183
|
-
- **stdout** carries only content
|
|
210
|
+
- **stdout** carries only content. In a terminal, each agent's answer is fronted by a centered box-drawing rule naming it (`──── claude (opus) · OK · 3.5s ────`) with blank lines for separation, flushed the instant that agent finishes. When stdout is **piped or read by an agent** (not a TTY), the same block renders as a plain, low-noise `## claude (opus) · OK · 3.5s` heading instead - no box-drawing. `moa distill` emits only the final merged block.
|
|
184
211
|
- **stderr** carries progress and selection notes (`Asking claude, codex ...`), so piping stdout stays clean.
|
|
185
|
-
- `--json` emits one JSON object per line (JSONL): a `{"type": "response", ...}` record per agent as it completes; `distill`
|
|
212
|
+
- `--json` emits one JSON object per line (JSONL): `ask` writes a `{"type": "response", ...}` record per agent as it completes; `distill` writes a single `{"type": "synthesis", ...}` record (only the merged answer); `debate` writes a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", ...}` record. Ideal when another agent calls MOA and parses the result.
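A caller that parses the JSONL stream can branch on the `type` field. The following is a minimal sketch that relies only on the record types named above (`response`, `synthesis`, `debate_turn`, `verdict`); the helper name `group_records` and the `text` field in the sample input are illustrative assumptions, not part of MOA's documented schema.

```python
import json
from collections import defaultdict


def group_records(lines):
    """Group MOA --json output lines by their "type" field.

    `lines` is any iterable of strings (for example a subprocess's stdout).
    Blank lines are skipped. Only the "type" key is relied on here.
    """
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        grouped[record.get("type", "unknown")].append(record)
    return dict(grouped)


# Hypothetical sample input; the "text" field is made up for illustration.
sample = [
    '{"type": "debate_turn", "round": 1, "text": "..."}',
    '{"type": "debate_turn", "round": 1, "text": "..."}',
    '{"type": "verdict", "moderator": "agy", "text": "..."}',
]
records = group_records(sample)
```

Grouping by `type` keeps the consumer tolerant of record kinds it does not know about: they land under their own key instead of raising.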
|
|
186
213
|
|
|
187
214
|
### `moa distill` (synthesis)
|
|
188
215
|
|
|
189
|
-
`distill` runs the same council fan-out as `ask`, then one more pass where a strong aggregator merges the collected answers into a single, unified answer. It needs at least two successful proposer answers; with fewer it
|
|
216
|
+
`distill` runs the same council fan-out as `ask`, then one more pass where a strong aggregator merges the collected answers into a single, unified answer. **It returns only that merged answer** - the individual proposer responses are intermediates and are not printed (each one's arrival is noted on stderr so the wait isn't silent). It needs at least two successful proposer answers; with fewer it skips the merge and says so on stderr. The aggregator is chosen with `-s/--synthesizer`:
|
|
190
217
|
|
|
191
218
|
- `auto` (default) - the highest-priority agent that ran (deterministic)
|
|
192
219
|
- `random` - pick one of the agents that ran, at random
|
|
@@ -194,23 +221,23 @@ The synthesizer default is persistable too (e.g. `moa config set synthesizer cod
|
|
|
194
221
|
|
|
195
222
|
The aggregator prompt is adapted from the Mixture-of-Agents "Aggregate-and-Synthesize" prompt (Wang et al. 2024): it tells the aggregator to critically evaluate the inputs (some may be biased or incorrect) and not to simply replicate them but offer a refined, accurate, comprehensive reply.
|
|
196
223
|
|
|
197
|
-
### `moa debate` (sequential debate +
|
|
224
|
+
### `moa debate` (sequential debate + moderator)
|
|
198
225
|
|
|
199
|
-
`debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange
|
|
226
|
+
`debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange overseen by a **moderator** that checks for convergence between rounds and writes the final answer.
|
|
200
227
|
|
|
201
|
-
**Roles.**
|
|
228
|
+
**Roles.** The top **2** selected agents are the debaters. The **moderator** runs the per-round convergence check and writes the verdict; by default it is the top-priority selected agent (so the default 2-agent debate has agent #1 also moderate). Debate only needs **2 agents**; with fewer it exits cleanly rather than silently degrading. For a **neutral** moderator that doesn't also debate, select a third agent and pin it: `moa debate -n 3 --moderator <provider>` (the moderator must be one of the selected agents). The moderator only ever sees the transcript **anonymized + shuffled**, so even when it is itself a debater it can't favour its own answer.
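The role rules above can be sketched in a few lines. This is a simplified illustration, not the package's `assign_debate_roles`: it uses plain provider-name strings instead of `Provider` objects, assumes `selected` is already ordered by priority, and the function name `sketch_assign_roles` and the error messages are hypothetical.

```python
def sketch_assign_roles(selected, moderator=None):
    """Return (debaters, moderator) following the rules described above.

    selected:  provider names, assumed to be ordered by priority.
    moderator: None or "auto" picks the top-priority selected provider;
               any other value must be one of the selected providers.
    """
    if len(selected) < 2:
        # The text says debate exits cleanly instead of silently degrading.
        raise ValueError("debate needs at least 2 selected agents")
    debaters = selected[:2]
    if moderator is None or moderator == "auto":
        return debaters, selected[0]
    if moderator not in selected:
        raise ValueError(f"moderator {moderator!r} is not among the selected agents")
    return debaters, moderator
```

With `["claude", "codex"]` the first agent both debates and moderates; with `["claude", "codex", "agy"]` and `moderator="agy"` the moderator is a neutral third agent.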
|
|
202
229
|
|
|
203
230
|
**Rounds.** `-r/--rounds` defaults to **2** (gains plateau around 2-3 rounds while token cost grows multiplicatively) and is hard-capped at **4** - higher values are clamped with a warning on stderr.
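The clamping behaviour might look roughly like the sketch below. `ROUNDS_MAX = 4` matches the cap stated above; the function name `clamp_rounds` and the wording of the warning are assumptions, and values below 1 are deliberately not handled because the text does not say what happens to them.

```python
import sys

ROUNDS_MAX = 4  # hard cap stated in the text


def clamp_rounds(requested, stream=sys.stderr):
    """Clamp a requested round count to ROUNDS_MAX, warning on `stream`."""
    if requested > ROUNDS_MAX:
        # Hypothetical message; the real warning text is not shown in the source.
        print(
            f"warning: --rounds {requested} exceeds the maximum; using {ROUNDS_MAX}",
            file=stream,
        )
        return ROUNDS_MAX
    return requested
```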
|
|
204
231
|
|
|
205
|
-
**The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit.
|
|
232
|
+
**The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit. After each non-final round the **moderator** reads the debaters' latest answers and replies `DONE` (they've converged or fully aired their disagreement) or `CONTINUE`; a `DONE` stops the debate before the cap.
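The control flow described above can be sketched with plain callables standing in for the agents. This is an illustration under stated assumptions, not MOA's implementation: `run_debate`, `is_done`, and the callable signatures are hypothetical, and in later rounds debater B is assumed to see A's answer from the same round.

```python
def is_done(reply):
    """True when the moderator's reply starts with the word DONE."""
    words = reply.strip().split()
    return bool(words) and words[0] == "DONE"


def run_debate(ask_a, ask_b, ask_moderator, rounds):
    """Run a two-debater exchange and return the transcript.

    ask_a / ask_b:  callables taking the other debater's latest answer
                    (None when answering cold) and returning an answer.
    ask_moderator:  callable taking both latest answers and returning a
                    reply that starts with DONE or CONTINUE.
    """
    transcript = []
    latest_a = latest_b = None
    for round_no in range(1, rounds + 1):
        latest_a = ask_a(latest_b)  # round 1: latest_b is None, so A answers cold
        transcript.append((round_no, "A", latest_a))
        latest_b = ask_b(latest_a)  # B always sees A's latest answer
        transcript.append((round_no, "B", latest_b))
        is_final = round_no == rounds
        # The convergence check runs only after non-final rounds.
        if not is_final and is_done(ask_moderator(latest_a, latest_b)):
            break
    return transcript
```

Passing the agents in as callables keeps the loop testable without launching any CLI: a moderator stub that always returns `"DONE"` ends the debate after round 1, while one that returns `"CONTINUE"` lets it run to the cap.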
|
|
206
233
|
|
|
207
|
-
**The
|
|
234
|
+
**The verdict.** The moderator reads the full transcript - presented **anonymized and order-shuffled** (so brand/position bias is killed, even when the moderator was a debater) - and writes the final answer. Its prompt instructs it to weigh correctness and evidence **above** confidence and fluency. The verdict is the final block (`──── verdict · moderator <name> · ... ────`).
|
|
208
235
|
|
|
209
|
-
**Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the
|
|
236
|
+
**Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the moderator's verdict last. `--json` emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", "moderator": "<name>", ...}` record.
|
|
210
237
|
|
|
211
|
-
**Safety.** Debaters and the
|
|
238
|
+
**Safety.** Debaters and the moderator run in the same read-only (or `--yolo`) mode as the other verbs - there is no permission bypass. agy's partial-sandbox caveat (shell only; it can still edit files) applies here too.
|
|
212
239
|
|
|
213
|
-
> **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters
|
|
240
|
+
> **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters × rounds` calls, plus a moderator check per round and the verdict) **and the least reliably beneficial.** The research is mixed-to-negative: multi-agent debate can converge on a *wrong* answer through conformity, a confident-but-incorrect debater can win on persuasiveness over correctness, and more rounds can entrench an error rather than fix it. The moderator and the adversarial-stance prompt are there to fight these failure modes, but they do not eliminate them. For most questions, `ask` or `distill` is the better default; reach for `debate` when you specifically want to surface and stress-test disagreement. (See *Can LLM Agents Really Debate?* arXiv:2511.07784, *Talk Isn't Always Cheap* arXiv:2509.05396, and the conformity/position-bias work cited in the design notes.)
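To make the cost estimate concrete, here is a back-of-the-envelope upper bound on agent calls. It assumes the convergence check runs only after non-final rounds, as the loop description states, and that the verdict is one extra call; the function name `max_debate_calls` is hypothetical.

```python
def max_debate_calls(debaters=2, rounds=2):
    """Upper bound on agent calls for one debate (no early DONE)."""
    turns = debaters * rounds               # every debater answers each round
    convergence_checks = max(rounds - 1, 0)  # one check after each non-final round
    verdict = 1                              # the moderator's final answer
    return turns + convergence_checks + verdict
```

Under these assumptions the default debate (2 debaters, 2 rounds) costs at most 6 calls and the 4-round cap costs at most 12, compared with one call per agent for `ask`.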
|
|
214
241
|
|
|
215
242
|
### Attribution policy
|
|
216
243
|
|
|
@@ -231,6 +258,13 @@ Invocations below show the default (read-only) flags; `--yolo` swaps in each too
|
|
|
231
258
|
|
|
232
259
|
Adding a new agent is a single entry in the `PROVIDERS` table in `src/moa_cli/cli.py` (executable, default model, command builder, permission flags); it then participates in detection, `-n` selection, and `distill` automatically.
|
|
233
260
|
|
|
261
|
+
## Agent skill
|
|
262
|
+
|
|
263
|
+
If you drive MOA from an agent (e.g. Claude Code), there's a ready-made skill at
|
|
264
|
+
[`skills/moa/SKILL.md`](skills/moa/SKILL.md): it tells an agent when to reach for MOA and
|
|
265
|
+
how to use it (verb choice, self-exclusion via `-x <self>`, parsing the JSONL output). It
|
|
266
|
+
supersedes hand-rolling a "peer review" skill.
|
|
267
|
+
|
|
234
268
|
## Development
|
|
235
269
|
|
|
236
270
|
```bash
|
|
@@ -8,7 +8,7 @@
|
|
|
8
8
|
|
|
9
9
|
# MOA - Mixture of Agents
|
|
10
10
|
|
|
11
|
-
Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds
|
|
11
|
+
Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds while a moderator checks for convergence and writes the verdict.
|
|
12
12
|
|
|
13
13
|
It's a drop-in, batteries-included replacement for hand-rolling parallel `claude -p` / `codex exec` / `opencode run` calls (or a "peer review" agent skill): one command, clean attributed output, made to be called by a human **or** by another agent.
|
|
14
14
|
|
|
@@ -25,17 +25,44 @@ Or run it once without installing:
|
|
|
25
25
|
uvx --from moa-cli moa ask "Review this plan."
|
|
26
26
|
```
|
|
27
27
|
|
|
28
|
+
> **Requirements.** MOA drives agent CLIs you install separately - it ships no model
|
|
29
|
+
> or API key of its own. You need at least two of `claude` (Claude Code), `codex`,
|
|
30
|
+
> `agy` (Antigravity), and `opencode` on your `PATH` and logged in. Run **`moa doctor`**
|
|
31
|
+
> first to see which ones MOA can find; with only one installed, the "council" collapses
|
|
32
|
+
> to a single answer.
|
|
33
|
+
|
|
28
34
|
## Why
|
|
29
35
|
|
|
30
36
|
A single model gives you one perspective. Asking three frontier models the same question - and seeing where they agree, diverge, or contradict - is a fast, cheap way to pressure-test an answer. MOA makes that a one-liner using the CLIs you already pay for, with no API keys of its own.
|
|
31
37
|
|
|
38
|
+
### Example
|
|
39
|
+
|
|
40
|
+
```text
|
|
41
|
+
$ moa ask "Is Postgres or SQLite better for a desktop app?"
|
|
42
|
+
Asking claude, codex, agy (timeout 180s, read-only)
|
|
43
|
+
|
|
44
|
+
──────────────── claude (opus) · OK · 3.2s ────────────────
|
|
45
|
+
|
|
46
|
+
For a single-user desktop app, SQLite is almost always the right call:
|
|
47
|
+
zero-config, serverless, the whole DB is one file you can ship... [trimmed]
|
|
48
|
+
|
|
49
|
+
─────────────── codex (gpt-5.5) · OK · 4.1s ───────────────
|
|
50
|
+
|
|
51
|
+
Use SQLite unless you expect concurrent writers or need network access.
|
|
52
|
+
For a desktop app neither is likely, so SQLite wins on simplicity... [trimmed]
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
The selection note goes to stderr; the attributed answers go to stdout. In a terminal
|
|
56
|
+
each answer gets the rule shown above; when piped or read by another agent, the same
|
|
57
|
+
blocks render as plain `## ...` headings. Add `--json` for machine-readable JSONL.
|
|
58
|
+
|
|
32
59
|
## Usage
|
|
33
60
|
|
|
34
61
|
MOA has three prompt verbs that share the same selection/output options:
|
|
35
62
|
|
|
36
63
|
- **`moa ask PROMPT`** - council / peer review: N agents answer the same prompt in parallel; every answer is returned with attribution, streamed as it lands.
|
|
37
64
|
- **`moa distill PROMPT`** - synthesis: run the council, then one strong aggregator merges the answers into a single unified response.
|
|
38
|
-
- **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds,
|
|
65
|
+
- **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds, with a moderator that checks for convergence between rounds and writes the final verdict. The costliest mode; read the caveats below before reaching for it.
|
|
39
66
|
|
|
40
67
|
```bash
|
|
41
68
|
moa doctor # show installed CLIs and their default models
|
|
@@ -49,12 +76,12 @@ moa ask --json "..." # machine-readable JSONL (for agents
|
|
|
49
76
|
git diff | moa ask -f - "Review this diff." # read the prompt from stdin
|
|
50
77
|
moa distill "Design a rate limiter." # council, then merge into one answer
|
|
51
78
|
moa distill -s codex "..." # pick who distills (auto | random | provider)
|
|
52
|
-
moa debate "Is this race condition real?" # 2 debaters
|
|
79
|
+
moa debate "Is this race condition real?" # 2 debaters; the first also moderates (default 2 agents)
|
|
53
80
|
moa debate -r 3 "..." # more rounds (default 2, hard max 4)
|
|
54
|
-
moa debate
|
|
81
|
+
moa debate --moderator agy "..." # pin a neutral moderator (a non-debater)
|
|
55
82
|
```
|
|
56
83
|
|
|
57
|
-
The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and
|
|
84
|
+
The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and `--moderator`.
|
|
58
85
|
|
|
59
86
|
### Read-only by default
|
|
60
87
|
|
|
@@ -165,17 +192,17 @@ moa config unset num # remove a key
|
|
|
165
192
|
moa config unset model claude # remove one [models] entry
|
|
166
193
|
```
|
|
167
194
|
|
|
168
|
-
The
|
|
195
|
+
The role defaults are persistable too: the distill `synthesizer` and the debate `moderator` (e.g. `moa config set synthesizer codex`, `moa config set moderator agy`). `debate`'s `-r/--rounds` is not persisted. CLI `-m` overrides win per-provider over the config `[models]` table.
|
|
169
196
|
|
|
170
197
|
### Output
|
|
171
198
|
|
|
172
|
-
- **stdout** carries only content
|
|
199
|
+
- **stdout** carries only content. In a terminal, each agent's answer is fronted by a centered box-drawing rule naming it (`──── claude (opus) · OK · 3.5s ────`) with blank lines for separation, flushed the instant that agent finishes. When stdout is **piped or read by an agent** (not a TTY), the same block renders as a plain, low-noise `## claude (opus) · OK · 3.5s` heading instead - no box-drawing. `moa distill` emits only the final merged block.
|
|
173
200
|
- **stderr** carries progress and selection notes (`Asking claude, codex ...`), so piping stdout stays clean.
|
|
174
|
-
- `--json` emits one JSON object per line (JSONL): a `{"type": "response", ...}` record per agent as it completes; `distill`
|
|
201
|
+
- `--json` emits one JSON object per line (JSONL): `ask` writes a `{"type": "response", ...}` record per agent as it completes; `distill` writes a single `{"type": "synthesis", ...}` record (only the merged answer); `debate` writes a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", ...}` record. Ideal when another agent calls MOA and parses the result.
|
|
175
202
|
|
|
176
203
|
### `moa distill` (synthesis)
|
|
177
204
|
|
|
178
|
-
`distill` runs the same council fan-out as `ask`, then one more pass where a strong aggregator merges the collected answers into a single, unified answer. It needs at least two successful proposer answers; with fewer it
|
|
205
|
+
`distill` runs the same council fan-out as `ask`, then one more pass where a strong aggregator merges the collected answers into a single, unified answer. **It returns only that merged answer** - the individual proposer responses are intermediates and are not printed (each one's arrival is noted on stderr so the wait isn't silent). It needs at least two successful proposer answers; with fewer it skips the merge and says so on stderr. The aggregator is chosen with `-s/--synthesizer`:
|
|
179
206
|
|
|
180
207
|
- `auto` (default) - the highest-priority agent that ran (deterministic)
|
|
181
208
|
- `random` - pick one of the agents that ran, at random
|
|
@@ -183,23 +210,23 @@ The synthesizer default is persistable too (e.g. `moa config set synthesizer cod
|
|
|
183
210
|
|
|
184
211
|
The aggregator prompt is adapted from the Mixture-of-Agents "Aggregate-and-Synthesize" prompt (Wang et al. 2024): it tells the aggregator to critically evaluate the inputs (some may be biased or incorrect) and not to simply replicate them but offer a refined, accurate, comprehensive reply.
|
|
185
212
|
|
|
186
|
-
### `moa debate` (sequential debate +
|
|
213
|
+
### `moa debate` (sequential debate + moderator)
|
|
187
214
|
|
|
188
|
-
`debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange
|
|
215
|
+
`debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange overseen by a **moderator** that checks for convergence between rounds and writes the final answer.
|
|
189
216
|
|
|
190
|
-
**Roles.**
|
|
217
|
+
**Roles.** The top **2** selected agents are the debaters. The **moderator** runs the per-round convergence check and writes the verdict; by default it is the top-priority selected agent (so the default 2-agent debate has agent #1 also moderate). Debate only needs **2 agents**; with fewer it exits cleanly rather than silently degrading. For a **neutral** moderator that doesn't also debate, select a third agent and pin it: `moa debate -n 3 --moderator <provider>` (the moderator must be one of the selected agents). The moderator only ever sees the transcript **anonymized + shuffled**, so even when it is itself a debater it can't favour its own answer.
|
|
191
218
|
|
|
192
219
|
**Rounds.** `-r/--rounds` defaults to **2** (gains plateau around 2-3 rounds while token cost grows multiplicatively) and is hard-capped at **4** - higher values are clamped with a warning on stderr.
|
|
193
220
|
|
|
194
|
-
**The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit.
|
|
221
|
+
**The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit. After each non-final round the **moderator** reads the debaters' latest answers and replies `DONE` (they've converged or fully aired their disagreement) or `CONTINUE`; a `DONE` stops the debate before the cap.
|
|
195
222
|
|
|
196
|
-
**The
|
|
223
|
+
**The verdict.** The moderator reads the full transcript - presented **anonymized and order-shuffled** (so brand/position bias is killed, even when the moderator was a debater) - and writes the final answer. Its prompt instructs it to weigh correctness and evidence **above** confidence and fluency. The verdict is the final block (`──── verdict · moderator <name> · ... ────`).
|
|
197
224
|
|
|
198
|
-
**Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the
|
|
225
|
+
**Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the moderator's verdict last. `--json` emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", "moderator": "<name>", ...}` record.
|
|
199
226
|
|
|
200
|
-
**Safety.** Debaters and the
|
|
227
|
+
**Safety.** Debaters and the moderator run in the same read-only (or `--yolo`) mode as the other verbs - there is no permission bypass. agy's partial-sandbox caveat (shell only; it can still edit files) applies here too.
|
|
201
228
|
|
|
202
|
-
> **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters
|
|
229
|
+
> **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters × rounds` calls, plus a moderator check per round and the verdict) **and the least reliably beneficial.** The research is mixed-to-negative: multi-agent debate can converge on a *wrong* answer through conformity, a confident-but-incorrect debater can win on persuasiveness over correctness, and more rounds can entrench an error rather than fix it. The moderator and the adversarial-stance prompt are there to fight these failure modes, but they do not eliminate them. For most questions, `ask` or `distill` is the better default; reach for `debate` when you specifically want to surface and stress-test disagreement. (See *Can LLM Agents Really Debate?* arXiv:2511.07784, *Talk Isn't Always Cheap* arXiv:2509.05396, and the conformity/position-bias work cited in the design notes.)
|
|
203
230
|
|
|
204
231
|
### Attribution policy
|
|
205
232
|
|
|
@@ -220,6 +247,13 @@ Invocations below show the default (read-only) flags; `--yolo` swaps in each too
|
|
|
220
247
|
|
|
221
248
|
Adding a new agent is a single entry in the `PROVIDERS` table in `src/moa_cli/cli.py` (executable, default model, command builder, permission flags); it then participates in detection, `-n` selection, and `distill` automatically.
|
|
222
249
|
|
|
250
|
+
## Agent skill
|
|
251
|
+
|
|
252
|
+
If you drive MOA from an agent (e.g. Claude Code), there's a ready-made skill at
|
|
253
|
+
[`skills/moa/SKILL.md`](skills/moa/SKILL.md): it tells an agent when to reach for MOA and
|
|
254
|
+
how to use it (verb choice, self-exclusion via `-x <self>`, parsing the JSONL output). It
|
|
255
|
+
supersedes hand-rolling a "peer review" skill.
|
|
256
|
+
|
|
223
257
|
## Development
|
|
224
258
|
|
|
225
259
|
```bash
|
|
@@ -377,11 +377,12 @@ def build_synthesis_prompt(
|
|
|
377
377
|
|
|
378
378
|
|
|
379
379
|
# --------------------------------------------------------------------------- #
|
|
380
|
-
# Debate: sequential adversarial rounds,
|
|
381
|
-
#
|
|
382
|
-
#
|
|
383
|
-
#
|
|
384
|
-
#
|
|
380
|
+
# Debate: sequential adversarial rounds, with a moderator that checks for
|
|
381
|
+
# convergence after each round and then writes the verdict from the (anonymized
|
|
382
|
+
# + shuffled) full transcript. The literature is clear that debate is the
|
|
383
|
+
# costliest and least reliably-beneficial mode: it can converge on a wrong answer
|
|
384
|
+
# (conformity), so the verdict prompt weighs correctness/evidence over confidence
|
|
385
|
+
# and fluency, and the anonymization holds even when the moderator also debated.
|
|
385
386
|
# --------------------------------------------------------------------------- #
|
|
386
387
|
|
|
387
388
|
ROUNDS_MAX = 4
|
|
@@ -393,23 +394,16 @@ ADVERSARIAL_INSTRUCTION = """Before giving your own answer, critically examine t
|
|
|
393
394
|
other participant's answer above: identify any errors, weaknesses, unsupported claims, or \
|
|
394
395
|
gaps in reasoning. Do NOT agree merely to reach consensus - only concede a point if it is \
|
|
395
396
|
genuinely correct. Then give your own best, complete answer to the original question, \
|
|
396
|
-
incorporating any valid corrections.
|
|
397
|
+
incorporating any valid corrections."""
|
|
397
398
|
|
|
398
|
-
|
|
399
|
-
|
|
400
|
-
|
|
401
|
-
|
|
402
|
-
|
|
403
|
-
|
|
404
|
-
|
|
405
|
-
|
|
406
|
-
# The neutral judge reads the full transcript (anonymized + shuffled) and writes
|
|
407
|
-
# the final answer. It must weigh correctness/evidence over confidence/fluency -
|
|
408
|
-
# this is where conformity-to-a-wrong-answer is most dangerous, so the judge
|
|
409
|
-
# never just echoes the most fluent or most confident debater.
|
|
410
|
-
JUDGE_PROMPT = """You are a neutral judge. Below is a transcript of a debate between AI coding \
|
|
411
|
-
assistants who answered the user's question and then critiqued each other's answers across \
|
|
412
|
-
several rounds. The participants are anonymized and presented in arbitrary order.
|
|
399
|
+
# The moderator reads the full transcript (anonymized + shuffled) and writes the
|
|
400
|
+
# final answer. It must weigh correctness/evidence over confidence/fluency - this
|
|
401
|
+
# is where conformity-to-a-wrong-answer is most dangerous, so it never just echoes
|
|
402
|
+
# the most fluent or most confident debater.
|
|
403
|
+
MODERATOR_VERDICT_PROMPT = """You are the moderator of this debate. Below is a transcript of a \
|
|
404
|
+
debate between AI coding assistants who answered the user's question and then critiqued each \
|
|
405
|
+
other's answers across several rounds. The participants are anonymized and presented in \
|
|
406
|
+
arbitrary order.
|
|
413
407
|
|
|
414
408
|
Your task is to read the full debate and write the single best, final answer to the user's \
|
|
415
409
|
question. Weigh correctness and the strength of evidence and reasoning ABOVE confidence, \
|
|
@@ -424,43 +418,50 @@ best possible answer.
|
|
|
424
418
|
asserted.
|
|
425
419
|
- Do not invent information that the debate does not support."""
|
|
426
420
|
|
|
421
|
+
# After each non-final round the moderator decides whether another round would
|
|
422
|
+
# materially help. It replies with a single leading word the caller branches on.
|
|
423
|
+
CONVERGENCE_DONE = "DONE"
|
|
424
|
+
MODERATOR_CONVERGENCE_PROMPT = """You are the moderator of this debate. Below are the debaters' \
|
|
425
|
+
latest answers to the user's question, anonymized. Decide whether they have converged on an \
|
|
426
|
+
answer, or at least fully aired and clarified their disagreement, so that another round would \
|
|
427
|
+
add nothing material.
|
|
428
|
+
|
|
429
|
+
Reply with EXACTLY one word on the first line: DONE if the debate should stop now, or CONTINUE \
|
|
430
|
+
if another round would materially improve the final answer. Add nothing else."""
|
|
431
|
+
|
|
427
432
|
|
|
428
433
|
def assign_debate_roles(
|
|
429
|
-
selected: list[Provider],
|
|
434
|
+
selected: list[Provider], moderator: str | None
|
|
430
435
|
) -> tuple[list[Provider], Provider]:
|
|
431
|
-
"""Split the selected providers into (debaters,
|
|
432
|
-
|
|
433
|
-
|
|
434
|
-
|
|
435
|
-
|
|
436
|
-
|
|
437
|
-
|
|
438
|
-
|
|
436
|
+
"""Split the selected providers into (debaters, moderator).
|
|
437
|
+
|
|
438
|
+
The top 2 selected providers debate. The moderator runs the per-round
|
|
439
|
+
convergence check and writes the final verdict; it MAY be one of the debaters.
|
|
440
|
+
`moderator` is "auto" (or None) -> the top-priority selected provider (so the
|
|
441
|
+
default 2-agent debate has agent #1 also moderate), or a provider name that
|
|
442
|
+
must be among the selected providers (pin a non-debating 3rd for a neutral
|
|
443
|
+
moderator). Requires at least 2 selected providers; raises ValueError
|
|
444
|
+
otherwise (the caller turns this into a clean exit - debate never silently
|
|
445
|
+
degrades).
|
|
439
446
|
"""
|
|
440
|
-
if
|
|
441
|
-
names = [p.name for p in selected]
|
|
442
|
-
if judge not in PROVIDERS:
|
|
443
|
-
raise ValueError(f"Unknown judge: {judge}")
|
|
444
|
-
if judge not in names:
|
|
445
|
-
raise ValueError(
|
|
446
|
-
f"Judge {judge!r} is not among the selected providers ({', '.join(names)}). "
|
|
447
|
-
f"Pin it with -p {judge} or widen the selection."
|
|
448
|
-
)
|
|
449
|
-
judge_provider = next(p for p in selected if p.name == judge)
|
|
450
|
-
debaters = [p for p in selected if p.name != judge]
|
|
451
|
-
if len(debaters) < 2:
|
|
452
|
-
raise ValueError(
|
|
453
|
-
f"debate needs at least 2 debaters plus the judge ({judge}); only "
|
|
454
|
-
f"{len(debaters)} non-judge provider(s) available. Increase -n or -p."
|
|
455
|
-
)
|
|
456
|
-
return debaters, judge_provider
|
|
457
|
-
|
|
458
|
-
if len(selected) < 3:
|
|
447
|
+
if len(selected) < 2:
|
|
459
448
|
raise ValueError(
|
|
460
|
-
f"debate needs at least
|
|
461
|
-
f"
|
|
449
|
+
f"debate needs at least 2 providers (2 debaters); only {len(selected)} available. "
|
|
450
|
+
f"Increase -n, pin more with -p, or install more agents."
|
|
462
451
|
)
|
|
463
|
-
|
|
452
|
+
debaters = selected[:2]
|
|
453
|
+
if moderator in (None, "auto"):
|
|
454
|
+
return debaters, selected[0]
|
|
455
|
+
|
|
456
|
+
names = [p.name for p in selected]
|
|
457
|
+
if moderator not in PROVIDERS:
|
|
458
|
+
raise ValueError(f"Unknown moderator: {moderator}")
|
|
459
|
+
if moderator not in names:
|
|
460
|
+
raise ValueError(
|
|
461
|
+
f"Moderator {moderator!r} is not among the selected providers ({', '.join(names)}). "
|
|
462
|
+
f"Pin it with -p {moderator} or widen the selection."
|
|
463
|
+
)
|
|
464
|
+
return debaters, next(p for p in selected if p.name == moderator)
|
|
464
465
|
|
|
465
466
|
|
|
466
467
|
def clamp_rounds(rounds: int) -> tuple[int, str | None]:
|
|
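The role split added in `assign_debate_roles` can be exercised on its own. The following is a minimal sketch, not the package's code: `Provider` is a stand-in dataclass with only a `name` field, the `PROVIDERS` tuple is a hypothetical registry, and the error messages are shortened. The branching mirrors the added lines in this hunk.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Provider:
    # Stand-in for the package's Provider type; only `name` is assumed here.
    name: str


# Hypothetical registry of known provider names (the real one is outside this hunk).
PROVIDERS = ("claude", "codex", "agy", "opencode")


def assign_debate_roles(
    selected: List[Provider], moderator: Optional[str]
) -> Tuple[List[Provider], Provider]:
    """Split the selection into (debaters, moderator), as in the hunk above."""
    if len(selected) < 2:
        raise ValueError(f"debate needs at least 2 providers; only {len(selected)} available.")
    debaters = selected[:2]  # the top 2 selected providers always debate
    if moderator in (None, "auto"):
        return debaters, selected[0]  # agent #1 debates and moderates
    names = [p.name for p in selected]
    if moderator not in PROVIDERS:
        raise ValueError(f"Unknown moderator: {moderator}")
    if moderator not in names:
        raise ValueError(f"Moderator {moderator!r} is not among the selected providers.")
    return debaters, next(p for p in selected if p.name == moderator)


selected = [Provider("claude"), Provider("codex"), Provider("opencode")]
auto_debaters, auto_moderator = assign_debate_roles(selected, None)
pinned_debaters, pinned_moderator = assign_debate_roles(selected, "opencode")
```

With three providers selected, pinning the third as moderator leaves the first two as debaters, which is the "neutral moderator" setup the docstring describes.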
@@ -496,18 +497,19 @@ def build_debate_turn_prompt(
|
|
|
496
497
|
)
|
|
497
498
|
|
|
498
499
|
|
|
499
|
-
def
|
|
500
|
+
def build_verdict_prompt(
|
|
500
501
|
question: str,
|
|
501
502
|
transcript: list[RunResult],
|
|
502
503
|
rng: random.Random | None = None,
|
|
503
504
|
) -> tuple[str, dict[str, str]]:
|
|
504
|
-
"""Build the
|
|
505
|
-
|
|
506
|
-
|
|
507
|
-
|
|
508
|
-
|
|
509
|
-
|
|
510
|
-
|
|
505
|
+
"""Build the moderator's final-verdict prompt from the transcript, anonymized
|
|
506
|
+
+ shuffled.
|
|
507
|
+
|
|
508
|
+
The transcript is the per-turn RunResults; the moderator sees only the final
|
|
509
|
+
answer text of each turn, relabelled "Participant 1/2/.." in shuffled order so
|
|
510
|
+
brand/position bias is killed - this matters even when the moderator is itself
|
|
511
|
+
a debater, since it can't tell which answer is its own. The label_map maps each
|
|
512
|
+
label back to the real provider for the caller, though debate never reveals it.
|
|
511
513
|
"""
|
|
512
514
|
turns = [r for r in transcript if r.status == "ok"]
|
|
513
515
|
shuffled = list(turns)
|
|
@@ -519,13 +521,27 @@ def build_judge_prompt(
|
|
|
519
521
|
sections.append(f"### {label}\n\n{result.stdout.strip()}")
|
|
520
522
|
label_map[label] = result.provider
|
|
521
523
|
prompt = (
|
|
522
|
-
f"{
|
|
524
|
+
f"{MODERATOR_VERDICT_PROMPT}\n\n"
|
|
523
525
|
f"## User question\n\n{question}\n\n"
|
|
524
526
|
f"## Debate transcript\n\n" + "\n\n".join(sections) + "\n\n## Your final answer\n"
|
|
525
527
|
)
|
|
526
528
|
return prompt, label_map
|
|
527
529
|
|
|
528
530
|
|
|
531
|
+
def build_convergence_prompt(question: str, latest: list[RunResult]) -> str:
|
|
532
|
+
"""The moderator's per-round convergence check. `latest` is the debaters' most
|
|
533
|
+
recent answers, anonymized so the moderator judges substance over brand. The
|
|
534
|
+
expected reply starts with DONE (stop) or CONTINUE (another round helps)."""
|
|
535
|
+
answers = "\n\n".join(
|
|
536
|
+
f"### Participant {i + 1}\n\n{r.stdout.strip()}" for i, r in enumerate(latest)
|
|
537
|
+
)
|
|
538
|
+
return (
|
|
539
|
+
f"{MODERATOR_CONVERGENCE_PROMPT}\n\n"
|
|
540
|
+
f"## User question\n\n{question}\n\n"
|
|
541
|
+
f"## The debaters' latest answers\n\n{answers}\n\n## Your decision\n"
|
|
542
|
+
)
|
|
543
|
+
|
|
544
|
+
|
|
529
545
|
# --------------------------------------------------------------------------- #
|
|
530
546
|
# Render: stdout carries content (Markdown or JSONL); stderr carries progress.
|
|
531
547
|
# --------------------------------------------------------------------------- #
|
|
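The convergence prompt asks the moderator for a single leading word, so the caller only has to inspect the start of the reply. The sketch below shows that check in isolation. The helper name `signals_done` and the `"error"` status value are assumptions for illustration; only `CONVERGENCE_DONE` and the `"ok"` status appear in the diff.

```python
CONVERGENCE_DONE = "DONE"


def signals_done(status: str, stdout: str) -> bool:
    """True only for a successful reply whose text starts with DONE.

    Whitespace and letter case are normalized first, so "  done\n" still counts.
    A failed run or a CONTINUE reply keeps the debate going.
    """
    return status == "ok" and stdout.strip().upper().startswith(CONVERGENCE_DONE)


checks = {
    "clean": signals_done("ok", "DONE"),
    "lowercase_padded": signals_done("ok", "  done\n"),
    "continue": signals_done("ok", "CONTINUE"),
    "failed_run": signals_done("error", "DONE"),  # hypothetical failure status
}
```

Because this is a prefix match, a reply such as "DONE, but one caveat" also stops the debate; the prompt's "Add nothing else" instruction is what keeps replies to a bare word.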
@@ -560,21 +576,36 @@ def _body(result: RunResult) -> list[str]:
|
|
|
560
576
|
return ["```text", detail[-1200:], "```", ""]
|
|
561
577
|
|
|
562
578
|
|
|
563
|
-
def
|
|
564
|
-
"""
|
|
565
|
-
|
|
579
|
+
def _plain_output() -> bool:
|
|
580
|
+
"""True when stdout is not an interactive terminal - piped, redirected, or
|
|
581
|
+
read by another agent (the common "an agent shells out to moa" case). There
|
|
582
|
+
we drop the decorative box-drawing rule and extra blank lines for a plain,
|
|
583
|
+
low-noise `## label` heading that is cheaper for a model to consume."""
|
|
584
|
+
return not sys.stdout.isatty()
|
|
585
|
+
|
|
586
|
+
|
|
587
|
+
def _render(label: str, result: RunResult, plain: bool) -> str:
|
|
588
|
+
"""One answer block. In a terminal: two leading blank lines and a centered
|
|
589
|
+
box-drawing rule, for clear visual separation as blocks stream in. When
|
|
590
|
+
piped: a plain `## label` heading with a single blank line, no box-drawing."""
|
|
591
|
+
if plain:
|
|
592
|
+
return "\n".join(["", f"## {label}", "", *_body(result)])
|
|
566
593
|
return "\n".join(["", "", _rule(label), "", *_body(result)])
|
|
567
594
|
|
|
568
595
|
|
|
569
|
-
def render_block(result: RunResult) -> str:
|
|
596
|
+
def render_block(result: RunResult, plain: bool | None = None) -> str:
|
|
597
|
+
if plain is None:
|
|
598
|
+
plain = _plain_output()
|
|
570
599
|
model = f" ({result.model})" if result.model else ""
|
|
571
600
|
label = f"{result.provider}{model} · {_status_label(result.status)} · {result.elapsed:.1f}s"
|
|
572
|
-
return _render(label, result)
|
|
601
|
+
return _render(label, result, plain)
|
|
573
602
|
|
|
574
603
|
|
|
575
|
-
def render_synthesis_block(result: RunResult, synthesizer: str) -> str:
|
|
604
|
+
def render_synthesis_block(result: RunResult, synthesizer: str, plain: bool | None = None) -> str:
|
|
605
|
+
if plain is None:
|
|
606
|
+
plain = _plain_output()
|
|
576
607
|
label = f"synthesis · via {synthesizer} · {_status_label(result.status)} · {result.elapsed:.1f}s"
|
|
577
|
-
return _render(label, result)
|
|
608
|
+
return _render(label, result, plain)
|
|
578
609
|
|
|
579
610
|
|
|
580
611
|
def result_record(result: RunResult) -> dict:
|
|
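The terminal-versus-pipe switch above can be tried without a real terminal. This is a small sketch under stated assumptions: `plain_output` takes an optional stream argument (the package's `_plain_output` reads `sys.stdout` directly), and the `rule` string is a hypothetical stand-in for `_rule(label)`, whose definition is not part of this diff.

```python
import io
import sys
from typing import List


def plain_output(stream=None) -> bool:
    """True when the stream is not an interactive terminal (piped or redirected)."""
    stream = sys.stdout if stream is None else stream
    return not stream.isatty()


def render(label: str, body: List[str], plain: bool) -> str:
    if plain:
        # Piped: one leading blank line and a plain Markdown heading.
        return "\n".join(["", f"## {label}", "", *body])
    rule = f"──── {label} ────"  # hypothetical stand-in for _rule(label)
    # Terminal: two leading blank lines and a decorative rule.
    return "\n".join(["", "", rule, "", *body])


piped = io.StringIO()  # StringIO.isatty() returns False, like a pipe
plain_block = render("codex · ok · 1.2s", ["answer text"], plain_output(piped))
fancy_block = render("codex · ok · 1.2s", ["answer text"], plain=False)
```

Keeping `plain` as an explicit parameter with a `None` default, as the diff does, lets tests force either style without patching `sys.stdout`.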
@@ -601,18 +632,22 @@ def synthesis_record(result: RunResult, synthesizer: str) -> dict:
|
|
|
601
632
|
}
|
|
602
633
|
|
|
603
634
|
|
|
604
|
-
def render_debate_turn_block(result: RunResult, round_num: int) -> str:
|
|
635
|
+
def render_debate_turn_block(result: RunResult, round_num: int, plain: bool | None = None) -> str:
|
|
636
|
+
if plain is None:
|
|
637
|
+
plain = _plain_output()
|
|
605
638
|
model = f" ({result.model})" if result.model else ""
|
|
606
639
|
label = (
|
|
607
640
|
f"round {round_num} · {result.provider}{model} · "
|
|
608
641
|
f"{_status_label(result.status)} · {result.elapsed:.1f}s"
|
|
609
642
|
)
|
|
610
|
-
return _render(label, result)
|
|
643
|
+
return _render(label, result, plain)
|
|
611
644
|
|
|
612
645
|
|
|
613
|
-
def
|
|
614
|
-
|
|
615
|
-
|
|
646
|
+
def render_verdict_block(result: RunResult, moderator: str, plain: bool | None = None) -> str:
|
|
647
|
+
if plain is None:
|
|
648
|
+
plain = _plain_output()
|
|
649
|
+
label = f"verdict · moderator {moderator} · {_status_label(result.status)} · {result.elapsed:.1f}s"
|
|
650
|
+
return _render(label, result, plain)
|
|
616
651
|
|
|
617
652
|
|
|
618
653
|
def debate_turn_record(result: RunResult, round_num: int) -> dict:
|
|
@@ -629,10 +664,10 @@ def debate_turn_record(result: RunResult, round_num: int) -> dict:
|
|
|
629
664
|
}
|
|
630
665
|
|
|
631
666
|
|
|
632
|
-
def
|
|
667
|
+
def verdict_record(result: RunResult, moderator: str) -> dict:
|
|
633
668
|
return {
|
|
634
669
|
"type": "verdict",
|
|
635
|
-
"
|
|
670
|
+
"moderator": moderator,
|
|
636
671
|
"status": result.status,
|
|
637
672
|
"elapsed": round(result.elapsed, 3),
|
|
638
673
|
"text": result.stdout,
|
|
@@ -653,12 +688,21 @@ def judge_record(result: RunResult, judge: str) -> dict:
|
|
|
653
688
|
|
|
654
689
|
# Scalar config keys and the type each maps to. `exclude` (list[str]) and the
|
|
655
690
|
# `[models]` table are handled separately because they aren't plain scalars.
|
|
656
|
-
_CONFIG_SCALARS: dict[str, type] = {"num": int, "timeout": float, "synthesizer": str}
|
|
691
|
+
_CONFIG_SCALARS: dict[str, type] = {"num": int, "timeout": float, "synthesizer": str, "moderator": str}
|
|
657
692
|
_CONFIG_KEYS: tuple[str, ...] = (*_CONFIG_SCALARS, "exclude", "models")
|
|
658
693
|
# Synthesizer accepts the special modes plus any known provider name.
|
|
659
694
|
_SYNTHESIZER_MODES: tuple[str, ...] = ("auto", "first", "random")
|
|
695
|
+
# Moderator accepts "auto" (the top-priority selected agent) or a provider name.
|
|
696
|
+
_MODERATOR_MODES: tuple[str, ...] = ("auto",)
|
|
660
697
|
# The built-in defaults, shown by `config show` when a key isn't in the file.
|
|
661
|
-
_CONFIG_DEFAULTS: dict = {
|
|
698
|
+
_CONFIG_DEFAULTS: dict = {
|
|
699
|
+
"num": 3,
|
|
700
|
+
"timeout": 180.0,
|
|
701
|
+
"synthesizer": "auto",
|
|
702
|
+
"moderator": "auto",
|
|
703
|
+
"exclude": [],
|
|
704
|
+
"models": {},
|
|
705
|
+
}
|
|
662
706
|
|
|
663
707
|
|
|
664
708
|
def config_dir() -> Path:
|
|
@@ -690,6 +734,9 @@ def _validate_scalar(key: str, value) -> None:
|
|
|
690
734
|
if key == "synthesizer" and value not in (*_SYNTHESIZER_MODES, *PROVIDERS):
|
|
691
735
|
allowed = ", ".join((*_SYNTHESIZER_MODES, *PROVIDERS))
|
|
692
736
|
raise ValueError(f"synthesizer must be one of: {allowed}.")
|
|
737
|
+
if key == "moderator" and value not in (*_MODERATOR_MODES, *PROVIDERS):
|
|
738
|
+
allowed = ", ".join((*_MODERATOR_MODES, *PROVIDERS))
|
|
739
|
+
raise ValueError(f"moderator must be one of: {allowed}.")
|
|
693
740
|
|
|
694
741
|
|
|
695
742
|
def load_config() -> dict:
|
|
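The new `moderator` key is validated the same way as `synthesizer`: the value must be a special mode or a known provider name. A minimal sketch of that rule follows, assuming a hypothetical `PROVIDERS` tuple and a standalone helper named `validate_moderator` (in the package the check sits inside `_validate_scalar`).

```python
_MODERATOR_MODES = ("auto",)
# Hypothetical provider registry; the real PROVIDERS is defined elsewhere in the package.
PROVIDERS = ("claude", "codex", "agy", "opencode")


def validate_moderator(value: str) -> None:
    """Raise ValueError unless value is "auto" or a known provider name."""
    if value not in (*_MODERATOR_MODES, *PROVIDERS):
        allowed = ", ".join((*_MODERATOR_MODES, *PROVIDERS))
        raise ValueError(f"moderator must be one of: {allowed}.")


validate_moderator("auto")   # accepted: special mode
validate_moderator("codex")  # accepted: known provider

try:
    validate_moderator("gpt")
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```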
@@ -767,6 +814,8 @@ def serialize_config(config: dict) -> str:
|
|
|
767
814
|
lines.append(f"timeout = {int(timeout) if timeout.is_integer() else timeout!r}")
|
|
768
815
|
if "synthesizer" in config:
|
|
769
816
|
lines.append(f"synthesizer = {_toml_str(config['synthesizer'])}")
|
|
817
|
+
if "moderator" in config:
|
|
818
|
+
lines.append(f"moderator = {_toml_str(config['moderator'])}")
|
|
770
819
|
if "exclude" in config:
|
|
771
820
|
items = ", ".join(_toml_str(v) for v in config["exclude"])
|
|
772
821
|
lines.append(f"exclude = [{items}]")
|
|
@@ -877,11 +926,20 @@ async def _collect(
|
|
|
877
926
|
json_output: bool,
|
|
878
927
|
models: dict[str, str] | None = None,
|
|
879
928
|
yolo: bool = False,
|
|
929
|
+
emit_blocks: bool = True,
|
|
880
930
|
) -> list[RunResult]:
|
|
931
|
+
"""Gather every agent's result. With emit_blocks (ask), each complete answer
|
|
932
|
+
is flushed to stdout the instant it arrives. Without it (distill), the
|
|
933
|
+
individual answers are intermediates the user shouldn't see - only the final
|
|
934
|
+
distilled block is content - so we keep stdout clean and just heartbeat each
|
|
935
|
+
arrival to stderr so a multi-agent run doesn't look frozen while it waits."""
|
|
881
936
|
results: list[RunResult] = []
|
|
882
937
|
async for result in stream(providers, prompt, timeout, models, yolo):
|
|
883
938
|
results.append(result)
|
|
884
|
-
|
|
939
|
+
if emit_blocks:
|
|
940
|
+
_emit(json.dumps(result_record(result)) if json_output else render_block(result))
|
|
941
|
+
else:
|
|
942
|
+
_note(f" {result.provider} responded ({_status_label(result.status)}, {result.elapsed:.1f}s)")
|
|
885
943
|
return results
|
|
886
944
|
|
|
887
945
|
|
|
@@ -945,6 +1003,7 @@ def resolve_run(
|
|
|
945
1003
|
timeout: float | None,
|
|
946
1004
|
json_output: bool,
|
|
947
1005
|
yolo: bool,
|
|
1006
|
+
default_num: int = 3,
|
|
948
1007
|
) -> RunConfig:
|
|
949
1008
|
"""Resolve the shared options into a RunConfig, emitting the selection note.
|
|
950
1009
|
|
|
@@ -953,7 +1012,9 @@ def resolve_run(
|
|
|
953
1012
|
flag), parse model overrides, select providers, and print the stderr
|
|
954
1013
|
selection note (including agy's honest partial-protection note). Every verb
|
|
955
1014
|
picks up config defaults identically because the merge lives only here.
|
|
956
|
-
|
|
1015
|
+
`default_num` is the built-in fallback when neither flag nor config sets num
|
|
1016
|
+
(debate passes 2, since it only needs 2 agents). Raises typer.BadParameter on
|
|
1017
|
+
bad input and typer.Exit(1) when nothing runs.
|
|
957
1018
|
"""
|
|
958
1019
|
prompt_text = _read_prompt(prompt, file)
|
|
959
1020
|
if not prompt_text:
|
|
@@ -965,7 +1026,7 @@ def resolve_run(
|
|
|
965
1026
|
except ValueError as exc:
|
|
966
1027
|
raise typer.BadParameter(f"{config_path()}: {exc}") from exc
|
|
967
1028
|
|
|
968
|
-
num = resolve_option(num, "num", config,
|
|
1029
|
+
num = resolve_option(num, "num", config, default_num)
|
|
969
1030
|
timeout = resolve_option(timeout, "timeout", config, 180.0)
|
|
970
1031
|
# Repeatable flags are an empty list when omitted, not None, so treat empty
|
|
971
1032
|
# as "fall back to config" for exclude.
|
|
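The `default_num` parameter only changes the last step of the precedence that the source comments describe (CLI flag > config file > built-in). The body of `resolve_option` is not shown in this diff, so the function below is a hypothetical re-implementation with the same argument order, meant only to illustrate that precedence.

```python
from typing import Any, Optional


def resolve_option_sketch(flag: Optional[Any], key: str, config: dict, default: Any) -> Any:
    """Pick the first value that is set: CLI flag, then config file, then built-in."""
    if flag is not None:
        return flag
    if key in config:
        return config[key]
    return default


# debate passes a built-in default of 2; the other verbs keep 3.
debate_default = resolve_option_sketch(None, "num", {}, 2)
config_wins = resolve_option_sketch(None, "num", {"num": 4}, 2)
flag_wins = resolve_option_sketch(5, "num", {"num": 4}, 2)
```

Under this reading, a user who has `num` in the config file gets that value for `debate` as well; the built-in 2 applies only when neither flag nor config sets it.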
@@ -1051,8 +1112,13 @@ def distill(
|
|
|
1051
1112
|
# it merges through the same precedence: CLI flag > config file > built-in.
|
|
1052
1113
|
synthesizer = resolve_option(synthesizer, "synthesizer", _read_config_or_empty(), "auto")
|
|
1053
1114
|
|
|
1115
|
+
# distill returns only the merged answer, so the proposer responses are
|
|
1116
|
+
# intermediates: collect them without printing each to stdout.
|
|
1054
1117
|
results = asyncio.run(
|
|
1055
|
-
_collect(
|
|
1118
|
+
_collect(
|
|
1119
|
+
cfg.selected, cfg.prompt, cfg.timeout, cfg.json_output, cfg.models, cfg.yolo,
|
|
1120
|
+
emit_blocks=False,
|
|
1121
|
+
)
|
|
1056
1122
|
)
|
|
1057
1123
|
successes = [r for r in results if r.status == "ok"]
|
|
1058
1124
|
|
|
@@ -1098,9 +1164,12 @@ def _run_synthesis(
|
|
|
1098
1164
|
RoundsOpt = Annotated[
|
|
1099
1165
|
int, typer.Option("--rounds", "-r", help=f"Debate rounds (default 2, hard max {ROUNDS_MAX}).")
|
|
1100
1166
|
]
|
|
1101
|
-
|
|
1167
|
+
ModeratorOpt = Annotated[
|
|
1102
1168
|
str | None,
|
|
1103
|
-
typer.Option(
|
|
1169
|
+
typer.Option(
|
|
1170
|
+
"--moderator", "-j",
|
|
1171
|
+
help="Moderator that checks convergence and writes the verdict: auto | a provider.",
|
|
1172
|
+
),
|
|
1104
1173
|
]
|
|
1105
1174
|
|
|
1106
1175
|
|
|
@@ -1114,60 +1183,80 @@ def debate(
|
|
|
1114
1183
|
file: FileOpt = None,
|
|
1115
1184
|
timeout: TimeoutOpt = None,
|
|
1116
1185
|
rounds: RoundsOpt = 2,
|
|
1117
|
-
|
|
1186
|
+
moderator: ModeratorOpt = None,
|
|
1118
1187
|
json_output: JsonOpt = False,
|
|
1119
1188
|
yolo: YoloOpt = False,
|
|
1120
1189
|
) -> None:
|
|
1121
|
-
"""Debate: debaters answer and critique each other across rounds; a
|
|
1122
|
-
|
|
1190
|
+
"""Debate: two debaters answer and critique each other across rounds; a moderator checks convergence and writes the verdict."""
|
|
1191
|
+
# Debate only needs 2 agents (the moderator may also be a debater), so its
|
|
1192
|
+
# built-in default selection is 2, not the usual 3.
|
|
1193
|
+
cfg = resolve_run(
|
|
1194
|
+
prompt, file, num, provider, exclude, model, timeout, json_output, yolo, default_num=2
|
|
1195
|
+
)
|
|
1196
|
+
|
|
1197
|
+
# moderator is verb-specific (like distill's synthesizer) but persistable, so
|
|
1198
|
+
# it merges through the same precedence: CLI flag > config file > built-in.
|
|
1199
|
+
moderator = resolve_option(moderator, "moderator", _read_config_or_empty(), "auto")
|
|
1123
1200
|
|
|
1124
1201
|
rounds, warning = clamp_rounds(rounds)
|
|
1125
1202
|
if warning:
|
|
1126
1203
|
_note(warning)
|
|
1127
1204
|
|
|
1128
1205
|
try:
|
|
1129
|
-
debaters,
|
|
1206
|
+
debaters, moderator_provider = assign_debate_roles(cfg.selected, moderator)
|
|
1130
1207
|
except ValueError as exc:
|
|
1131
1208
|
_note(f"debate: {exc}")
|
|
1132
1209
|
raise typer.Exit(code=1) from exc
|
|
1133
1210
|
|
|
1134
1211
|
_note(
|
|
1135
|
-
f"Debating: {', '.join(p.name for p in debaters)} over {rounds} round(s), "
|
|
1136
|
-
f"
|
|
1137
|
-
f"
|
|
1212
|
+
f"Debating: {', '.join(p.name for p in debaters)} over up to {rounds} round(s), "
|
|
1213
|
+
f"moderator {moderator_provider.name}. Debate is the costliest mode and can "
|
|
1214
|
+
f"converge on a wrong answer."
|
|
1138
1215
|
)
|
|
1139
1216
|
|
|
1140
|
-
transcript = asyncio.run(_run_debate(cfg, debaters,
|
|
1217
|
+
transcript = asyncio.run(_run_debate(cfg, debaters, moderator_provider, rounds))
|
|
1141
1218
|
if not any(r.status == "ok" for r in transcript):
|
|
1142
1219
|
raise typer.Exit(code=1)
|
|
1143
1220
|
|
|
1144
1221
|
|
|
1145
|
-
def
|
|
1146
|
-
|
|
1147
|
-
|
|
1222
|
+
async def _moderator_signals_done(
|
|
1223
|
+
cfg: RunConfig, moderator: Provider, latest_ok: list[RunResult], round_num: int
|
|
1224
|
+
) -> bool:
|
|
1225
|
+
"""Ask the moderator whether the debate has converged. Returns True (stop)
|
|
1226
|
+
only on a clean DONE reply; a failed or CONTINUE check keeps debating."""
|
|
1227
|
+
prompt = build_convergence_prompt(cfg.prompt, latest_ok)
|
|
1228
|
+
_note(f"Round {round_num}: moderator {moderator.name} checking for convergence...")
|
|
1229
|
+
result = await run_provider(
|
|
1230
|
+
moderator, prompt, cfg.timeout, cfg.models.get(moderator.name), cfg.yolo
|
|
1231
|
+
)
|
|
1232
|
+
done = result.status == "ok" and result.stdout.strip().upper().startswith(CONVERGENCE_DONE)
|
|
1233
|
+
if done:
|
|
1234
|
+
_note(f"Moderator {moderator.name}: converged; stopping after round {round_num}.")
|
|
1235
|
+
return done
|
|
1148
1236
|
|
|
1149
1237
|
|
|
1150
1238
|
async def _run_debate(
|
|
1151
1239
|
cfg: RunConfig,
|
|
1152
1240
|
debaters: list[Provider],
|
|
1153
|
-
|
|
1241
|
+
moderator: Provider,
|
|
1154
1242
|
rounds: int,
|
|
1155
1243
|
) -> list[RunResult]:
|
|
1156
|
-
"""Run the sequential debate, then the
|
|
1157
|
-
|
|
1158
|
-
|
|
1159
|
-
|
|
1160
|
-
|
|
1161
|
-
|
|
1162
|
-
|
|
1163
|
-
|
|
1164
|
-
the verdict last
|
|
1244
|
+
"""Run the sequential debate, then the moderator's verdict. Returns the full
|
|
1245
|
+
transcript.
|
|
1246
|
+
|
|
1247
|
+
Each debater keeps its latest answer in `latest`. A turn shows the debater the
|
|
1248
|
+
OTHER debaters' latest answers (anonymized) plus the adversarial instruction;
|
|
1249
|
+
the very first turn (no priors yet) is a cold answer. Turns stream as they
|
|
1250
|
+
complete (stderr progress + stdout/JSON block). After each non-final round the
|
|
1251
|
+
moderator decides whether the debate has converged and can stop early. The
|
|
1252
|
+
moderator then reads the blind+shuffled transcript and writes the verdict last
|
|
1253
|
+
(it may itself be a debater - the anonymization stops it favouring its own
|
|
1254
|
+
answer).
|
|
1165
1255
|
"""
|
|
1166
1256
|
transcript: list[RunResult] = []
|
|
1167
1257
|
latest: dict[str, RunResult] = {}
|
|
1168
1258
|
|
|
1169
1259
|
for round_num in range(1, rounds + 1):
|
|
1170
|
-
converged_this_round = True
|
|
1171
1260
|
for debater in debaters:
|
|
1172
1261
|
prior = [
|
|
1173
1262
|
("the other participant", latest[other.name].stdout)
|
|
@@ -1186,34 +1275,36 @@ async def _run_debate(
|
|
|
1186
1275
|
if cfg.json_output
|
|
1187
1276
|
else render_debate_turn_block(result, round_num)
|
|
1188
1277
|
)
|
|
1189
|
-
# A debater that errors out is not "converged"; only an explicit
|
|
1190
|
-
# no-change signal counts toward an early stop.
|
|
1191
|
-
if not _signals_convergence(result):
|
|
1192
|
-
converged_this_round = False
|
|
1193
1278
|
|
|
1194
|
-
#
|
|
1195
|
-
#
|
|
1196
|
-
if round_num
|
|
1197
|
-
|
|
1198
|
-
|
|
1279
|
+
# After each non-final round, let the moderator stop early if the debaters
|
|
1280
|
+
# have converged. Needs both debaters' latest answers to compare.
|
|
1281
|
+
if round_num < rounds:
|
|
1282
|
+
latest_ok = [
|
|
1283
|
+
latest[d.name] for d in debaters
|
|
1284
|
+
if d.name in latest and latest[d.name].status == "ok"
|
|
1285
|
+
]
|
|
1286
|
+
if len(latest_ok) >= 2 and await _moderator_signals_done(
|
|
1287
|
+
cfg, moderator, latest_ok, round_num
|
|
1288
|
+
):
|
|
1289
|
+
break
|
|
1199
1290
|
|
|
1200
1291
|
if not any(r.status == "ok" for r in transcript):
|
|
1201
|
-
_note("Debate produced no usable answers; skipping
|
|
1292
|
+
_note("Debate produced no usable answers; skipping the moderator verdict.")
|
|
1202
1293
|
return transcript
|
|
1203
1294
|
|
|
1204
|
-
# The
|
|
1205
|
-
# judging;
|
|
1206
|
-
#
|
|
1207
|
-
|
|
1208
|
-
_note(f"
|
|
1295
|
+
# The moderator always sees the transcript anonymized + shuffled (a model is
|
|
1296
|
+
# judging; no toggle). It runs in the same read-only / --yolo mode as the
|
|
1297
|
+
# debaters - no permission bypass.
|
|
1298
|
+
verdict_prompt, _label_map = build_verdict_prompt(cfg.prompt, transcript)
|
|
1299
|
+
_note(f"Moderator {moderator.name} writing the final answer...")
|
|
1209
1300
|
verdict = await run_provider(
|
|
1210
|
-
|
|
1301
|
+
moderator, verdict_prompt, cfg.timeout, cfg.models.get(moderator.name), cfg.yolo
|
|
1211
1302
|
)
|
|
1212
1303
|
transcript.append(verdict)
|
|
1213
1304
|
_emit(
|
|
1214
|
-
json.dumps(
|
|
1305
|
+
json.dumps(verdict_record(verdict, moderator.name))
|
|
1215
1306
|
if cfg.json_output
|
|
1216
|
-
else
|
|
1307
|
+
else render_verdict_block(verdict, moderator.name)
|
|
1217
1308
|
)
|
|
1218
1309
|
return transcript
|
|
1219
1310
|
|
|
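The early-stop control flow in `_run_debate` reduces to a small loop: every round runs the debaters, and the moderator is consulted only after non-final rounds. The sketch below is a synchronous simplification, not the package's code. `run_rounds` and the `moderator_done` callback are hypothetical names, and it omits the real loop's extra condition that both debaters must have an `ok` latest answer before the check runs.

```python
from typing import Callable, List


def run_rounds(rounds: int, moderator_done: Callable[[int], bool]) -> List[int]:
    """Return the round numbers that actually ran before the verdict."""
    ran: List[int] = []
    for round_num in range(1, rounds + 1):
        ran.append(round_num)  # both debaters take their turn in this round
        # No check after the final round: the verdict follows regardless.
        if round_num < rounds and moderator_done(round_num):
            break
    return ran


stops_early = run_rounds(4, lambda n: n == 2)      # moderator says DONE after round 2
never_converges = run_rounds(3, lambda n: False)   # moderator always says CONTINUE
single_round = run_rounds(1, lambda n: True)       # the check is never consulted
```

This is why the progress note now reads "over up to N round(s)": the configured round count is an upper bound, not a fixed number.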
@@ -1280,7 +1371,7 @@ def config_show() -> None:
|
|
|
1280
1371
|
|
|
1281
1372
|
@config_app.command("set")
|
|
1282
1373
|
def config_set(
|
|
1283
|
-
key: Annotated[str, typer.Argument(help="Config key: num | timeout | synthesizer | exclude | model.")],
|
|
1374
|
+
key: Annotated[str, typer.Argument(help="Config key: num | timeout | synthesizer | moderator | exclude | model.")],
|
|
1284
1375
|
value: Annotated[str, typer.Argument(help="Value. For models: PROVIDER=MODEL. For exclude: comma-separated names.")],
|
|
1285
1376
|
) -> None:
|
|
1286
1377
|
"""Write a value to the config file, creating the dir/file if missing."""
|
|
@@ -1313,7 +1404,7 @@ def config_set(
|
|
|
1313
1404
|
raise typer.BadParameter(str(exc)) from exc
|
|
1314
1405
|
config[key] = coerced
|
|
1315
1406
|
else:
|
|
1316
|
-
known = "num, timeout, synthesizer, exclude, model"
|
|
1407
|
+
known = "num, timeout, synthesizer, moderator, exclude, model"
|
|
1317
1408
|
raise typer.BadParameter(f"Unknown config key: {key!r}. Known: {known}.")
|
|
1318
1409
|
|
|
1319
1410
|
write_config(config)
|