moa-cli 0.2.2__tar.gz → 0.3.0__tar.gz
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.3
|
|
2
2
|
Name: moa-cli
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.3.0
|
|
4
4
|
Summary: Ask one question to multiple local AI coding CLIs in parallel and collect their answers.
|
|
5
5
|
Keywords: llm,agents,cli,claude,codex,agy,opencode,peer-review
|
|
6
6
|
Author: Paul-Louis Pröve
|
|
@@ -19,7 +19,7 @@ Description-Content-Type: text/markdown
|
|
|
19
19
|
|
|
20
20
|
# MOA - Mixture of Agents
|
|
21
21
|
|
|
22
|
-
Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds
|
|
22
|
+
Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds while a moderator checks for convergence and writes the verdict.
|
|
23
23
|
|
|
24
24
|
It's a drop-in, batteries-included replacement for hand-rolling parallel `claude -p` / `codex exec` / `opencode run` calls (or a "peer review" agent skill): one command, clean attributed output, made to be called by a human **or** by another agent.
|
|
25
25
|
|
|
@@ -73,7 +73,7 @@ MOA has three prompt verbs that share the same selection/output options:
|
|
|
73
73
|
|
|
74
74
|
- **`moa ask PROMPT`** - council / peer review: N agents answer the same prompt in parallel; every answer is returned with attribution, streamed as it lands.
|
|
75
75
|
- **`moa distill PROMPT`** - synthesis: run the council, then one strong aggregator merges the answers into a single unified response.
|
|
76
|
-
- **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds,
|
|
76
|
+
- **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds, with a moderator that checks for convergence between rounds and writes the final verdict. The costliest mode; read the caveats below before reaching for it.
|
|
77
77
|
|
|
78
78
|
```bash
|
|
79
79
|
moa doctor # show installed CLIs and their default models
|
|
@@ -87,12 +87,12 @@ moa ask --json "..." # machine-readable JSONL (for agents
|
|
|
87
87
|
git diff | moa ask -f - "Review this diff." # read the prompt from stdin
|
|
88
88
|
moa distill "Design a rate limiter." # council, then merge into one answer
|
|
89
89
|
moa distill -s codex "..." # pick who distills (auto | random | provider)
|
|
90
|
-
moa debate "Is this race condition real?" # 2 debaters
|
|
90
|
+
moa debate "Is this race condition real?" # 2 debaters; the first also moderates (default 2 agents)
|
|
91
91
|
moa debate -r 3 "..." # more rounds (default 2, hard max 4)
|
|
92
|
-
moa debate
|
|
92
|
+
moa debate -n 3 --moderator agy "..." # pin a neutral moderator (select a 3rd, non-debating agent)
|
|
93
93
|
```
|
|
94
94
|
|
|
95
|
-
The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and
|
|
95
|
+
The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and `--moderator`.
|
|
96
96
|
|
|
97
97
|
### Read-only by default
|
|
98
98
|
|
|
@@ -203,7 +203,7 @@ moa config unset num # remove a key
|
|
|
203
203
|
moa config unset model claude # remove one [models] entry
|
|
204
204
|
```
|
|
205
205
|
|
|
206
|
-
The
|
|
206
|
+
The role defaults can be persisted too: the distill `synthesizer` and the debate `moderator` (e.g. `moa config set synthesizer codex`, `moa config set moderator agy`). `debate`'s `-r/--rounds` is not persisted. A CLI `-m` override takes precedence over the config `[models]` table for that provider.
|
|
207
207
|
|
|
208
208
|
### Output
|
|
209
209
|
|
|
@@ -221,23 +221,23 @@ The synthesizer default is persistable too (e.g. `moa config set synthesizer cod
|
|
|
221
221
|
|
|
222
222
|
The aggregator prompt is adapted from the Mixture-of-Agents "Aggregate-and-Synthesize" prompt (Wang et al. 2024): it tells the aggregator to critically evaluate the inputs (some may be biased or incorrect) and not to simply replicate them but offer a refined, accurate, comprehensive reply.
|
|
223
223
|
|
|
224
|
-
### `moa debate` (sequential debate +
|
|
224
|
+
### `moa debate` (sequential debate + moderator)
|
|
225
225
|
|
|
226
|
-
`debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange
|
|
226
|
+
`debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange overseen by a **moderator** that checks for convergence between rounds and writes the final answer.
|
|
227
227
|
|
|
228
|
-
**Roles.**
|
|
228
|
+
**Roles.** The top **2** selected agents are the debaters. The **moderator** runs the per-round convergence check and writes the verdict; by default it is the top-priority selected agent (so the default 2-agent debate has agent #1 also moderate). Debate only needs **2 agents**; with fewer it exits cleanly rather than silently degrading. For a **neutral** moderator that doesn't also debate, select a third agent and pin it: `moa debate -n 3 --moderator <provider>` (the moderator must be one of the selected agents). The moderator only ever sees the transcript **anonymized + shuffled**, so even when it is itself a debater it can't favour its own answer.
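
The role rules above can be sketched with plain provider names. This is a simplified illustration under stated assumptions, not the package's code: `pick_roles` is a hypothetical helper, providers are modelled as strings, and `selected` is assumed to already be in priority order.

```python
from typing import Optional


def pick_roles(selected: list[str], moderator: Optional[str] = None) -> tuple[list[str], str]:
    """Return (debaters, moderator) for a priority-ordered list of agent names.

    Hypothetical helper for illustration; it only mirrors the documented rules.
    """
    if len(selected) < 2:
        # Too few agents is an error, not a silently degraded run.
        raise ValueError(f"debate needs at least 2 agents; only {len(selected)} selected")
    debaters = selected[:2]  # the top 2 selected agents always debate
    if moderator in (None, "auto"):
        return debaters, selected[0]  # default: agent #1 also moderates
    if moderator not in selected:
        raise ValueError(f"moderator {moderator!r} is not among the selected agents")
    return debaters, moderator


print(pick_roles(["claude", "codex"]))                # agent #1 debates and moderates
print(pick_roles(["claude", "codex", "agy"], "agy"))  # agy moderates without debating
```

The second call shows why a neutral moderator needs a third selected agent: the first two entries are always the debaters, so only a name further down the list can moderate without also debating.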
|
|
229
229
|
|
|
230
230
|
**Rounds.** `-r/--rounds` defaults to **2** (gains plateau around 2-3 rounds while token cost grows multiplicatively) and is hard-capped at **4** - higher values are clamped with a warning on stderr.
|
|
231
231
|
|
|
232
|
-
**The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit.
|
|
232
|
+
**The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit. After each non-final round the **moderator** reads the debaters' latest answers and replies `DONE` (they've converged or fully aired their disagreement) or `CONTINUE`; a `DONE` stops the debate before the cap.
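
A caller could branch on the moderator's reply roughly as follows. This is a minimal sketch, not the package's implementation: `moderator_says_done` and `rounds_played` are hypothetical names, and treating the leading word case-insensitively (and ignoring trailing punctuation) is an assumption. Only the `DONE` / `CONTINUE` convention comes from the description above.

```python
CONVERGENCE_DONE = "DONE"


def moderator_says_done(reply: str) -> bool:
    """True when the first word of the moderator's reply is DONE.

    Assumption: anything else (CONTINUE, empty, or unparseable) means "keep going".
    """
    words = reply.strip().split()
    return bool(words) and words[0].upper().rstrip(".!:,") == CONVERGENCE_DONE


def rounds_played(replies: list[str], max_rounds: int) -> int:
    """Count how many rounds run, given the moderator's reply after each non-final round."""
    for round_num in range(1, max_rounds):  # no check happens after the final round
        index = round_num - 1
        if index < len(replies) and moderator_says_done(replies[index]):
            return round_num  # DONE stops the debate before the cap
    return max_rounds


print(rounds_played(["CONTINUE", "DONE"], max_rounds=4))
```

Defaulting to "continue" on an unclear reply keeps the round cap as the only hard stop; a stricter caller might instead treat an unparseable reply as an error.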
|
|
233
233
|
|
|
234
|
-
**The
|
|
234
|
+
**The verdict.** The moderator reads the full transcript - presented **anonymized and order-shuffled** (so brand/position bias is killed, even when the moderator was a debater) - and writes the final answer. Its prompt instructs it to weigh correctness and evidence **above** confidence and fluency. The verdict is the final block (`──── verdict · moderator <name> · ... ────`).
|
|
235
235
|
|
|
236
|
-
**Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the
|
|
236
|
+
**Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the moderator's verdict last. `--json` emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", "moderator": "<name>", ...}` record.
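
For agents consuming `--json`, the records described above could be grouped like this. The sketch assumes one JSON object per line and relies only on the `type`, `round`, and `moderator` fields named above; `split_debate_output` is a hypothetical helper and the sample records are made up for illustration.

```python
import json
from typing import Optional


def split_debate_output(jsonl_text: str) -> tuple[dict[int, list[dict]], Optional[dict]]:
    """Group debate_turn records by round and return the verdict record, if any."""
    turns_by_round: dict[int, list[dict]] = {}
    verdict: Optional[dict] = None
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("type") == "debate_turn":
            turns_by_round.setdefault(record["round"], []).append(record)
        elif record.get("type") == "verdict":
            verdict = record
    return turns_by_round, verdict


# Made-up sample: two debaters, two rounds, then the moderator's verdict.
records = [
    {"type": "debate_turn", "round": 1},
    {"type": "debate_turn", "round": 1},
    {"type": "debate_turn", "round": 2},
    {"type": "debate_turn", "round": 2},
    {"type": "verdict", "moderator": "claude", "status": "ok", "text": "Final answer."},
]
sample = "\n".join(json.dumps(r) for r in records)

turns, verdict = split_debate_output(sample)
print(sorted(turns), verdict["moderator"] if verdict else None)
```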
|
|
237
237
|
|
|
238
|
-
**Safety.** Debaters and the
|
|
238
|
+
**Safety.** Debaters and the moderator run in the same read-only (or `--yolo`) mode as the other verbs - there is no permission bypass. agy's partial-sandbox caveat (shell only; it can still edit files) applies here too.
|
|
239
239
|
|
|
240
|
-
> **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters
|
|
240
|
+
> **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters × rounds` calls, plus a moderator check per round and the verdict) **and the least reliably beneficial.** The research is mixed-to-negative: multi-agent debate can converge on a *wrong* answer through conformity, a confident-but-incorrect debater can win on persuasiveness over correctness, and more rounds can entrench an error rather than fix it. The moderator and the adversarial-stance prompt are there to fight these failure modes, but they do not eliminate them. For most questions, `ask` or `distill` is the better default; reach for `debate` when you specifically want to surface and stress-test disagreement. (See *Can LLM Agents Really Debate?* arXiv:2511.07784, *Talk Isn't Always Cheap* arXiv:2509.05396, and the conformity/position-bias work cited in the design notes.)
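
As a rough worked example of the call count in the caveat above: the sketch below computes an upper bound, assuming every round runs (no early `DONE`), every call succeeds, and the convergence check happens only after non-final rounds as described under "The loop". `max_debate_calls` is a hypothetical name, and the figure counts calls, not tokens.

```python
def max_debate_calls(rounds: int = 2, debaters: int = 2) -> int:
    """Upper bound on agent calls for one debate (assumes rounds >= 1)."""
    turns = debaters * rounds     # every debater answers in every round
    checks = max(rounds - 1, 0)   # one convergence check after each non-final round
    verdict = 1                   # the moderator's final answer
    return turns + checks + verdict


for r in range(1, 5):  # 4 is the documented hard cap on rounds
    print(r, max_debate_calls(rounds=r))
```

Under these assumptions the default 2-round debate costs at most 6 calls and the 4-round cap at most 12, compared with one call per agent for `ask`.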
|
|
241
241
|
|
|
242
242
|
### Attribution policy
|
|
243
243
|
|
|
@@ -8,7 +8,7 @@
|
|
|
8
8
|
|
|
9
9
|
# MOA - Mixture of Agents
|
|
10
10
|
|
|
11
|
-
Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds
|
|
11
|
+
Ask one question to multiple local AI coding CLIs **in parallel** and collect their answers. MOA detects which agent CLIs you have installed (Claude Code, Codex, agy, opencode), fans your prompt out to them, and streams each answer back the moment that agent finishes. Or run `moa distill` to have a strong aggregator merge those answers into a single unified response, or `moa debate` to have them critique each other across rounds while a moderator checks for convergence and writes the verdict.
|
|
12
12
|
|
|
13
13
|
It's a drop-in, batteries-included replacement for hand-rolling parallel `claude -p` / `codex exec` / `opencode run` calls (or a "peer review" agent skill): one command, clean attributed output, made to be called by a human **or** by another agent.
|
|
14
14
|
|
|
@@ -62,7 +62,7 @@ MOA has three prompt verbs that share the same selection/output options:
|
|
|
62
62
|
|
|
63
63
|
- **`moa ask PROMPT`** - council / peer review: N agents answer the same prompt in parallel; every answer is returned with attribution, streamed as it lands.
|
|
64
64
|
- **`moa distill PROMPT`** - synthesis: run the council, then one strong aggregator merges the answers into a single unified response.
|
|
65
|
-
- **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds,
|
|
65
|
+
- **`moa debate PROMPT`** - sequential debate: two debaters answer and adversarially critique each other across rounds, with a moderator that checks for convergence between rounds and writes the final verdict. The costliest mode; read the caveats below before reaching for it.
|
|
66
66
|
|
|
67
67
|
```bash
|
|
68
68
|
moa doctor # show installed CLIs and their default models
|
|
@@ -76,12 +76,12 @@ moa ask --json "..." # machine-readable JSONL (for agents
|
|
|
76
76
|
git diff | moa ask -f - "Review this diff." # read the prompt from stdin
|
|
77
77
|
moa distill "Design a rate limiter." # council, then merge into one answer
|
|
78
78
|
moa distill -s codex "..." # pick who distills (auto | random | provider)
|
|
79
|
-
moa debate "Is this race condition real?" # 2 debaters
|
|
79
|
+
moa debate "Is this race condition real?" # 2 debaters; the first also moderates (default 2 agents)
|
|
80
80
|
moa debate -r 3 "..." # more rounds (default 2, hard max 4)
|
|
81
|
-
moa debate
|
|
81
|
+
moa debate -n 3 --moderator agy "..." # pin a neutral moderator (select a 3rd, non-debating agent)
|
|
82
82
|
```
|
|
83
83
|
|
|
84
|
-
The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and
|
|
84
|
+
The shared options (`-n/--num`, `-p/--provider`, `-x/--exclude`, `-m/--model`, `-t/--timeout`, `-f/--file`, `--json`, `--yolo`) work identically on all three verbs. `distill` adds `-s/--synthesizer`; `debate` adds `-r/--rounds` and `--moderator`.
|
|
85
85
|
|
|
86
86
|
### Read-only by default
|
|
87
87
|
|
|
@@ -192,7 +192,7 @@ moa config unset num # remove a key
|
|
|
192
192
|
moa config unset model claude # remove one [models] entry
|
|
193
193
|
```
|
|
194
194
|
|
|
195
|
-
The
|
|
195
|
+
The role defaults can be persisted too: the distill `synthesizer` and the debate `moderator` (e.g. `moa config set synthesizer codex`, `moa config set moderator agy`). `debate`'s `-r/--rounds` is not persisted. A CLI `-m` override takes precedence over the config `[models]` table for that provider.
|
|
196
196
|
|
|
197
197
|
### Output
|
|
198
198
|
|
|
@@ -210,23 +210,23 @@ The synthesizer default is persistable too (e.g. `moa config set synthesizer cod
|
|
|
210
210
|
|
|
211
211
|
The aggregator prompt is adapted from the Mixture-of-Agents "Aggregate-and-Synthesize" prompt (Wang et al. 2024): it tells the aggregator to critically evaluate the inputs (some may be biased or incorrect) and not to simply replicate them but offer a refined, accurate, comprehensive reply.
|
|
212
212
|
|
|
213
|
-
### `moa debate` (sequential debate +
|
|
213
|
+
### `moa debate` (sequential debate + moderator)
|
|
214
214
|
|
|
215
|
-
`debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange
|
|
215
|
+
`debate` is the opt-in, highest-cost mode. Instead of fanning out in parallel, it runs a sequential, adversarial exchange overseen by a **moderator** that checks for convergence between rounds and writes the final answer.
|
|
216
216
|
|
|
217
|
-
**Roles.**
|
|
217
|
+
**Roles.** The top **2** selected agents are the debaters. The **moderator** runs the per-round convergence check and writes the verdict; by default it is the top-priority selected agent (so the default 2-agent debate has agent #1 also moderate). Debate only needs **2 agents**; with fewer it exits cleanly rather than silently degrading. For a **neutral** moderator that doesn't also debate, select a third agent and pin it: `moa debate -n 3 --moderator <provider>` (the moderator must be one of the selected agents). The moderator only ever sees the transcript **anonymized + shuffled**, so even when it is itself a debater it can't favour its own answer.
|
|
218
218
|
|
|
219
219
|
**Rounds.** `-r/--rounds` defaults to **2** (gains plateau around 2-3 rounds while token cost grows multiplicatively) and is hard-capped at **4** - higher values are clamped with a warning on stderr.
|
|
220
220
|
|
|
221
|
-
**The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit.
|
|
221
|
+
**The loop.** Round 1: debater A answers cold; debater B sees A's answer with an adversarial-stance instruction ("identify errors/weaknesses before giving your own answer; do not agree merely to reach consensus"). Each later round, every debater sees the other's latest answer and responds in the same spirit. After each non-final round the **moderator** reads the debaters' latest answers and replies `DONE` (they've converged or fully aired their disagreement) or `CONTINUE`; a `DONE` stops the debate before the cap.
|
|
222
222
|
|
|
223
|
-
**The
|
|
223
|
+
**The verdict.** The moderator reads the full transcript - presented **anonymized and order-shuffled** (so brand/position bias is killed, even when the moderator was a debater) - and writes the final answer. Its prompt instructs it to weigh correctness and evidence **above** confidence and fluency. The verdict is the final block (`──── verdict · moderator <name> · ... ────`).
|
|
224
224
|
|
|
225
|
-
**Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the
|
|
225
|
+
**Streaming/output.** Each debater's turn streams as it completes (`──── round N · <provider> · ... ────`), then the moderator's verdict last. `--json` emits a `{"type": "debate_turn", "round": N, ...}` record per turn plus a final `{"type": "verdict", "moderator": "<name>", ...}` record.
|
|
226
226
|
|
|
227
|
-
**Safety.** Debaters and the
|
|
227
|
+
**Safety.** Debaters and the moderator run in the same read-only (or `--yolo`) mode as the other verbs - there is no permission bypass. agy's partial-sandbox caveat (shell only; it can still edit files) applies here too.
|
|
228
228
|
|
|
229
|
-
> **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters
|
|
229
|
+
> **Caveat - use sparingly.** Debate is the costliest mode (roughly `debaters × rounds` calls, plus a moderator check per round and the verdict) **and the least reliably beneficial.** The research is mixed-to-negative: multi-agent debate can converge on a *wrong* answer through conformity, a confident-but-incorrect debater can win on persuasiveness over correctness, and more rounds can entrench an error rather than fix it. The moderator and the adversarial-stance prompt are there to fight these failure modes, but they do not eliminate them. For most questions, `ask` or `distill` is the better default; reach for `debate` when you specifically want to surface and stress-test disagreement. (See *Can LLM Agents Really Debate?* arXiv:2511.07784, *Talk Isn't Always Cheap* arXiv:2509.05396, and the conformity/position-bias work cited in the design notes.)
|
|
230
230
|
|
|
231
231
|
### Attribution policy
|
|
232
232
|
|
|
@@ -377,11 +377,12 @@ def build_synthesis_prompt(
|
|
|
377
377
|
|
|
378
378
|
|
|
379
379
|
# --------------------------------------------------------------------------- #
|
|
380
|
-
# Debate: sequential adversarial rounds,
|
|
381
|
-
#
|
|
382
|
-
#
|
|
383
|
-
#
|
|
384
|
-
#
|
|
380
|
+
# Debate: sequential adversarial rounds, with a moderator that checks for
|
|
381
|
+
# convergence after each round and then writes the verdict from the (anonymized
|
|
382
|
+
# + shuffled) full transcript. The literature is clear that debate is the
|
|
383
|
+
# costliest and least reliably beneficial mode: it can converge on a wrong answer
|
|
384
|
+
# (conformity), so the verdict prompt weighs correctness/evidence over confidence
|
|
385
|
+
# and fluency, and the anonymization holds even when the moderator also debated.
|
|
385
386
|
# --------------------------------------------------------------------------- #
|
|
386
387
|
|
|
387
388
|
ROUNDS_MAX = 4
|
|
@@ -393,23 +394,16 @@ ADVERSARIAL_INSTRUCTION = """Before giving your own answer, critically examine t
|
|
|
393
394
|
other participant's answer above: identify any errors, weaknesses, unsupported claims, or \
|
|
394
395
|
gaps in reasoning. Do NOT agree merely to reach consensus - only concede a point if it is \
|
|
395
396
|
genuinely correct. Then give your own best, complete answer to the original question, \
|
|
396
|
-
incorporating any valid corrections.
|
|
397
|
+
incorporating any valid corrections."""
|
|
397
398
|
|
|
398
|
-
|
|
399
|
-
|
|
400
|
-
|
|
401
|
-
|
|
402
|
-
|
|
403
|
-
|
|
404
|
-
|
|
405
|
-
|
|
406
|
-
# The neutral judge reads the full transcript (anonymized + shuffled) and writes
|
|
407
|
-
# the final answer. It must weigh correctness/evidence over confidence/fluency -
|
|
408
|
-
# this is where conformity-to-a-wrong-answer is most dangerous, so the judge
|
|
409
|
-
# never just echoes the most fluent or most confident debater.
|
|
410
|
-
JUDGE_PROMPT = """You are a neutral judge. Below is a transcript of a debate between AI coding \
|
|
411
|
-
assistants who answered the user's question and then critiqued each other's answers across \
|
|
412
|
-
several rounds. The participants are anonymized and presented in arbitrary order.
|
|
399
|
+
# The moderator reads the full transcript (anonymized + shuffled) and writes the
|
|
400
|
+
# final answer. It must weigh correctness/evidence over confidence/fluency - this
|
|
401
|
+
# is where conformity-to-a-wrong-answer is most dangerous, so it never just echoes
|
|
402
|
+
# the most fluent or most confident debater.
|
|
403
|
+
MODERATOR_VERDICT_PROMPT = """You are the moderator of this debate. Below is a transcript of a \
|
|
404
|
+
debate between AI coding assistants who answered the user's question and then critiqued each \
|
|
405
|
+
other's answers across several rounds. The participants are anonymized and presented in \
|
|
406
|
+
arbitrary order.
|
|
413
407
|
|
|
414
408
|
Your task is to read the full debate and write the single best, final answer to the user's \
|
|
415
409
|
question. Weigh correctness and the strength of evidence and reasoning ABOVE confidence, \
|
|
@@ -424,43 +418,50 @@ best possible answer.
|
|
|
424
418
|
asserted.
|
|
425
419
|
- Do not invent information that the debate does not support."""
|
|
426
420
|
|
|
421
|
+
# After each non-final round the moderator decides whether another round would
|
|
422
|
+
# materially help. It replies with a single leading word the caller branches on.
|
|
423
|
+
CONVERGENCE_DONE = "DONE"
|
|
424
|
+
MODERATOR_CONVERGENCE_PROMPT = """You are the moderator of this debate. Below are the debaters' \
|
|
425
|
+
latest answers to the user's question, anonymized. Decide whether they have converged on an \
|
|
426
|
+
answer, or at least fully aired and clarified their disagreement, so that another round would \
|
|
427
|
+
add nothing material.
|
|
428
|
+
|
|
429
|
+
Reply with EXACTLY one word on the first line: DONE if the debate should stop now, or CONTINUE \
|
|
430
|
+
if another round would materially improve the final answer. Add nothing else."""
|
|
431
|
+
|
|
427
432
|
|
|
428
433
|
def assign_debate_roles(
|
|
429
|
-
selected: list[Provider],
|
|
434
|
+
selected: list[Provider], moderator: str | None
|
|
430
435
|
) -> tuple[list[Provider], Provider]:
|
|
431
|
-
"""Split the selected providers into (debaters,
|
|
432
|
-
|
|
433
|
-
|
|
434
|
-
|
|
435
|
-
|
|
436
|
-
|
|
437
|
-
|
|
438
|
-
|
|
436
|
+
"""Split the selected providers into (debaters, moderator).
|
|
437
|
+
|
|
438
|
+
The top 2 selected providers debate. The moderator runs the per-round
|
|
439
|
+
convergence check and writes the final verdict; it MAY be one of the debaters.
|
|
440
|
+
`moderator` is "auto" (or None) -> the top-priority selected provider (so the
|
|
441
|
+
default 2-agent debate has agent #1 also moderate), or a provider name that
|
|
442
|
+
must be among the selected providers (pin a non-debating 3rd for a neutral
|
|
443
|
+
moderator). Requires at least 2 selected providers; raises ValueError
|
|
444
|
+
otherwise (the caller turns this into a clean exit - debate never silently
|
|
445
|
+
degrades).
|
|
439
446
|
"""
|
|
440
|
-
if
|
|
441
|
-
|
|
442
|
-
|
|
443
|
-
|
|
444
|
-
|
|
445
|
-
|
|
446
|
-
|
|
447
|
-
|
|
448
|
-
|
|
449
|
-
|
|
450
|
-
|
|
451
|
-
|
|
452
|
-
|
|
453
|
-
f"debate needs at least 2 debaters plus the judge ({judge}); only "
|
|
454
|
-
f"{len(debaters)} non-judge provider(s) available. Increase -n or -p."
|
|
455
|
-
)
|
|
456
|
-
return debaters, judge_provider
|
|
457
|
-
|
|
458
|
-
if len(selected) < 3:
|
|
447
|
+
if len(selected) < 2:
|
|
448
|
+
raise ValueError(
|
|
449
|
+
f"debate needs at least 2 providers (2 debaters); only {len(selected)} available. "
|
|
450
|
+
"Increase -n, pin more with -p, or install more agents."
|
|
451
|
+
)
|
|
452
|
+
debaters = selected[:2]
|
|
453
|
+
if moderator in (None, "auto"):
|
|
454
|
+
return debaters, selected[0]
|
|
455
|
+
|
|
456
|
+
names = [p.name for p in selected]
|
|
457
|
+
if moderator not in PROVIDERS:
|
|
458
|
+
raise ValueError(f"Unknown moderator: {moderator}")
|
|
459
|
+
if moderator not in names:
|
|
459
460
|
raise ValueError(
|
|
460
|
-
f"
|
|
461
|
-
f"
|
|
461
|
+
f"Moderator {moderator!r} is not among the selected providers ({', '.join(names)}). "
|
|
462
|
+
f"Pin it with -p {moderator} or widen the selection."
|
|
462
463
|
)
|
|
463
|
-
return
|
|
464
|
+
return debaters, next(p for p in selected if p.name == moderator)
|
|
464
465
|
|
|
465
466
|
|
|
466
467
|
def clamp_rounds(rounds: int) -> tuple[int, str | None]:
|
|
@@ -496,18 +497,19 @@ def build_debate_turn_prompt(
|
|
|
496
497
|
)
|
|
497
498
|
|
|
498
499
|
|
|
499
|
-
def
|
|
500
|
+
def build_verdict_prompt(
|
|
500
501
|
question: str,
|
|
501
502
|
transcript: list[RunResult],
|
|
502
503
|
rng: random.Random | None = None,
|
|
503
504
|
) -> tuple[str, dict[str, str]]:
|
|
504
|
-
"""Build the
|
|
505
|
-
|
|
506
|
-
|
|
507
|
-
|
|
508
|
-
|
|
509
|
-
|
|
510
|
-
|
|
505
|
+
"""Build the moderator's final-verdict prompt from the transcript, anonymized
|
|
506
|
+
+ shuffled.
|
|
507
|
+
|
|
508
|
+
The transcript is the per-turn RunResults; the moderator sees only the final
|
|
509
|
+
answer text of each turn, relabelled "Participant 1/2/.." in shuffled order so
|
|
510
|
+
brand/position bias is killed - this matters even when the moderator is itself
|
|
511
|
+
a debater, since it can't tell which answer is its own. The label_map maps each
|
|
512
|
+
label back to the real provider for the caller, though debate never reveals it.
|
|
511
513
|
"""
|
|
512
514
|
turns = [r for r in transcript if r.status == "ok"]
|
|
513
515
|
shuffled = list(turns)
|
|
@@ -519,13 +521,27 @@ def build_judge_prompt(
|
|
|
519
521
|
sections.append(f"### {label}\n\n{result.stdout.strip()}")
|
|
520
522
|
label_map[label] = result.provider
|
|
521
523
|
prompt = (
|
|
522
|
-
f"{
|
|
524
|
+
f"{MODERATOR_VERDICT_PROMPT}\n\n"
|
|
523
525
|
f"## User question\n\n{question}\n\n"
|
|
524
526
|
f"## Debate transcript\n\n" + "\n\n".join(sections) + "\n\n## Your final answer\n"
|
|
525
527
|
)
|
|
526
528
|
return prompt, label_map
|
|
527
529
|
|
|
528
530
|
|
|
531
|
+
def build_convergence_prompt(question: str, latest: list[RunResult]) -> str:
|
|
532
|
+
"""The moderator's per-round convergence check. `latest` is the debaters' most
|
|
533
|
+
recent answers, anonymized so the moderator judges substance over brand. The
|
|
534
|
+
expected reply starts with DONE (stop) or CONTINUE (another round helps)."""
|
|
535
|
+
answers = "\n\n".join(
|
|
536
|
+
f"### Participant {i + 1}\n\n{r.stdout.strip()}" for i, r in enumerate(latest)
|
|
537
|
+
)
|
|
538
|
+
return (
|
|
539
|
+
f"{MODERATOR_CONVERGENCE_PROMPT}\n\n"
|
|
540
|
+
f"## User question\n\n{question}\n\n"
|
|
541
|
+
f"## The debaters' latest answers\n\n{answers}\n\n## Your decision\n"
|
|
542
|
+
)
|
|
543
|
+
|
|
544
|
+
|
|
529
545
|
# --------------------------------------------------------------------------- #
|
|
530
546
|
# Render: stdout carries content (Markdown or JSONL); stderr carries progress.
|
|
531
547
|
# --------------------------------------------------------------------------- #
|
|
@@ -627,10 +643,10 @@ def render_debate_turn_block(result: RunResult, round_num: int, plain: bool | No
|
|
|
627
643
|
return _render(label, result, plain)
|
|
628
644
|
|
|
629
645
|
|
|
630
|
-
def
|
|
646
|
+
def render_verdict_block(result: RunResult, moderator: str, plain: bool | None = None) -> str:
|
|
631
647
|
if plain is None:
|
|
632
648
|
plain = _plain_output()
|
|
633
|
-
label = f"verdict ·
|
|
649
|
+
label = f"verdict · moderator {moderator} · {_status_label(result.status)} · {result.elapsed:.1f}s"
|
|
634
650
|
return _render(label, result, plain)
|
|
635
651
|
|
|
636
652
|
|
|
@@ -648,10 +664,10 @@ def debate_turn_record(result: RunResult, round_num: int) -> dict:
|
|
|
648
664
|
}
|
|
649
665
|
|
|
650
666
|
|
|
651
|
-
def
|
|
667
|
+
def verdict_record(result: RunResult, moderator: str) -> dict:
|
|
652
668
|
return {
|
|
653
669
|
"type": "verdict",
|
|
654
|
-
"
|
|
670
|
+
"moderator": moderator,
|
|
655
671
|
"status": result.status,
|
|
656
672
|
"elapsed": round(result.elapsed, 3),
|
|
657
673
|
"text": result.stdout,
|
|
@@ -672,12 +688,21 @@ def judge_record(result: RunResult, judge: str) -> dict:
|
|
|
672
688
|
|
|
673
689
|
# Scalar config keys and the type each maps to. `exclude` (list[str]) and the
|
|
674
690
|
# `[models]` table are handled separately because they aren't plain scalars.
|
|
675
|
-
_CONFIG_SCALARS: dict[str, type] = {"num": int, "timeout": float, "synthesizer": str}
|
|
691
|
+
_CONFIG_SCALARS: dict[str, type] = {"num": int, "timeout": float, "synthesizer": str, "moderator": str}
|
|
676
692
|
_CONFIG_KEYS: tuple[str, ...] = (*_CONFIG_SCALARS, "exclude", "models")
|
|
677
693
|
# Synthesizer accepts the special modes plus any known provider name.
|
|
678
694
|
_SYNTHESIZER_MODES: tuple[str, ...] = ("auto", "first", "random")
|
|
695
|
+
# Moderator accepts "auto" (the top-priority selected agent) or a provider name.
|
|
696
|
+
_MODERATOR_MODES: tuple[str, ...] = ("auto",)
|
|
679
697
|
# The built-in defaults, shown by `config show` when a key isn't in the file.
|
|
680
|
-
_CONFIG_DEFAULTS: dict = {
|
|
698
|
+
_CONFIG_DEFAULTS: dict = {
|
|
699
|
+
"num": 3,
|
|
700
|
+
"timeout": 180.0,
|
|
701
|
+
"synthesizer": "auto",
|
|
702
|
+
"moderator": "auto",
|
|
703
|
+
"exclude": [],
|
|
704
|
+
"models": {},
|
|
705
|
+
}
|
|
681
706
|
|
|
682
707
|
|
|
683
708
|
def config_dir() -> Path:
|
|
@@ -709,6 +734,9 @@ def _validate_scalar(key: str, value) -> None:
|
|
|
709
734
|
if key == "synthesizer" and value not in (*_SYNTHESIZER_MODES, *PROVIDERS):
|
|
710
735
|
allowed = ", ".join((*_SYNTHESIZER_MODES, *PROVIDERS))
|
|
711
736
|
raise ValueError(f"synthesizer must be one of: {allowed}.")
|
|
737
|
+
if key == "moderator" and value not in (*_MODERATOR_MODES, *PROVIDERS):
|
|
738
|
+
allowed = ", ".join((*_MODERATOR_MODES, *PROVIDERS))
|
|
739
|
+
raise ValueError(f"moderator must be one of: {allowed}.")
|
|
712
740
|
|
|
713
741
|
|
|
714
742
|
def load_config() -> dict:
|
|
@@ -786,6 +814,8 @@ def serialize_config(config: dict) -> str:
|
|
|
786
814
|
lines.append(f"timeout = {int(timeout) if timeout.is_integer() else timeout!r}")
|
|
787
815
|
if "synthesizer" in config:
|
|
788
816
|
lines.append(f"synthesizer = {_toml_str(config['synthesizer'])}")
|
|
817
|
+
if "moderator" in config:
|
|
818
|
+
lines.append(f"moderator = {_toml_str(config['moderator'])}")
|
|
789
819
|
if "exclude" in config:
|
|
790
820
|
items = ", ".join(_toml_str(v) for v in config["exclude"])
|
|
791
821
|
lines.append(f"exclude = [{items}]")
|
|
@@ -973,6 +1003,7 @@ def resolve_run(
|
|
|
973
1003
|
timeout: float | None,
|
|
974
1004
|
json_output: bool,
|
|
975
1005
|
yolo: bool,
|
|
1006
|
+
default_num: int = 3,
|
|
976
1007
|
) -> RunConfig:
|
|
977
1008
|
"""Resolve the shared options into a RunConfig, emitting the selection note.
|
|
978
1009
|
|
|
@@ -981,7 +1012,9 @@ def resolve_run(
|
|
|
981
1012
|
flag), parse model overrides, select providers, and print the stderr
|
|
982
1013
|
selection note (including agy's honest partial-protection note). Every verb
|
|
983
1014
|
picks up config defaults identically because the merge lives only here.
|
|
984
|
-
|
|
1015
|
+
`default_num` is the built-in fallback when neither flag nor config sets num
|
|
1016
|
+
(debate passes 2, since it only needs 2 agents). Raises typer.BadParameter on
|
|
1017
|
+
bad input and typer.Exit(1) when nothing runs.
|
|
985
1018
|
"""
|
|
986
1019
|
prompt_text = _read_prompt(prompt, file)
|
|
987
1020
|
if not prompt_text:
|
|
@@ -993,7 +1026,7 @@ def resolve_run(
|
|
|
993
1026
|
except ValueError as exc:
|
|
994
1027
|
raise typer.BadParameter(f"{config_path()}: {exc}") from exc
|
|
995
1028
|
|
|
996
|
-
num = resolve_option(num, "num", config,
|
|
1029
|
+
num = resolve_option(num, "num", config, default_num)
|
|
997
1030
|
timeout = resolve_option(timeout, "timeout", config, 180.0)
|
|
998
1031
|
# Repeatable flags are an empty list when omitted, not None, so treat empty
|
|
999
1032
|
# as "fall back to config" for exclude.
|
|
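Note on the hunk above: `resolve_run` now takes `default_num` and passes it to `resolve_option` as the built-in fallback, so `debate` can default to 2 agents while other verbs keep 3. The body of `resolve_option` is not part of this diff; the sketch below is an assumed reimplementation of the stated precedence (CLI flag > config file > built-in) under the hypothetical name `resolve_option_sketch`.

```python
def resolve_option_sketch(cli_value, key: str, config: dict, builtin):
    """Assumed precedence: CLI flag, then config file, then built-in default."""
    if cli_value is not None:
        return cli_value  # an explicit flag always wins
    if key in config:
        return config[key]  # otherwise use the persisted value
    return builtin  # otherwise fall back to the caller's default


if __name__ == "__main__":
    # debate passes default_num=2; with no flag and no config entry it is used.
    print(resolve_option_sketch(None, "num", {}, 2))
    # A config entry overrides the built-in, and a flag overrides both.
    print(resolve_option_sketch(None, "num", {"num": 4}, 2))
    print(resolve_option_sketch(5, "num", {"num": 4}, 2))
```

Because the fallback is a parameter, a persisted `num` in the config file still applies to `debate`; only the last-resort default differs per verb.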
@@ -1131,9 +1164,12 @@ def _run_synthesis(
|
|
|
1131
1164
|
RoundsOpt = Annotated[
|
|
1132
1165
|
int, typer.Option("--rounds", "-r", help=f"Debate rounds (default 2, hard max {ROUNDS_MAX}).")
|
|
1133
1166
|
]
|
|
1134
|
-
|
|
1167
|
+
ModeratorOpt = Annotated[
|
|
1135
1168
|
str | None,
|
|
1136
|
-
typer.Option(
|
|
1169
|
+
typer.Option(
|
|
1170
|
+
"--moderator", "-j",
|
|
1171
|
+
help="Moderator that checks convergence and writes the verdict: auto | a provider.",
|
|
1172
|
+
),
|
|
1137
1173
|
]
|
|
1138
1174
|
|
|
1139
1175
|
|
|
@@ -1147,60 +1183,80 @@ def debate(
|
|
|
1147
1183
|
file: FileOpt = None,
|
|
1148
1184
|
timeout: TimeoutOpt = None,
|
|
1149
1185
|
rounds: RoundsOpt = 2,
|
|
1150
|
-
|
|
1186
|
+
moderator: ModeratorOpt = None,
|
|
1151
1187
|
json_output: JsonOpt = False,
|
|
1152
1188
|
yolo: YoloOpt = False,
|
|
1153
1189
|
) -> None:
|
|
1154
|
-
"""Debate: debaters answer and critique each other across rounds; a
|
|
1155
|
-
|
|
1190
|
+
"""Debate: two debaters answer and critique each other across rounds; a moderator checks convergence and writes the verdict."""
|
|
1191
|
+
# Debate only needs 2 agents (the moderator may also be a debater), so its
|
|
1192
|
+
# built-in default selection is 2, not the usual 3.
|
|
1193
|
+
cfg = resolve_run(
|
|
1194
|
+
prompt, file, num, provider, exclude, model, timeout, json_output, yolo, default_num=2
|
|
1195
|
+
)
|
|
1196
|
+
|
|
1197
|
+
# moderator is verb-specific (like distill's synthesizer) but persistable, so
|
|
1198
|
+
# it merges through the same precedence: CLI flag > config file > built-in.
|
|
1199
|
+
moderator = resolve_option(moderator, "moderator", _read_config_or_empty(), "auto")
|
|
1156
1200
|
|
|
1157
1201
|
rounds, warning = clamp_rounds(rounds)
|
|
1158
1202
|
if warning:
|
|
1159
1203
|
_note(warning)
|
|
1160
1204
|
|
|
1161
1205
|
try:
|
|
1162
|
-
debaters,
|
|
1206
|
+
debaters, moderator_provider = assign_debate_roles(cfg.selected, moderator)
|
|
1163
1207
|
except ValueError as exc:
|
|
1164
1208
|
_note(f"debate: {exc}")
|
|
1165
1209
|
raise typer.Exit(code=1) from exc
|
|
1166
1210
|
|
|
1167
1211
|
_note(
|
|
1168
|
-
f"Debating: {', '.join(p.name for p in debaters)} over {rounds} round(s), "
|
|
1169
|
-
f"
|
|
1170
|
-
f"
|
|
1212
|
+
f"Debating: {', '.join(p.name for p in debaters)} over up to {rounds} round(s), "
|
|
1213
|
+
f"moderator {moderator_provider.name}. Debate is the costliest mode and can "
|
|
1214
|
+
f"converge on a wrong answer."
|
|
1171
1215
|
)
|
|
1172
1216
|
|
|
1173
|
-
transcript = asyncio.run(_run_debate(cfg, debaters,
|
|
1217
|
+
transcript = asyncio.run(_run_debate(cfg, debaters, moderator_provider, rounds))
|
|
1174
1218
|
if not any(r.status == "ok" for r in transcript):
|
|
1175
1219
|
raise typer.Exit(code=1)
|
|
1176
1220
|
|
|
1177
1221
|
|
|
1178
|
-
def
|
|
1179
|
-
|
|
1180
|
-
|
|
1222
|
+
async def _moderator_signals_done(
|
|
1223
|
+
cfg: RunConfig, moderator: Provider, latest_ok: list[RunResult], round_num: int
|
|
1224
|
+
) -> bool:
|
|
1225
|
+
"""Ask the moderator whether the debate has converged. Returns True (stop)
|
|
1226
|
+
only on a clean DONE reply; a failed or CONTINUE check keeps debating."""
|
|
1227
|
+
prompt = build_convergence_prompt(cfg.prompt, latest_ok)
|
|
1228
|
+
_note(f"Round {round_num}: moderator {moderator.name} checking for convergence...")
|
|
1229
|
+
result = await run_provider(
|
|
1230
|
+
moderator, prompt, cfg.timeout, cfg.models.get(moderator.name), cfg.yolo
|
|
1231
|
+
)
|
|
1232
|
+
done = result.status == "ok" and result.stdout.strip().upper().startswith(CONVERGENCE_DONE)
|
|
1233
|
+
if done:
|
|
1234
|
+
_note(f"Moderator {moderator.name}: converged; stopping after round {round_num}.")
|
|
1235
|
+
return done
|
|
1181
1236
|
|
|
1182
1237
|
|
|
1183
1238
|
async def _run_debate(
|
|
1184
1239
|
cfg: RunConfig,
|
|
1185
1240
|
debaters: list[Provider],
|
|
1186
|
-
|
|
1241
|
+
moderator: Provider,
|
|
1187
1242
|
rounds: int,
|
|
1188
1243
|
) -> list[RunResult]:
|
|
1189
|
-
"""Run the sequential debate, then the
|
|
1190
|
-
|
|
1191
|
-
|
|
1192
|
-
|
|
1193
|
-
|
|
1194
|
-
|
|
1195
|
-
|
|
1196
|
-
|
|
1197
|
-
the verdict last
|
|
1244
|
+
"""Run the sequential debate, then the moderator's verdict. Returns the full
|
|
1245
|
+
transcript.
|
|
1246
|
+
|
|
1247
|
+
Each debater keeps its latest answer in `latest`. A turn shows the debater the
|
|
1248
|
+
OTHER debaters' latest answers (anonymized) plus the adversarial instruction;
|
|
1249
|
+
the very first turn (no priors yet) is a cold answer. Turns stream as they
|
|
1250
|
+
complete (stderr progress + stdout/JSON block). After each non-final round the
|
|
1251
|
+
moderator decides whether the debate has converged and can stop early. The
|
|
1252
|
+
moderator then reads the blind+shuffled transcript and writes the verdict last
|
|
1253
|
+
(it may itself be a debater - the anonymization stops it favouring its own
|
|
1254
|
+
answer).
|
|
1198
1255
|
"""
|
|
1199
1256
|
transcript: list[RunResult] = []
|
|
1200
1257
|
latest: dict[str, RunResult] = {}
|
|
1201
1258
|
|
|
1202
1259
|
for round_num in range(1, rounds + 1):
|
|
1203
|
-
converged_this_round = True
|
|
1204
1260
|
for debater in debaters:
|
|
1205
1261
|
prior = [
|
|
1206
1262
|
("the other participant", latest[other.name].stdout)
|
|
@@ -1219,34 +1275,36 @@ async def _run_debate(
|
|
|
1219
1275
|
if cfg.json_output
|
|
1220
1276
|
else render_debate_turn_block(result, round_num)
|
|
1221
1277
|
)
|
|
1222
|
-
# A debater that errors out is not "converged"; only an explicit
|
|
1223
|
-
# no-change signal counts toward an early stop.
|
|
1224
|
-
if not _signals_convergence(result):
|
|
1225
|
-
converged_this_round = False
|
|
1226
1278
|
|
|
1227
|
-
#
|
|
1228
|
-
#
|
|
1229
|
-
if round_num
|
|
1230
|
-
|
|
1231
|
-
|
|
1279
|
+
# After each non-final round, let the moderator stop early if the debaters
|
|
1280
|
+
# have converged. Needs both debaters' latest answers to compare.
|
|
1281
|
+
if round_num < rounds:
|
|
1282
|
+
latest_ok = [
|
|
1283
|
+
latest[d.name] for d in debaters
|
|
1284
|
+
if d.name in latest and latest[d.name].status == "ok"
|
|
1285
|
+
]
|
|
1286
|
+
if len(latest_ok) >= 2 and await _moderator_signals_done(
|
|
1287
|
+
cfg, moderator, latest_ok, round_num
|
|
1288
|
+
):
|
|
1289
|
+
break
|
|
1232
1290
|
|
|
1233
1291
|
if not any(r.status == "ok" for r in transcript):
|
|
1234
|
-
_note("Debate produced no usable answers; skipping
|
|
1292
|
+
_note("Debate produced no usable answers; skipping the moderator verdict.")
|
|
1235
1293
|
return transcript
|
|
1236
1294
|
|
|
1237
|
-
# The
|
|
1238
|
-
# judging;
|
|
1239
|
-
#
|
|
1240
|
-
|
|
1241
|
-
_note(f"
|
|
1295
|
+
# The moderator always sees the transcript anonymized + shuffled (a model is
|
|
1296
|
+
# judging; no toggle). It runs in the same read-only / --yolo mode as the
|
|
1297
|
+
# debaters - no permission bypass.
|
|
1298
|
+
verdict_prompt, _label_map = build_verdict_prompt(cfg.prompt, transcript)
|
|
1299
|
+
_note(f"Moderator {moderator.name} writing the final answer...")
|
|
1242
1300
|
verdict = await run_provider(
|
|
1243
|
-
|
|
1301
|
+
moderator, verdict_prompt, cfg.timeout, cfg.models.get(moderator.name), cfg.yolo
|
|
1244
1302
|
)
|
|
1245
1303
|
transcript.append(verdict)
|
|
1246
1304
|
_emit(
|
|
1247
|
-
json.dumps(
|
|
1305
|
+
json.dumps(verdict_record(verdict, moderator.name))
|
|
1248
1306
|
if cfg.json_output
|
|
1249
|
-
else
|
|
1307
|
+
else render_verdict_block(verdict, moderator.name)
|
|
1250
1308
|
)
|
|
1251
1309
|
return transcript
|
|
1252
1310
|
|
|
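Note on the hunk above: the per-debater convergence flag is replaced by a moderator check that runs after each non-final round and stops the debate only on a successful reply beginning with `CONVERGENCE_DONE`. The sketch below replays that control flow synchronously with canned moderator replies. The sentinel value `"DONE"` is an assumption (the diff shows only the constant's name and the docstring's "clean DONE reply"), and `signals_done` and `rounds_run` are hypothetical names; the real code is async and calls `run_provider`.

```python
CONVERGENCE_DONE = "DONE"  # assumed sentinel value; the diff shows only the name


def signals_done(status: str, stdout: str) -> bool:
    """True only for a successful reply that starts with the sentinel."""
    return status == "ok" and stdout.strip().upper().startswith(CONVERGENCE_DONE)


def rounds_run(rounds: int, moderator_replies: list[tuple[str, str]]) -> int:
    """Count rounds executed when the moderator is consulted after each
    non-final round. `moderator_replies` is canned (status, stdout) data."""
    replies = iter(moderator_replies)
    executed = 0
    for round_num in range(1, rounds + 1):
        executed = round_num  # the debater turns would run here
        # A missing or failed reply is treated like CONTINUE: keep debating.
        if round_num < rounds and signals_done(*next(replies, ("error", ""))):
            break
    return executed


if __name__ == "__main__":
    print(rounds_run(3, [("ok", "CONTINUE"), ("ok", "DONE")]))
```

Skipping the check on the final round avoids one moderator call whose answer could not change anything, since the loop ends there regardless.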
@@ -1313,7 +1371,7 @@ def config_show() -> None:
|
|
|
1313
1371
|
|
|
1314
1372
|
@config_app.command("set")
|
|
1315
1373
|
def config_set(
|
|
1316
|
-
key: Annotated[str, typer.Argument(help="Config key: num | timeout | synthesizer | exclude | model.")],
|
|
1374
|
+
key: Annotated[str, typer.Argument(help="Config key: num | timeout | synthesizer | moderator | exclude | model.")],
|
|
1317
1375
|
value: Annotated[str, typer.Argument(help="Value. For models: PROVIDER=MODEL. For exclude: comma-separated names.")],
|
|
1318
1376
|
) -> None:
|
|
1319
1377
|
"""Write a value to the config file, creating the dir/file if missing."""
|
|
@@ -1346,7 +1404,7 @@ def config_set(
|
|
|
1346
1404
|
raise typer.BadParameter(str(exc)) from exc
|
|
1347
1405
|
config[key] = coerced
|
|
1348
1406
|
else:
|
|
1349
|
-
known = "num, timeout, synthesizer, exclude, model"
|
|
1407
|
+
known = "num, timeout, synthesizer, moderator, exclude, model"
|
|
1350
1408
|
raise typer.BadParameter(f"Unknown config key: {key!r}. Known: {known}.")
|
|
1351
1409
|
|
|
1352
1410
|
write_config(config)
|