@dogfood-lab/study-swarm (npm) 0.6.0 → 1.1.0

Files changed (16)

package/CHANGELOG.md +56 -0
package/PROTOCOL.md +39 -4
package/README.es.md +73 -32
package/README.fr.md +73 -32
package/README.hi.md +79 -38
package/README.it.md +80 -39
package/README.ja.md +79 -38
package/README.md +46 -5
package/README.pt-BR.md +73 -32
package/README.zh.md +80 -39
package/SECURITY.md +6 -6
package/bin/study-swarm.mjs +176 -48
package/examples/study-swarm-ci.yml +28 -0
package/examples/study-swarm-self.dispatch.md +46 -0
package/examples/study-swarm-v1_1.dispatch.md +89 -0
package/package.json +2 -1

package/examples/study-swarm-v1_1.dispatch.md ADDED

@@ -0,0 +1,89 @@
+<!-- study-swarm v1.1.0 · protocol-sha256:4479e7d2d758f42a · created:2026-06-29 -->
+# Study-swarm dispatch: study-swarm-v1_1 (the protocol run on itself)
+> **Meta-dispatch.** study-swarm v1.0.0 grounds its *central* decision (different-family verifier
+> + retrieval oracle + ensemble diversity) in 6 citations. This dispatch runs the protocol on the
+> **four design questions v1.0.0 leaves answered by "I think…", not "evidence says…"** — the v1.1 surface.
+> Every citation below was gated through Step 4 (retrieval oracle for existence + two different-family
+> groundedness lenses, reasoning-stripped) **before** it informed the architecture. The synthesizer is
+> Claude/Opus; the verifier families are Mistral + IBM Granite + the deterministic arXiv oracle — none of
+> them Claude. Run `study-swarm lint study-swarm-v1_1.dispatch.md` (it passes).
+## Step 1 — Load-bearing questions
+Each passes the v1.0.0 test (two real designs hinge on the answer; an adjacent field has measured it; the current spec is silent or hand-wavy):
+- **A — Groundedness mechanism.** Step 4 stage-2 says only "an NLI-style support check." Should it check the finding sentence *whole*, or **decompose** it into atomic/molecular claims and check each?
+- **B — Generation-time grounding.** Step 2 just "asks for URLs" and catches fabrication downstream by *dropping* findings. Should Step 2 instead **force retrieval at generation time**, or is post-hoc Step-4 verification enough?
+- **C — Aggregation rule.** v1.0.0 mandates "≥3 decorrelated lenses, diversity > count" but never says how to **combine** their verdicts. Disjunction? Majority? An oracle-gated cascade? (The published proof showed union catches traps *but* LLMs false-flag real recent papers.)
+- **D — Calibrated abstention.** The halt table has a hard `CANNOT_CONFIRM` category. Should verdicts instead carry **calibrated confidence** with a tuned **abstention** threshold?
+## Step 2 — Research dispatch
+Four parallel research agents (one per question), retrieval-required — a paper an agent could not fetch did not enter the dispatch. **Process note (an ANDON receipt):** lanes C and D first returned schema-valid *placeholder stubs* despite heavy retrieval (19–20 tool calls each); per the protocol's own "a research lane returning placeholders halts that lane" rule, both were **discarded and re-dispatched** with an anti-stub guard (C succeeded; D crashed the output schema and was re-run as a plain-text agent). This is finding **B** happening to the dispatch itself — generation succeeded but emission lost it; a coverage-recovery pass recovered it.
+## Step 3 — Research grounding
+<!-- Every finding: author + year + resolvable arXiv id, one-sentence finding, design implication. All gated by Step 4 below before Step 5. Findings are phrased to what the retrieved source supports; precise figures that live in a paper's body but not its abstract were softened to the abstract-grounded claim during Step 4 (noted there). -->
+1. **(A) Breaking text into atomic facts and scoring the fraction supported exposes partial-support failures a whole-sentence judgment masks — ChatGPT biographies score only ~58%, and an automated estimator tracks human scoring within ~2%.** Min et al. 2023 (arXiv:2305.14251). Implication: stage-2 should score *fraction-of-claims-supported*, because the danger case (a real paper whose finding sentence overstates it) is invisible whole-sentence but surfaces as one unsupported atomic claim.
+2. **(A) Citation support is a ternary state (fully / partially / not), and even the best systems lack complete support ~50% of the time on ELI5 — partial support is the dominant, hardest-to-catch state.** Gao et al. 2023 (arXiv:2305.14627). Implication: the stage-2 verdict space must be ternary, not binary; a "partially supported" finding routes to correct-once/escalate, never auto-pass.
+3. **(A) Automatic attribution evaluation has a hard ceiling — a fine-tuned GPT-3.5 reaches only ~80% macro-F1, and the majority of its errors come from insensitivity to fine-grained information.** Li et al. 2024 (arXiv:2402.15089). Implication: no single LLM judge is a reliable groundedness oracle — this independently re-justifies the ≥3-lens ensemble and an abstain-on-nuance rule.
+4. **(A) Decompose-then-verify scores are sensitive to the decomposition method itself, so the metric must not attribute decomposition error to the text.** Wanner et al. 2024 (arXiv:2403.11903). Implication: pin the decomposer per run (PIN_PER_STEP) and do not score by raw subclaim count.
+5. **(A) Decomposition scores can be inflated by padding with obvious/repetitive subclaims; filtering subclaims by informativeness/uniqueness makes precision substantially more robust.** Jiang et al. 2024 (arXiv:2407.03572). Implication: stage-2 needs an informativeness filter so only the load-bearing claim in a finding gates the verdict — boilerplate earns no support credit, blocking pad-to-pass.
+6. **(A) Decontextualizing atomic claims before verification raises accuracy and almost never flips a true claim to false, so it safely rescues claims naive decomposition would wrongly drop.** Wanner et al. 2024 (arXiv:2412.13175). Implication: decontextualize each claim (resolve referents) before NLI-checking; the near-zero true→false rate makes this safe by default.
+7. **(A) "Molecular" facts — decontextualized + minimal — verify more accurately than fully atomic claims, while over-decontextualization loses error-localizing information.** Gunjal & Durrett 2024 (arXiv:2406.20079). Implication: target *molecular* granularity, not maximal atomicity — the concrete spec for stage-2 (neither whole-sentence nor over-shredded).
+8. **(B) Fine-tuning a model to browse and collect references during generation makes its answers human-checkable and preferred to demonstrator and reference answers.** Nakano et al. 2021 (arXiv:2112.09332). Implication: Step 2 should run agents in a browse-then-cite loop, citing only fetched sources, so each citation is attributable at generation time.
+9. **(B) Training a model to attach supporting evidence per claim and abstain when unsure raises supported-answer rates — but adversarial evaluation shows evidence-backed claims can still be false.** Menick et al. 2022 (arXiv:2203.11147). Implication: keep BOTH a generation-time grounding step AND the Step-4 gate; add abstention to Step 2 (an agent that cannot ground a claim drops it).
+10. **(B) An inline retrieve-and-self-critique loop cuts off-source (ungrounded) generation by roughly an order of magnitude versus comparable instruction-tuned models.** Asai et al. 2023 (arXiv:2310.11511). Implication: a lightweight in-Step-2 "is this in a fetched source?" check eliminates the bulk of fabrication before the gate runs, leaving Step 4 the residual.
+11. **(B) Parametric models fail badly on fast-changing knowledge while search augmentation substantially improves correctness, and both the number and ordering of retrieved evidences matter.** Vu et al. 2023 (arXiv:2310.03214). Implication: for a fast-moving field, recall of recent papers is the worst case — Step 2 must force live retrieval of multiple sources, not the first hit.
+12. **(B) Comparing generation-time vs post-hoc citation, retrieval is the main driver of quality in both, and there is a consistent trade-off: generation-time maximizes precision at the cost of coverage; post-hoc achieves higher coverage at competitive correctness.** Saxena et al. 2025 (arXiv:2509.21557). Implication: do not pick one axis — generation-time grounding floor + post-hoc groundedness ceiling + an explicit coverage-recovery sweep so true-but-hard-to-retrieve findings aren't silently dropped.
+13. **(B) Auditing real LLM/agent citations, a meaningful fraction of URLs are fully hallucinated and more are non-resolving, and citation-heavy "deep research" agents hallucinate at higher rates.** Rao et al. 2026 (arXiv:2604.03173). Implication: keep the deterministic existence oracle even under generation-time grounding (don't trust an agent's claim it fetched a source), and treat citation-heavy agents as higher-risk.
+14. **(B) A retrieval-grounded citation verifier reaches ~89 macro-F1 detecting hallucinated/corrupted citations and outperforms strong web-search LLM baselines, while a reasoning-only judge tops out far lower.** Khajavi et al. 2026 (arXiv:2605.27700). Implication: both the generation step AND the verifier lens must have live source access — a reasoning-only (memory) judge is the weakest configuration.
+15. **(C) A 9-judge panel across 7 model families provides only ~2 effective independent votes, and no aggregation algorithm fixes this because the bottleneck is correlated inputs, not the algorithm.** Kohli 2026 (arXiv:2605.29800). Implication: more LLM lenses cannot fix correlated false-flagging of recent papers — the deterministic oracle is load-bearing precisely because it's the one genuinely decorrelated, non-LLM lens; never stack same-family lenses expecting reliability.
+16. **(C) LLM validators have an agreeableness bias (high true-positive but very low true-negative rate); a tuned minority-veto beats both majority voting and raw disjunction at catching invalid items while bounding over-rejection.** Jain et al. 2025 (arXiv:2510.11822). Implication: the groundedness vote should be a *minority-veto with a tuned threshold n* — the explicit knob trading trap-catch against false-flagging — not disjunction (over-rejects) or majority (misses single-lens catches).
+17. **(C) A small human-labeled calibration set with bias correction beats adding more judges, halving maximum error.** Jain et al. 2025 (arXiv:2510.11822). Implication: maintain a small held-out set of labeled (real/fabricated/misattributed) citations and fit a lightweight bias-correction on the lenses' raw verdicts — cheaper and better than decorrelating yet more families.
+18. **(C) Agreement-based cascading uses inter-model disagreement as the routing/escalation signal and beats single-model-confidence cascades.** Kolawole et al. 2024 (arXiv:2407.02348). Implication: treat lens *disagreement* (oracle confirms existence but groundedness lenses split, especially on a post-cutoff paper) as the trigger to escalate-rather-than-auto-reject — directly bounding over-rejection of genuine recent work.
+19. **(C) LLM judges are systematically overconfident — verbalized confidence overstates accuracy — and a risk-aware confidence fusion makes them more reliable.** Tian et al. 2025 (arXiv:2508.06225). Implication: never trust a lens's raw verbalized confidence for aggregation; down-weight a confident "fabricated" flag on a recent paper relative to the oracle's existence verdict.
+20. **(C) Linear probes on a judge's hidden states give better-calibrated uncertainty than verbalized confidence, with conservative estimates suited to low-false-positive settings.** Radharapu et al. 2025 (arXiv:2512.22245). Implication: where lens internals are reachable, let a lens *abstain* below a calibrated-confidence threshold instead of casting a likely-correlated wrong vote — converting a false-flag into a no-vote that lets the oracle carry existence.
+21. **(C) Aggregators that assume independent judge errors (majority, averaging) gain little or amplify mistakes; explicitly modeling the shared confounder is more reliable.** Zhao et al. 2026 (arXiv:2603.00039). Implication: the aggregation rule must model the training-cutoff blind spot as a shared confounder and discount correlated "fabricated" votes when the un-confounded oracle confirms existence — formalizing why the cascade beats flat voting.
+22. **(D) Training a model to emit "I don't know" as a first-class refusal yields better-calibrated uncertainty than post-hoc thresholding, and the refusal skill generalizes out-of-domain.** Zhang et al. 2023 (arXiv:2311.09677). Implication: keep `CANNOT_CONFIRM` a *first-class* verdict the verifier is instructed to produce — do not collapse the halt table to accept/reject plus a confidence cut.
+23. **(D) Conformal uncertainty gives a finite-sample statistical guarantee on the correctness-coverage rate of the answered set across many models and free-form tasks while keeping prediction sets small.** Wang et al. 2024 (arXiv:2407.00499). Implication: tune the abstention threshold with conformal calibration so the *accepted* citation set carries a provable error bound (e.g. "≤5% of confirmed citations are wrong") with a tunable risk knob.
+24. **(D) Conformal factuality "backs off" to less-specific output and abstains on uncertain sub-claims, giving 80–90% correctness guarantees while retaining most of the output.** Mohri & Hashimoto 2024 (arXiv:2402.10978). Implication: abstention need not be all-or-nothing per citation — partially confirm the supported molecular claims and escalate only the unconfirmable one, preserving coverage.
+25. **(D) Entropy / raw confidence alone is insufficient for safe abstention because models are confidently wrong; combining it with an external correctness signal is required.** Phillips et al. 2026 (arXiv:2603.21172). Implication: gate abstention on *external evidence presence* (was the source fetched, does the retrieved text contain the claim) — not on the verifier's own entropy or verbalized confidence.
+26. **(D) Trust-induced over-reliance is large, and always-on/non-adaptive explanations backfire — only trust-gated, selectively-surfaced counter-explanations reduce inappropriate reliance.** Srinivasan & Thomason 2025 (arXiv:2502.13321). Implication: surface `CANNOT_CONFIRM` *contrastively and selectively* ("I expected to find X and didn't"), never as an always-on confidence bar a human will rubber-stamp.
+27. **(D) Refusal-aware tuning has a documented over-refusal failure mode that must be actively balanced against coverage.** Zhu et al. 2025 (arXiv:2502.05911). Implication: instrument and cap the abstain/escalation rate against a labeled holdout, and treat an abstain-rate spike as its own ANDON trigger — not a success.
+## Step 4 — External verification
+**Run against this dispatch's own 27 citations before Step 5 was written.** Synthesizer = Claude/Opus; verifier families = the deterministic arXiv oracle + Mistral (`mistral-small:24b`) + IBM Granite (`granite4.1:30b`), reasoning-stripped (lenses saw only the bare claim + the source title/abstract — never the implications or any synthesizer reasoning).
+- [x] every citation resolved by retrieval (arXiv/DOI), not model memory — structured arXiv API
+- [x] every finding matches what its source actually claims (groundedness) — two different families vs each abstract
+- [x] >= 3 decorrelated lenses (retrieval oracle + >= 2 different model families) — arXiv oracle + Mistral + Granite
+**Existence / attribution (retrieval oracle).** All **27/27** papers resolved with correct titles and years. **0 fabricated.** Five attribution corrections the oracle made that no parametric model could: CiteCheck authors `Anonymous → Khajavi et al.` (#14); DnDScore author list trimmed to Wanner, Van Durme & Dredze (#6); R-Tuning year `2024 → 2023` (#22); SConU year `2024 → 2025`; **GRAIT first author `Fang → Zhu` — a real misattribution the research agent flagged itself, corrected once (#27).**
+**Postdated-paper check.** Six 2025–2026 papers (#12 Saxena, #13 Rao, #14 Khajavi, #15 Kohli, #21 Zhao, #25 Phillips) — which a parametric LLM would false-flag as fabricated — were all **oracle-confirmed real**. This is the existence-must-be-retrieval thesis, executed.
+**Groundedness (two different-family lenses vs each abstract).** Core qualitative claims **SUPPORTED** by both lenses. Precise figures that live in a paper's *body* but not its *abstract* were correctly flagged PARTIAL/NOT by the lenses and **softened to the abstract-grounded claim** in Step 3 (e.g. #3's "66%" → "the majority"; #6's "33→51.6%"; #7's "74.7 vs 68.7"; #10's "2% vs 18–20%"; #16's veto magnitudes; #26's per-condition deltas). No finding was dropped; none was fabricated; none mis-first-authored after correction.
+**The dispatch demonstrated its own findings, live:** (a) the lenses flagged exactly the *overstated-number* zone finding **A** is about; (b) Mistral returned a confident `NOT_SUPPORTED` on #21/CARE whose abstract is entirely about confounder modeling — a "confidently-wrong judge" (findings 19, 25) — while Granite was correct, and the **disagreement** triggered adjudication (finding 18); (c) both lenses under-credited material literally present in abstracts (#1 estimator, #9 TruthfulQA, #13 Wayback) — correlated lens noise (findings 15, 21) that only the deterministic oracle is immune to; (d) abstention fired on *evidence absence* (number-not-in-abstract), not model entropy — exactly finding **25**'s prescription. **No verifier was Claude; the protocol did not grade its own homework.**
+## Step 5 — Architecture (study-swarm v1.1)
+Each choice traces to findings by number.
+- **A1 — Stage-2 becomes molecular-claim decomposition, not whole-sentence NLI.** Decompose each finding into *molecular* claims (decontextualized + minimal), informativeness-filter to the load-bearing claim, NLI-check each against the source, and score *fraction-supported*. (findings 1, 5, 6, 7)
+- **A2 — The groundedness verdict is ternary.** Fully / partially / not supported; "partially supported" (the link resolves, the paper is real, the sentence overstates) routes to **correct-once or escalate**, never auto-pass. (findings 2, 3)
+- **A3 — Pin the decomposer; don't score by subclaim count.** The verdict is sensitive to the decomposition method, so the decomposer prompt/model is pinned per run and padding earns no credit. (findings 4, 5)
+- **B1 — Step 2 mandates retrieval-grounded generation.** Agents browse-then-cite, cite only fetched sources, and *drop* (not invent) a claim they cannot ground — a lightweight in-loop "is this in a fetched source?" check. (findings 8, 9, 10, 11, 14)
+- **B2 — …but keep the Step-4 gate and add coverage recovery.** Generation-time grounding maximizes precision at the cost of coverage, so a post-hoc sweep recovers true-but-hard-to-retrieve findings, and the deterministic existence oracle stays even under generation-time grounding. (findings 9, 12, 13)
+- **C1 — The aggregation rule is the cascade.** Existence is gated **authoritatively by the deterministic oracle** (no LLM vote — the only genuinely decorrelated lens); groundedness uses the LLM lenses only. (findings 15, 21)
+- **C2 — Groundedness uses a tuned minority-veto, not disjunction or majority.** The veto threshold `n` is the explicit knob trading trap-catch against over-rejection; a small labeled calibration set + bias correction beats adding lenses. (findings 16, 17)
+- **C3 — Lens disagreement escalates; it never auto-rejects.** When the oracle confirms existence but the groundedness lenses split — especially on a post-cutoff paper — the dispatch escalates to a human rather than rejecting genuine recent work, and confident "fabricated" flags on recent papers are down-weighted. (findings 18, 19, 20, 21)
+- **D1 — `CANNOT_CONFIRM` stays a first-class verdict.** It is *not* collapsed into accept/reject + a confidence cut; the verifier is instructed to produce it. (finding 22)
+- **D2 — Abstention is conformally calibrated and evidence-gated.** The threshold is tuned for a provable accepted-set error bound; abstention triggers on **external evidence absence**, never the verifier's own entropy/verbalized confidence; partial confirmation preserves the supported claims. (findings 23, 24, 25)
+- **D3 — Surface contrastively, and cap the abstain rate as an ANDON.** `CANNOT_CONFIRM` is shown contrastively/selectively (not an always-on confidence bar), and an abstain-rate spike on a labeled holdout is itself a halt signal, not a success. (findings 26, 27)
+**Net:** the verifier-protected envelope is unchanged in spirit but specified where v1.0.0 was silent — *how* groundedness is checked (molecular decomposition, ternary), *how* the lenses are combined (oracle-gated cascade + minority-veto + disagreement-escalation), *when* the research step grounds (generation-time floor + coverage recovery), and *how* abstention is calibrated and surfaced (first-class, conformal, evidence-gated, capped). Every one of these is both retrieval-verified above and was demonstrated on this very dispatch.
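
The aggregation rules C1 to C3 in the dispatch above can be read as a small decision function. The following Python is only an illustrative sketch, not code from the package: the names `Verdict`, `Outcome`, `aggregate`, and `veto_threshold` are hypothetical, and two behaviours are assumptions the dispatch does not spell out (a unanimous "not supported" is treated as a reject, and a lens may abstain and cast no vote). It shows the oracle deciding existence alone, the LLM lenses voting only on groundedness, a minority-veto threshold `n`, and a split among lenses escalating rather than rejecting.

```python
from enum import Enum
from typing import Optional, Sequence


class Verdict(Enum):
    """Ternary groundedness verdict (A2), plus an explicit abstention."""
    FULL = "fully_supported"
    PARTIAL = "partially_supported"
    NOT = "not_supported"
    ABSTAIN = "abstain"  # assumption: an abstaining lens casts no vote


class Outcome(Enum):
    ACCEPT = "accept"
    CORRECT_OR_ESCALATE = "correct_once_or_escalate"
    ESCALATE = "escalate_to_human"
    REJECT = "reject"
    CANNOT_CONFIRM = "CANNOT_CONFIRM"


def aggregate(oracle_exists: Optional[bool],
              lens_votes: Sequence[Verdict],
              veto_threshold: int = 1) -> Outcome:
    """Hypothetical oracle-gated cascade over one citation.

    oracle_exists: True/False from the deterministic retrieval oracle,
                   or None if the oracle could not decide.
    lens_votes:    groundedness verdicts from the LLM lenses.
    veto_threshold: the minority-veto knob n from C2.
    """
    # C1: existence is decided by the oracle alone; no LLM vote counts here.
    if oracle_exists is None:
        return Outcome.CANNOT_CONFIRM
    if not oracle_exists:
        return Outcome.REJECT

    # Groundedness uses only the lenses that actually voted.
    votes = [v for v in lens_votes if v is not Verdict.ABSTAIN]
    if not votes:
        return Outcome.CANNOT_CONFIRM

    # C2: fewer than n objections means the minority veto does not fire.
    objections = sum(1 for v in votes if v is not Verdict.FULL)
    if objections < veto_threshold:
        return Outcome.ACCEPT

    # C3: the veto fired but the lenses disagree, so escalate, never auto-reject.
    if len(set(votes)) > 1:
        return Outcome.ESCALATE

    # Unanimous objection: partial support goes to correct-once/escalate (A2).
    if votes[0] is Verdict.PARTIAL:
        return Outcome.CORRECT_OR_ESCALATE
    return Outcome.REJECT  # assumption: unanimous "not supported" rejects
```

Under these assumptions, the Step 4 case where one lens returned `NOT_SUPPORTED` and the other supported the claim maps to `aggregate(True, [Verdict.NOT, Verdict.FULL])`, which escalates instead of rejecting. Raising `veto_threshold` tolerates more lone objections, which is the trade-off between trap-catch and over-rejection that C2 describes.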

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@dogfood-lab/study-swarm",
-  "version": "0.6.0",
+  "version": "1.1.0",
   "description": "Ground design decisions in cited research, then verify every citation with a different model family before it becomes canon — a research-grounded design protocol, with a thin CLI.",
   "keywords": [
     "methodology",
@@ -34,6 +34,7 @@
   },
   "files": [
     "bin/",
+    "examples/",
     "README.md",
     "README.ja.md",
     "README.zh.md",