@dogfood-lab/study-swarm 0.6.0 → 1.0.0

package/README.zh.md CHANGED
@@ -13,77 +13,114 @@
13
13
  <img src="https://img.shields.io/badge/cited%20research-verified-1f6feb" alt="Cited research, verified">
14
14
  </p>
15
15
 
16
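- **Ground design decisions in cited research, then verify those citations with a *different* model family before anything is established as canon.**
16
+ **Ground design decisions in cited research, then have a *different* model family verify the citations before anything becomes canon.**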
17
17
 
18
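- `study-swarm` is a protocol, not a tool. When you make a substantial design decision with a large language model (LLM), such as creating a new product layer, choosing an architecture, or deciding "should we trust the model here", designing from experience alone produces stale designs, and citing papers from memory produces designs that rest on sources that do not exist or do not say what you think they say. `study-swarm` replaces both: launch several research agents at once, require specific cited findings, and verify every citation through an **external verifier from a different model family** before it is used to guide the design.
18
+ `study-swarm` is a protocol, not a tool. When you make a substantial design decision with an LLM (a new product layer, an architecture choice, or "should we trust the model here"), improvising from first principles produces stale designs, and citing papers from memory produces designs that rest on sources that do not exist or do not say what you think they say. `study-swarm` replaces both: dispatch parallel research agents, require specific cited findings, and let each citation inform the design only after it passes an **external verifier from a different model family**.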
19
19
 
20
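- It also applies to itself. The protocol prescribes a verifier-gated mechanism for the systems it helps design, so it applies that mechanism to itself too. **No model grades its own homework, including the model running the protocol.**
20
+ It also applies to itself. The protocol prescribes verifier-gated safeguards for the systems it helps design, so it runs those safeguards on itself as well. **No model grades its own homework, including the model running the protocol.**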
21
21
 
22
- ## The protocol has five steps
22
+ ## The protocol has five steps:
23
23
 
24
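- 1. **Identify** 3-5 load-bearing design questions whose answers could change if empirical evidence exists.
25
- 2. **Launch** a research agent for each question at the same time. Each agent must return paper title + authors + year + URL + a one-sentence conclusion, favoring specificity over breadth ("6-8 well-sourced conclusions beat 20 vague descriptions").
26
- 3. **Synthesize the conclusions** into a "Research grounding" section: `N. **<conclusion>.** <Authors> <year> (<arXiv/DOI>). <design implication>.`
27
- 4. **Verify externally**: a *different model family*, reasoning-stripped, checks every citation in two stages. A **retrieval oracle** confirms the paper exists (never the model's memory), then a **reliability** filter confirms the conclusion matches the source. **Halt** on a fabrication or misquotation; **halt and escalate** if the verifier or the retrieval oracle is unavailable (never treat "could not be found" as "the citation is fine").
28
- 5. **Connect every architectural choice to a conclusion**, linked by number. A citation with no design implication is noise.
24
+ 1. **Identify** 3-5 load-bearing design questions whose answers could change if empirical evidence exists.
25
+ 2. **Dispatch** a research agent for each question, in parallel. Each agent must return paper title + authors + year + URL + a one-sentence summary, favoring specificity over breadth ("6-8 well-grounded findings beat 20 vague bullet points").
26
+ 3. **Synthesize** the findings into a *Research grounding* section: `N. **<finding>.** <Authors> <year> (<arXiv/DOI>). <design implication>.`
27
+ 4. **Verify externally**: a *different model family* (reasoning-stripped) checks every citation in two stages. A **retrieval oracle** confirms the paper exists (never the model's memory), then a **faithfulness** filter confirms the finding matches the source. **Halt** on a fabrication or misattribution; **halt and escalate** if the verifier or the retrieval oracle is unavailable (never treat "unreachable" as "the citation is fine").
28
+ 5. **Connect** every architectural choice to a numbered finding. A citation with no design implication is noise.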
29
29
 
30
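- The full executable detail (the halt table, the sourcing standard, the integration rules) lives in **[PROTOCOL.md](PROTOCOL.md)**.
30
+ The full executable detail (the halt table, the sourcing standard, the integration rules) lives in **[PROTOCOL.md](PROTOCOL.md)**.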
31
31
 
32
- ## Why use a *different* model family, with reasoning stripped?
32
+ ## Why does it need a *different* model family, with reasoning stripped?
33
33
 
34
- Because the failure modes are known, not hypothetical:
34
+ Because its failure modes are known, not hypothetical:
35
35
 
36
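- - **Large language models cannot reliably verify their own output.** Huang et al. 2023 ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798)); Kambhampati et al. 2024 ([arXiv:2402.01817](https://arxiv.org/abs/2402.01817), LLM-Modulo); Stechly et al. 2024 ([arXiv:2402.08115](https://arxiv.org/abs/2402.08115)): external verifiers deliver gains; self-critique is ineffective.
37
- - **Same-family evaluators are biased toward themselves.** Panickssery, Bowman & Feng 2024 ([arXiv:2404.13076](https://arxiv.org/abs/2404.13076)): self-recognition correlates *linearly* with self-preference, so partial hiding does not help. Verga et al. 2024 ([arXiv:2404.18796](https://arxiv.org/abs/2404.18796), PoLL): a panel of evaluators from different families is less biased, at roughly 1/7 of the cost.
38
- - **Citations are where large language models lie most readily.** Walters & Wilder 2023 ([doi:10.1038/s41598-023-41032-5](https://doi.org/10.1038/s41598-023-41032-5)): 55% of GPT-3.5 citations and 18% of GPT-4 citations were fabricated. Onweller et al. 2026 ([arXiv:2605.06635](https://arxiv.org/abs/2605.06635)): links resolve more than 94% of the time, but only 39-77% of the cited content actually supports the claim. Existence must therefore be checked by **retrieval**, not by **recall**.
39
- - **Hide the generator's reasoning.** Khalifa et al. 2026 ([arXiv:2601.14691](https://arxiv.org/abs/2601.14691), "deceiving the evaluator"): manipulating only the chain of thought can raise an evaluator's false-positive rate by up to 90% while the actions stay the same. Turpin et al. 2023 ([arXiv:2305.04388](https://arxiv.org/abs/2305.04388)): chain of thought is a post-hoc rationalization. The verifier sees the bare cited claim, not "why I included this".
40
- - **Diversity beats quantity.** Rajan 2025 ([arXiv:2511.16708](https://arxiv.org/abs/2511.16708)): four verifiers with pairwise correlation ρ ∈ [0.05, 0.25] beat any single smart evaluator through submodular coverage. Kim et al. 2025 ([arXiv:2506.07962](https://arxiv.org/abs/2506.07962)): LLM errors are *correlated*, so the key variable is diversity of perspective, not raw count.
36
+ - **LLMs cannot reliably verify their own output.** Huang et al. 2023 ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798)); Kambhampati et al. 2024 ([arXiv:2402.01817](https://arxiv.org/abs/2402.01817), LLM-Modulo); Stechly et al. 2024 ([arXiv:2402.08115](https://arxiv.org/abs/2402.08115)): external verifiers have the advantage; self-critique is ineffective.
37
+ - **Same-family judges favor themselves.** Panickssery, Bowman & Feng 2024 ([arXiv:2404.13076](https://arxiv.org/abs/2404.13076)): self-recognition correlates linearly with self-preference, so partial blinding does not help. Verga et al. 2024 ([arXiv:2404.18796](https://arxiv.org/abs/2404.18796), PoLL): a jury spanning different families is less biased, at roughly 1/7 of the cost.
38
+ - **Citations are where LLMs lie.** Walters & Wilder 2023 ([doi:10.1038/s41598-023-41032-5](https://doi.org/10.1038/s41598-023-41032-5)): 55% of GPT-3.5 citations and 18% of GPT-4 citations were fabricated. Onweller et al. 2026 ([arXiv:2605.06635](https://arxiv.org/abs/2605.06635)): links resolve more than 94% of the time, but only 39-77% of the cited content actually supports the claim. Existence must therefore be checked by **retrieval, not recall**.
39
+ - **Hide the generator's reasoning.** Khalifa et al. 2026 ([arXiv:2601.14691](https://arxiv.org/abs/2601.14691), "deceiving the judge"): manipulating only the chain of thought can raise a judge's false-positive rate by up to 90% while the actions stay the same. Turpin et al. 2023 ([arXiv:2305.04388](https://arxiv.org/abs/2305.04388)): chain of thought is a post-hoc rationalization. The verifier sees only the bare cited claim, never "why I included this".
40
+ - **Diversity beats quantity.** Rajan 2025 ([arXiv:2511.16708](https://arxiv.org/abs/2511.16708)): four verifiers with pairwise correlation ρ ∈ [0.05, 0.25] outperform any single verifier through submodular coverage. Kim et al. 2025 ([arXiv:2506.07962](https://arxiv.org/abs/2506.07962)): LLM errors are *correlated*, so the key variable is diversity of perspective, not raw count.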
41
41
 
42
42
  ## Does it actually work? (The proof)
43
43
 
44
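- As a test, the protocol was applied to its own citations. Two unrelated non-Claude families, **Mistral** (`mistral-small:24b`) and **IBM Granite** (`granite4.1:30b`), checked a set of citations with reasoning stripped, and two blind traps were planted:
44
+ As a test, the protocol was run against its own citations. Two unrelated non-Claude families, **Mistral** (`mistral-small:24b`) and **IBM Granite** (`granite4.1:30b`), checked a set of citations reasoning-stripped, with two blind traps planted: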
45
45
 
46
- | Trap planted | Mistral | IBM Granite | Ground truth |
46
+ | Planted trap | Mistral | IBM Granite | Ground truth |
47
47
  |---|---|---|---|
48
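- | Chain-of-thought prompting attributed to "Nakamura & Olsen" | missed | **caught** (misattribution → actually Wei et al. 2022) | misattribution |
49
- | A fabricated "98% of errors eliminated, no oracle needed" paper | **caught** (fabricated) | **caught** (fabricated) | fabrication |
48
+ | Chain-of-thought prompting attributed to "Nakamura & Olsen" | missed | **caught** (misattribution → actually Wei et al. 2022, arXiv:2201.11903) | misattribution |
49
+ | A fabricated "98% of errors eliminated, no oracle needed" paper | **caught** (fabricated) | **caught** (fabricated) | fabrication |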
50
50
 
51
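- Neither model caught both traps on its own, but their **combination caught 2/2**. A single evaluator would have missed the misattribution. In addition, the retrieval oracle caught two *real* misattributions in our own design documents (the wrong authors were cited) that no parametric LLM could flag, and it correctly confirmed real 2026 papers that both LLMs wrongly flagged as fabricated simply because the papers were published after their training. The last point is that the Step 4 existence check **must** be a retrieval oracle, not an LLM.
51
+ Neither family caught both traps alone, but their **combination caught 2/2.** A single judge would have missed the misattribution. The retrieval oracle also caught two *real* misattributions in our own design docs (wrong authors cited) that no parametric LLM could flag, and it correctly confirmed genuine 2026 papers that both LLMs wrongly flagged as fabricated because they were published after the models' training data. That last point is why the Step 4 existence check **must** use a retrieval oracle, not an LLM.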
52
52
 
53
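- This one test is the whole idea in miniature: **uncorrelated perspectives + a retrieval oracle for existence beat any single smart evaluator.**
53
+ This run is the whole idea in miniature: **uncorrelated perspectives + a retrieval oracle for existence beat any one clever judge.**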
54
54
 
55
- ## The way it works
55
+ ## How it works
56
56
 
57
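- You can run the protocol by hand: any different model family, plus resolving the arXiv/DOI yourself, satisfies Step 4. Two companion tools reduce it to a single command:
57
+ You can run the protocol by hand: any different model, plus resolving the arXiv/DOI yourself, satisfies Step 4. Two companion tools make it a single command: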
58
58
 
59
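- - **[prism-verify](https://github.com/mcp-tool-shop-org/prism-verify)**: the runtime verifier, with different-family routing, reasoning stripping, multi-lens arbitration, deterministic retrieval-based existence verification (arXiv → Crossref), and signed receipts.
60
- - **[role-os](https://github.com/mcp-tool-shop-org/role-os)** provides the `roleos verify-citations <dispatch>` command, which extracts a dispatch's citations and verifies them through prism.
59
+ - **[prism-verify](https://github.com/mcp-tool-shop-org/prism-verify)**: the runtime verifier, with different-model routing, reasoning-stripped multi-lens arbitration, a deterministic retrieval-based existence ground truth (arXiv → Crossref), and signed receipts.
60
+ - **[role-os](https://github.com/mcp-tool-shop-org/role-os)**: provides `roleos verify-citations <dispatch>`, which extracts the citations in a dispatch and passes them to prism for verification.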
61
61
 
62
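- ## Command-line interface
62
+ The hand-off is the dispatch format itself: findings written as `N. **finding.** Authors year (arXiv|DOI). implication.`, each **carrying a resolvable identifier**, are exactly what `roleos verify-citations` processes and verifies. A dispatch that passes `lint` goes through cleanly; a malformed citation is flagged by that tool as unresolved. `study-swarm lint` checks this contract locally, so Step 3 and Step 4 agree on what counts as a citation.
63
+
64
+ ## Command-line interface (CLI)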
63
65
 
64
66
  ```bash
65
67
  npm i -g @dogfood-lab/study-swarm # or run ad-hoc: npx @dogfood-lab/study-swarm <command>
66
68
  ```
67
69
 
68
- | Command | Function |
70
+ | Command | What it does |
69
71
  |---|---|
70
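- | `study-swarm protocol` | Prints the full protocol: the five steps, the halt table, the sourcing standard. |
71
- | `study-swarm new <slug>` | Generates a `<slug>.dispatch.md` file containing a five-step scaffold to fill in. |
72
- | `study-swarm lint <file>` | Checks a dispatch's *Research grounding* against the sourcing standard: every finding needs an author, a year, and a resolvable identifier (arXiv / DOI / URL); vague "studies show…" claims are rejected. Returns `1` on any violation, which blocks CI. |
72
+ | `study-swarm protocol` | Prints the full protocol: the five steps, the halt table, and the sourcing standard. |
73
+ | `study-swarm new <slug>` | Creates a `<slug>.dispatch.md` file containing a five-step scaffold to fill in. |
74
+ | `study-swarm lint [--json] <path…>` | Checks a dispatch's *Research grounding* against the sourcing standard: every finding needs an author, a year, and a resolvable identifier (arXiv / DOI / URL); hand-waving such as "studies show…" is rejected. Exits `1` on any violation, which fails the CI run. `<path>` may be a file, a directory (linted recursively for all `*.dispatch.md` files), or `-` for stdin; `--json` emits a machine-readable report. |
75
+
76
+ `lint` is deterministic (it calls no model), so it is safe to use in CI. It enforces the **Step 3 sourcing standard** locally; the model-based **Step 4** verification still relies on [`roleos verify-citations`](https://github.com/mcp-tool-shop-org/role-os) → prism.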
73
77
 
74
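- The `lint` command is deterministic (it calls no model), so it is safe to use in CI. It enforces the **Step 3 sourcing standard** locally; the model-based **Step 4** verification still relies on [`roleos verify-citations`](https://github.com/mcp-tool-shop-org/role-os) → prism.
78
+ A typical flow: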
79
+
80
+ ```bash
81
+ study-swarm new my-decision # creates my-decision.dispatch.md
82
+ # …fill in the questions, run the research dispatch, write the findings…
83
+ study-swarm lint my-decision.dispatch.md # enforce the sourcing standard (Step 3)
84
+ roleos verify-citations my-decision.dispatch.md # model-based Step 4 (different family, via prism)
85
+ ```
86
+
87
+ A complete, `lint`-clean dispatch (study-swarm applied to its own design) is included at [`examples/study-swarm-self.dispatch.md`](examples/study-swarm-self.dispatch.md) as a reference example.
88
+
89
+ ### Validating in CI
90
+
91
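+ `lint` accepts files, directories (linted recursively for all `*.dispatch.md` files), or `-` for stdin, and `--json` emits a machine-readable report. Add it to your repository to validate the sourcing of every dispatch on each PR (a copy-paste sample is also included at [`examples/study-swarm-ci.yml`](examples/study-swarm-ci.yml)):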
92
+
93
+ ```yaml
94
+ # .github/workflows/dispatches.yml
95
+ name: study-swarm lint
96
+ on:
97
+   pull_request:
98
+     paths: ['**/*.dispatch.md', '.github/workflows/dispatches.yml']
99
+   workflow_dispatch:
100
+ concurrency:
101
+   group: ${{ github.workflow }}-${{ github.ref }}
102
+   cancel-in-progress: true
103
+ jobs:
104
+   lint:
105
+     runs-on: ubuntu-latest
106
+     steps:
107
+       - uses: actions/checkout@v4
108
+       - uses: actions/setup-node@v4
109
+         with: { node-version: '20' }
110
+       - run: npx @dogfood-lab/study-swarm@latest lint dispatches/
111
+ ```
75
112
 
76
- ## Why it works, in one sentence
113
+ ## Why it works, in one sentence:
77
114
 
78
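- **Timely**: the field moves fast; requiring specific, dated research keeps a design from lagging 18 months behind. **Practical**: the evidence shows what *fails*, not only what works (explanations can increase over-reliance on a *wrong* AI; Bansal et al. 2021). **Safe**: the verifier-gated scope is the evidence-backed architecture, and the protocol enforces it on its own output. Sourcing is not academic theater; it is the chain of evidence.
115
+ **Current**: the field moves fast; requiring specific, dated research keeps a design from lagging 18 months behind. **Functional**: the evidence shows what *failed*, not only what works (explanations can increase over-reliance on a *wrong* AI; Bansal et al. 2021, [arXiv:2006.14779](https://arxiv.org/abs/2006.14779)). **Safe**: the verifier-gated scope is the evidence-backed architecture, and the protocol enforces it on its own output. Sourcing is not an academic game; it is the chain of evidence.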
79
116
 
80
117
  ## Security
81
118
 
82
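- `study-swarm` is a documentation repository containing Markdown files and a logo. It contains no executable code and installs nothing from this repository. It touches no data, requires no permissions, and collects no telemetry; there are no secrets or credentials in the source. The methodology *describes* a workflow that uses web retrieval and model-based verification, but this repository does not implement or run that workflow. See [SECURITY.md](SECURITY.md).
119
+ `study-swarm` ships a **thin, zero-dependency CLI** (`study-swarm`) alongside the methodology. It makes **no network or model calls and collects no telemetry**; there are no secrets or credentials in the source. At runtime it only reads the files you pass to `lint` and writes one `<slug>.dispatch.md` file in the current directory (it refuses to overwrite and never leaves the working directory). The model-based verification the methodology describes (Step 4) is performed by the companion tools, not by this package. See [SECURITY.md](SECURITY.md).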
83
120
 
84
121
  ## Status
85
122
 
86
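- A workable protocol, externally validated by its own mechanism: different model families check its citations (see the proof above). This repository is the public reference; [PROTOCOL.md](PROTOCOL.md) is the executable form. It is part of the [dogfood-lab](https://github.com/dogfood-lab) family: methods and examples for building in the AI era.
123
+ A working protocol, externally validated through its own mechanism: different models check its citations (see the proof above). The repository is the public reference; [PROTOCOL.md](PROTOCOL.md) is the executable form. It is part of the [dogfood-lab](https://github.com/dogfood-lab) family: methods and examples for building in the AI era.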
87
124
 
88
125
  MIT licensed.
89
126
 
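The hand-off described in the README depends on every finding carrying an identifier that a downstream tool can resolve. A minimal sketch of that extraction follows. It reuses the `ID` pattern and the trailing-punctuation cleanup that appear in the CLI source later in this diff; the `extractIdentifier` wrapper and the sample strings are illustrative assumptions, not part of the package's API.

```javascript
// Pattern copied from the CLI's lint core in this diff: arXiv id, DOI, or URL.
const ID = /(arxiv:\s*\d{4}\.\d{4,5}|10\.\d{4,9}\/\S+|https?:\/\/\S+)/i;

// Hypothetical helper (not exported by the package): returns the first
// identifier in a finding, with whitespace and trailing punctuation removed,
// or null when the finding carries none.
function extractIdentifier(finding) {
  const m = finding.match(ID);
  return m ? m[0].replace(/\s+/g, '').replace(/[).,;]+$/, '') : null;
}

// Made-up sample findings in the Step 3 format.
const samples = [
  '1. **LLMs struggle to self-correct.** Huang et al. 2023 (arXiv:2310.01798). Implication: external check.',
  '2. **Citations get fabricated.** Walters & Wilder 2023 (doi:10.1038/s41598-023-41032-5). Implication: retrieve.',
  '3. **Unsourced claim.** Nobody 2024. Implication: none.',
];
for (const s of samples) console.log(extractIdentifier(s));
```

The cleanup step matters for DOIs: the `\S+` in the pattern also swallows the closing `).`, so the trailing punctuation has to be trimmed before the identifier is handed to a resolver.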
package/SECURITY.md CHANGED
@@ -1,15 +1,15 @@
1
1
  # Security Policy
2
2
 
3
- `study-swarm` is a **documentation repository**: it contains the study-swarm methodology (Markdown) and a logo asset. It ships no executable code, no compiled artifacts, and installs nothing from this repository. (The npm name `@dogfood-lab/study-swarm` is a reserved placeholder; this repo is the methodology source, not the package.)
3
+ `study-swarm` is the study-swarm methodology (Markdown) plus a **thin, zero-dependency command-line tool**, published as the npm package `@dogfood-lab/study-swarm`. The CLI ships in the package (`bin/study-swarm.mjs`), so installing it exposes a `study-swarm` executable. It has **no runtime dependencies** and makes **no network or model calls**; the model-based verification the methodology describes (Step 4) is run by separate tools, not by this package.
4
4
 
5
5
  ## Threat model
6
6
 
7
- - **What it touches:** nothing at runtime. There is no program to run; reading the docs executes no code.
8
- - **What it does NOT touch:** your filesystem, network, credentials, or environment.
9
- - **Telemetry:** none. **Secrets/credentials:** none in source.
10
- - **Permissions required:** none.
7
+ - **What it runs:** a small Node CLI (Node >= 18). `protocol`, `version`, and `help` only print text. `lint <path…>` **reads** what you name: a file, the `*.dispatch.md` files under a directory, or stdin via `-`. `new <slug>` **writes** exactly one file, `<slug>.dispatch.md`, in the current working directory, and refuses to overwrite an existing file. The slug is sanitized to a single filename (path separators are replaced with `-`, and pure-dots slugs are rejected), so `new` cannot write outside the current directory.
8
+ - **What it does NOT do:** no network access, no model calls, no telemetry, no filesystem access beyond the two cases above, no use of credentials or environment beyond what Node needs to run.
9
+ - **Secrets/credentials:** none in source or output.
10
+ - **Permissions required:** filesystem read for `lint`; one-file write (in the working directory) for `new`. Nothing else.
11
11
 
12
- The methodology *describes* a workflow that uses web retrieval and model-based verification, but this repository does not implement or execute that workflow.
12
+ The methodology *describes* a workflow that uses web retrieval and model-based verification; those are performed by the sibling tools ([prism-verify](https://github.com/mcp-tool-shop-org/prism-verify), [role-os](https://github.com/mcp-tool-shop-org/role-os)), not by this package.
13
13
 
14
14
  ## Supported versions
15
15
 
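The threat model's claim that `new` cannot write outside the current directory rests on how the slug is reduced to a single filename. The sketch below mirrors the two-step reduction in `cmdNew` from the CLI source in this diff. The function name `sanitizeSlug` and the `null` return are assumptions made for illustration; the CLI inlines this logic and exits with code 2 on a rejected slug.

```javascript
// Hypothetical wrapper around the slug logic shown in cmdNew in this diff.
function sanitizeSlug(slug) {
  // Drop any trailing ".dispatch.md", even if repeated.
  const stem = String(slug).replace(/(\.dispatch\.md)+$/i, '');
  // Anything that is not a word character, dot, or hyphen becomes '-',
  // which includes both '/' and '\' path separators.
  const safe = stem.replace(/[^\w.\-]/g, '-');
  // Empty or pure-dots results ('.', '..') are rejected.
  if (!safe || /^\.+$/.test(safe)) return null;
  return `${safe}.dispatch.md`;
}

console.log(sanitizeSlug('my-decision'));
console.log(sanitizeSlug('../../etc/passwd'));
console.log(sanitizeSlug('a.dispatch.md.dispatch.md'));
console.log(sanitizeSlug('..'));
```

Because every separator collapses to `-`, a traversal attempt such as `../../etc/passwd` ends up as one odd-looking filename in the working directory.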
package/bin/study-swarm.mjs CHANGED
@@ -1,13 +1,15 @@
1
1
  #!/usr/bin/env node
2
2
  // study-swarm — thin CLI for the research-grounded design protocol.
3
3
  // Zero runtime dependencies. Commands: protocol | new | lint | help | version.
4
- import { readFileSync, writeFileSync, existsSync } from 'node:fs';
4
+ import { readFileSync, writeFileSync, existsSync, statSync, readdirSync } from 'node:fs';
5
5
  import { fileURLToPath } from 'node:url';
6
- import { dirname, resolve } from 'node:path';
6
+ import { dirname, resolve, join } from 'node:path';
7
+ import { createHash } from 'node:crypto';
7
8
 
8
9
  const __dirname = dirname(fileURLToPath(import.meta.url));
9
10
  const PKG = JSON.parse(readFileSync(resolve(__dirname, '../package.json'), 'utf8'));
10
11
  const VERSION = PKG.version;
12
+ const PROTOCOL_PATH = resolve(__dirname, '../PROTOCOL.md');
11
13
 
12
14
  const HELP = `study-swarm v${VERSION} — ground design decisions in cited research, then verify.
13
15
 
@@ -15,17 +17,24 @@ USAGE
15
17
  study-swarm <command> [args]
16
18
 
17
19
  COMMANDS
18
- protocol Print the locked protocol (the five steps + halt rules).
19
- new <slug> Scaffold a dispatch file <slug>.dispatch.md to fill in.
20
- lint <file> Check a dispatch's citations against the sourcing standard.
21
- help Show this help.
22
- version Print the version.
20
+ protocol Print the locked protocol (the five steps + halt rules).
21
+ new <slug> Scaffold a dispatch file <slug>.dispatch.md to fill in.
22
+ lint [--json] <path...> Check dispatches' citations against the sourcing standard.
23
+ A <path> may be a file, a directory (linted recursively for
24
+ *.dispatch.md), or "-" to read one dispatch from stdin.
25
+ help Show this help.
26
+ version Print the version.
23
27
 
24
28
  EXIT CODES
25
29
  0 ok / lint clean
26
30
  1 lint found sourcing violations
27
31
  2 usage or runtime error
28
32
 
33
+ NOTE
34
+ lint checks citation FORM (Step 3: author + year + a resolvable arXiv/DOI/URL,
35
+ no "studies show…" gestures) — it does not judge whether a source is legitimate
36
+ or actually supports the claim. That is Step 4, below.
37
+
29
38
  Run a dispatch's model-based verification with: roleos verify-citations <file>
30
39
  Docs: https://dogfood-lab.github.io/study-swarm/
31
40
  `;
@@ -35,19 +44,29 @@ function fail(code, msg) {
35
44
  process.exit(code);
36
45
  }
37
46
 
47
+ // Short hash of the vendored PROTOCOL.md, so a scaffolded dispatch records the exact
48
+ // methodology version it was authored against (the package vendors PROTOCOL.md for this).
49
+ function protocolHash() {
50
+ try { return createHash('sha256').update(readFileSync(PROTOCOL_PATH)).digest('hex').slice(0, 16); }
51
+ catch { return 'unknown'; }
52
+ }
53
+
38
54
  function cmdProtocol() {
39
- const p = resolve(__dirname, '../PROTOCOL.md');
40
- if (!existsSync(p)) fail(2, 'PROTOCOL.md not found in package');
41
- process.stdout.write(readFileSync(p, 'utf8'));
55
+ if (!existsSync(PROTOCOL_PATH)) fail(2, 'PROTOCOL.md not found in package');
56
+ try { process.stdout.write(readFileSync(PROTOCOL_PATH, 'utf8')); }
57
+ catch (err) { fail(2, `cannot read PROTOCOL.md in package: ${err && err.code ? err.code : err.message}`); }
42
58
  }
43
59
 
44
- const template = (slug) => `# Study-swarm dispatch: ${slug}
60
+ const template = (slug, stamp) => `<!-- ${stamp} -->
61
+ # Study-swarm dispatch: ${slug}
45
62
 
46
63
  > Fill in each section. Verify citations (Step 4) BEFORE connecting findings to the design (Step 5).
47
64
  > Lint the sourcing with: study-swarm lint ${slug}.dispatch.md
48
65
 
49
66
  ## Step 1 — Load-bearing questions
50
- <!-- 3-5 questions where empirical evidence would change the answer. Fewer is fine if the decision is substantial. -->
67
+ <!-- 3-5 questions where empirical evidence would change the answer. Fewer is fine if the decision is substantial.
68
+ A question is load-bearing if you can picture two designs hinging on the answer and the honest current
69
+ answer is "I think…", not "evidence says…". Don't manufacture questions to hit a count. -->
51
70
  1.
52
71
  2.
53
72
  3.
@@ -57,7 +76,8 @@ const template = (slug) => `# Study-swarm dispatch: ${slug}
57
76
 
58
77
  ## Step 3 — Research grounding
59
78
  <!-- One entry per finding (this is what 'lint' checks):
60
- N. **<finding>.** <Authors> <year> (<arXiv:NNNN.NNNNN | DOI>). <design implication>. -->
79
+ N. **<finding>.** <Authors> <year> (<arXiv:NNNN.NNNNN | DOI>). <design implication>.
80
+ e.g.: 1. **Contrastive explanations with a predicted human foil improve independent decisions.** Buçinca et al. 2024 (arXiv:2410.04253). Implication: every recommendation carries a "you might think X; I chose Y because…" frame. -->
61
81
  1. **<finding>.** <Authors> <year> (arXiv:____.____). <implication>.
62
82
 
63
83
  ## Step 4 — External verification
@@ -73,60 +93,168 @@ const template = (slug) => `# Study-swarm dispatch: ${slug}
73
93
 
74
94
  function cmdNew(slug) {
75
95
  if (!slug) fail(2, 'usage: study-swarm new <slug>');
76
- const safe = String(slug).replace(/\.dispatch\.md$/i, '').replace(/[^\w.\-/]/g, '-');
96
+ // Reduce the slug to a single safe filename: strip any trailing .dispatch.md (even if
97
+ // repeated), then collapse anything that isn't a word char, dot, or hyphen to '-'. Path
98
+ // separators ('/' and '\') are NOT permitted — `new` writes ONE file in the current
99
+ // directory and must never traverse out of it. A pure-dots slug ('.', '..') is rejected.
100
+ const stem = String(slug).replace(/(\.dispatch\.md)+$/i, '');
101
+ const safe = stem.replace(/[^\w.\-]/g, '-');
102
+ if (!safe || /^\.+$/.test(safe)) {
103
+ fail(2, `invalid slug "${slug}" — use letters, digits, '.', or '-' (the file stays in the current directory)`);
104
+ }
77
105
  const out = `${safe}.dispatch.md`;
78
106
  if (existsSync(out)) fail(2, `refusing to overwrite existing ${out}`);
79
- writeFileSync(out, template(safe), 'utf8');
80
- process.stdout.write(`Created ${out}\nFill it in, then: study-swarm lint ${out}\n`);
107
+ // Provenance stamp: pins the methodology version a dispatch was authored against.
108
+ const stamp = `study-swarm v${VERSION} · protocol-sha256:${protocolHash()} · created:${new Date().toISOString().slice(0, 10)}`;
109
+ writeFileSync(out, template(safe, stamp), 'utf8');
110
+ const note = safe === stem ? '' : ` (slug sanitized to "${safe}")`;
111
+ process.stdout.write(`Created ${out}${note}\nFill it in, then: study-swarm lint ${out}\n`);
81
112
  }
82
113
 
83
- function cmdLint(file) {
84
- if (!file) fail(2, 'usage: study-swarm lint <file>');
85
- if (!existsSync(file)) fail(2, `file not found: ${file}`);
86
- const lines = readFileSync(file, 'utf8').split(/\r?\n/);
114
+ // --- lint core ------------------------------------------------------------
87
115
 
88
- const start = lines.findIndex((l) => /^#{1,6}\s.*research grounding/i.test(l));
89
- if (start === -1) fail(1, 'no "Research grounding" section found — every dispatch needs one (Step 3).');
116
+ const YEAR = /\b(19|20)\d{2}\b/;
117
+ const ID = /(arxiv:\s*\d{4}\.\d{4,5}|10\.\d{4,9}\/\S+|https?:\/\/\S+)/i;
118
+ const PLACEHOLDER = /arXiv:_{2,}|<finding>|<authors>|<year>|<implication>/i;
119
+ const BANNED = /\b(studies show|research suggests|it'?s well[- ]established|well[- ]established that)\b/i;
120
+ // An author cite: a capitalized name (Unicode-aware, so "Buçinca" counts), optionally
121
+ // followed by "et al.", "&", "and", or further surnames, immediately before the year.
122
+ // Accepts "Huang et al. 2023", "Walters & Wilder 2023", "Panickssery, Bowman & Feng 2024";
123
+ // flags an author-less finding like "**Foo.** 2024 (arXiv:…)".
124
+ const AUTHOR = /\p{Lu}[\p{L}.'’-]+(?:\s*,?\s*(?:&|and|et al\.?|\p{Lu}[\p{L}.'’-]+))*\s+\(?(?:19|20)\d{2}/u;
125
+
126
+ // Check one dispatch's text. Returns a structured result; never exits.
127
+ function lintText(label, raw) {
128
+ const lines = raw.split(/\r?\n/);
129
+ const problems = []; // { finding, line, rule, message }
130
+ const add = (rule, message, line = null, finding = null) => problems.push({ finding, line, rule, message });
131
+
132
+ // Find the "Research grounding" heading whose TEXT ends with that phrase (last wins), so a
133
+ // title that merely mentions "research grounding" above the real section can't shadow it.
134
+ let start = -1;
135
+ for (let i = 0; i < lines.length; i++) {
136
+ const h = lines[i].match(/^#{1,6}\s+(.*?)\s*$/);
137
+ if (h && /research grounding$/i.test(h[1])) start = i;
138
+ }
139
+ if (start === -1) {
140
+ add('no-section', 'no "Research grounding" section found — every dispatch needs one (Step 3).');
141
+ return { file: label, ok: false, findingCount: 0, problems, findings: [] };
142
+ }
90
143
  let end = lines.length;
91
144
  for (let i = start + 1; i < lines.length; i++) {
92
145
  if (/^#{1,6}\s/.test(lines[i])) { end = i; break; }
93
146
  }
94
147
  const section = lines.slice(start + 1, end);
95
148
 
96
- const YEAR = /\b(19|20)\d{2}\b/;
97
- const ID = /(arxiv:\s*\d{4}\.\d{4,5}|10\.\d{4,9}\/\S+|https?:\/\/\S+)/i;
98
- const PLACEHOLDER = /arXiv:_{2,}|<finding>|<authors>|<year>|<implication>/i;
99
- const BANNED = /\b(studies show|research suggests|it'?s well[- ]established|well[- ]established that)\b/i;
100
-
101
- // Each numbered list item (with its continuation lines) is one finding.
102
- const findings = [];
149
+ // Split into findings (numbered items + continuation lines), ignoring fenced code blocks
150
+ // so a "1." inside a ``` example isn't mistaken for a finding. Track each finding's line.
151
+ const findings = []; // { text, line }
103
152
  let cur = null;
104
- for (const l of section) {
105
- if (/^\s*\d+\.\s/.test(l)) { if (cur !== null) findings.push(cur); cur = l; }
106
- else if (cur !== null && l.trim()) cur += ' ' + l.trim();
107
- }
108
- if (cur !== null) findings.push(cur);
153
+ let inFence = false;
154
+ section.forEach((l, idx) => {
155
+ if (/^\s*(```|~~~)/.test(l)) { inFence = !inFence; return; }
156
+ if (inFence) return;
157
+ if (/^\s*\d+\.\s/.test(l)) { if (cur) findings.push(cur); cur = { text: l, line: start + 2 + idx }; }
158
+ else if (cur && l.trim()) cur.text += ' ' + l.trim();
159
+ });
160
+ if (cur) findings.push(cur);
161
+
162
+ if (findings.length === 0) add('no-findings', 'Research grounding has no numbered findings.');
109
163
 
110
- const problems = [];
111
- if (findings.length === 0) problems.push('Research grounding has no numbered findings.');
164
+ const parsed = [];
112
165
  findings.forEach((f, i) => {
113
166
  const n = i + 1;
114
- if (PLACEHOLDER.test(f)) problems.push(`finding ${n}: still has template placeholders — fill it in.`);
115
- if (!YEAR.test(f)) problems.push(`finding ${n}: missing a year.`);
116
- if (!ID.test(f)) problems.push(`finding ${n}: missing an identifier (arXiv:NNNN.NNNNN, DOI, or URL).`);
167
+ if (PLACEHOLDER.test(f.text)) add('placeholder', `finding ${n}: still has template placeholders — fill it in.`, f.line, n);
168
+ // Strip identifiers before the year check so an arXiv id's YYMM prefix
169
+ // (e.g. 2006 in arXiv:2006.14779, which YEAR would match) can't masquerade as a publication year.
170
+ const fNoIds = f.text.replace(/arxiv:\s*\d{4}\.\d{4,5}/gi, '').replace(/10\.\d{4,9}\/\S+/g, '');
171
+ if (!YEAR.test(fNoIds)) add('missing-year', `finding ${n}: missing a year (spell it out, e.g. "2024" — an arXiv id alone is not a year).`, f.line, n);
172
+ if (!AUTHOR.test(f.text)) add('missing-author', `finding ${n}: missing an author before the year (e.g. "Huang et al. 2023").`, f.line, n);
173
+ const idm = f.text.match(ID);
174
+ if (!idm) add('missing-id', `finding ${n}: missing an identifier (arXiv:NNNN.NNNNN, DOI, or URL).`, f.line, n);
175
+ const ym = fNoIds.match(YEAR);
176
+ const ident = idm ? idm[0].replace(/\s+/g, '').replace(/[).,;]+$/, '') : null;
177
+ parsed.push({ finding: n, year: ym ? ym[0] : null, identifier: ident });
117
178
  });
118
- lines.forEach((l, i) => {
119
- if (BANNED.test(l) && !ID.test(l)) {
120
- problems.push(`line ${i + 1}: name the study (author + year + identifier), don't gesture: "${l.trim().slice(0, 56)}"`);
179
+
180
+ // Banned gesture anywhere in the section (outside fences): a finding STATES its result,
181
+ // never a "studies show…" gesture; a co-located citation doesn't redeem it.
182
+ let fence = false;
183
+ section.forEach((l, idx) => {
184
+ if (/^\s*(```|~~~)/.test(l)) { fence = !fence; return; }
185
+ if (!fence && BANNED.test(l)) {
186
+ add('banned-gesture', `line ${start + 2 + idx}: name the study (author + year + identifier), don't gesture: "${l.trim().slice(0, 56)}"`, start + 2 + idx);
121
187
  }
122
188
  });
123
189
 
124
- if (problems.length) {
125
- process.stderr.write(`x ${file}: ${problems.length} sourcing issue(s)\n`);
126
- for (const p of problems) process.stderr.write(` - ${p}\n`);
127
- process.exit(1);
190
+ return { file: label, ok: problems.length === 0, findingCount: findings.length, problems, findings: parsed };
191
+ }
192
+
193
+ // Recursively collect *.dispatch.md files under a directory (skips node_modules/.git).
194
+ function walkDispatches(dir) {
195
+ const out = [];
196
+ for (const entry of readdirSync(dir, { withFileTypes: true })) {
197
+ if (entry.name === 'node_modules' || entry.name === '.git') continue;
198
+ const full = join(dir, entry.name);
199
+ if (entry.isDirectory()) out.push(...walkDispatches(full));
200
+ else if (/\.dispatch\.md$/i.test(entry.name)) out.push(full);
201
+ }
202
+ return out.sort();
203
+ }
204
+
205
+ function readTarget(p) {
206
+ try { return { label: p, raw: readFileSync(p, 'utf8') }; }
207
+ catch (err) { fail(2, `cannot read ${p}: ${err && err.code ? err.code : err.message}`); }
208
+ }
209
+
210
+ function cmdLint(args) {
211
+ const json = args.includes('--json');
212
+ const paths = args.filter((a) => a !== '--json');
213
+ if (paths.length === 0) fail(2, 'usage: study-swarm lint [--json] <file|dir|-> [more...]');
214
+
215
+ const targets = [];
216
+ for (const p of paths) {
217
+ if (p === '-') {
218
+ let raw;
219
+ try { raw = readFileSync(0, 'utf8'); }
220
+ catch (err) { fail(2, `cannot read stdin: ${err && err.code ? err.code : err.message}`); }
221
+ targets.push({ label: '<stdin>', raw });
222
+ continue;
223
+ }
224
+ if (!existsSync(p)) fail(2, `path not found: ${p}`);
225
+ if (statSync(p).isDirectory()) {
226
+ const files = walkDispatches(p);
227
+ if (files.length === 0) fail(2, `no .dispatch.md files found under ${p}`);
228
+ for (const f of files) targets.push(readTarget(f));
229
+ } else {
230
+ targets.push(readTarget(p));
231
+ }
232
+ }
233
+
234
+ const results = targets.map((t) => lintText(t.label, t.raw));
235
+ const anyFail = results.some((r) => !r.ok);
236
+
237
+ if (json) {
238
+ const payload = results.length === 1 ? results[0] : { ok: !anyFail, files: results };
239
+ process.stdout.write(JSON.stringify(payload) + '\n');
240
+ process.exit(anyFail ? 1 : 0);
241
+ }
242
+
243
+ for (const r of results) {
244
+ if (r.ok) {
245
+ process.stdout.write(`ok ${r.file}: ${r.findingCount} finding(s), all sourced.\n`);
246
+ } else {
247
+ process.stderr.write(`x ${r.file}: ${r.problems.length} sourcing issue(s)\n`);
248
+ for (const pr of r.problems) process.stderr.write(` - ${pr.message}\n`);
249
+ }
250
+ }
251
+ if (!anyFail) {
252
+ process.stdout.write(
253
+ `\nStep 3 (sourcing FORM) is satisfied — this does NOT confirm the citations exist or support the claim.\n` +
254
+ `Run Step 4 (existence + groundedness, a different model family): roleos verify-citations <file>\n`,
255
+ );
128
256
  }
129
- process.stdout.write(`ok ${file}: ${findings.length} finding(s), all sourced.\n`);
257
+ process.exit(anyFail ? 1 : 0);
130
258
  }
131
259
 
132
260
  function main(argv) {
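Two of the lint checks above are easier to follow with concrete inputs: the year check runs on text with identifiers removed, and the author check requires a capitalized name directly before the year. The sketch below copies the `YEAR` and `AUTHOR` patterns from the lint core in this diff. The `stripIds` helper name and the sample strings are assumptions introduced for illustration.

```javascript
// Patterns copied from the lint core in this diff.
const YEAR = /\b(19|20)\d{2}\b/;
const AUTHOR = /\p{Lu}[\p{L}.'’-]+(?:\s*,?\s*(?:&|and|et al\.?|\p{Lu}[\p{L}.'’-]+))*\s+\(?(?:19|20)\d{2}/u;

// Hypothetical helper: remove arXiv ids and DOIs so digits inside an
// identifier cannot satisfy the year check.
function stripIds(text) {
  return text.replace(/arxiv:\s*\d{4}\.\d{4,5}/gi, '').replace(/10\.\d{4,9}\/\S+/g, '');
}

// A finding with no publication year, only an arXiv id that starts with "2006".
const noYear = '1. **Explanations can raise over-reliance.** Bansal et al. (arXiv:2006.14779). Implication: show failures.';
console.log(YEAR.test(noYear));           // true: the id prefix looks like a year
console.log(YEAR.test(stripIds(noYear))); // false: nothing year-like remains

console.log(AUTHOR.test('Panickssery, Bowman & Feng 2024'));  // true
console.log(AUTHOR.test('**Foo.** 2024 (arXiv:2401.00001)')); // false: no author before the year
```

Only arXiv ids whose first four digits begin with 19 or 20 can trip the unstripped year check, which is why the sample uses a 2020-era id.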
@@ -134,7 +262,7 @@ function main(argv) {
134
262
  switch (cmd) {
135
263
  case 'protocol': return cmdProtocol();
136
264
  case 'new': return cmdNew(rest[0]);
137
- case 'lint': return cmdLint(rest[0]);
265
+ case 'lint': return cmdLint(rest);
138
266
  case 'version': case '--version': case '-v':
139
267
  return void process.stdout.write(VERSION + '\n');
140
268
  case 'help': case '--help': case '-h': case undefined:
package/examples/study-swarm-ci.yml ADDED
@@ -0,0 +1,28 @@
1
+ # Copy this into YOUR repo at .github/workflows/dispatches.yml to gate the sourcing
2
+ # of every study-swarm dispatch on each pull request. It is a SAMPLE — it is not an
3
+ # active workflow in the study-swarm repo itself.
4
+ name: study-swarm lint
5
+
6
+ on:
7
+   pull_request:
8
+     paths:
9
+       - '**/*.dispatch.md'
10
+       - '.github/workflows/dispatches.yml'
11
+   workflow_dispatch:
12
+
13
+ concurrency:
14
+   group: ${{ github.workflow }}-${{ github.ref }}
15
+   cancel-in-progress: true
16
+
17
+ jobs:
18
+   lint:
19
+     runs-on: ubuntu-latest
20
+     timeout-minutes: 5
21
+     steps:
22
+       - uses: actions/checkout@v4
23
+       - uses: actions/setup-node@v4
24
+         with:
25
+           node-version: '20'
26
+       # Lint every dispatch under dispatches/ (a file, a dir, or '-' for stdin all work).
27
+       # Exit 1 on any sourcing violation fails the check. Add --json for machine-readable output.
28
+       - run: npx @dogfood-lab/study-swarm@latest lint dispatches/
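The workflow comment mentions `--json` for machine-readable output. As a rough sketch of how a CI script might consume that report: per the CLI source in this diff, one lint target yields a single result object, while several targets yield `{ ok, files: [...] }`. The helper names `toResults` and `countByRule` and the sample payload are assumptions for illustration, not something the package provides.

```javascript
// Hypothetical consumer of `study-swarm lint --json` output.
// Normalizes the single-result and multi-file shapes to an array of results.
function toResults(payload) {
  return Array.isArray(payload.files) ? payload.files : [payload];
}

// Count violations per rule id (e.g. 'missing-year', 'missing-id').
function countByRule(payload) {
  const counts = {};
  for (const r of toResults(payload)) {
    for (const p of r.problems) counts[p.rule] = (counts[p.rule] || 0) + 1;
  }
  return counts;
}

// Sample payload shaped like the multi-file report; the values are made up.
const sample = {
  ok: false,
  files: [
    { file: 'a.dispatch.md', ok: true, findingCount: 2, problems: [], findings: [] },
    { file: 'b.dispatch.md', ok: false, findingCount: 1, findings: [],
      problems: [
        { finding: 1, line: 12, rule: 'missing-year', message: 'finding 1: missing a year.' },
        { finding: 1, line: 12, rule: 'missing-id', message: 'finding 1: missing an identifier.' },
      ] },
  ],
};
console.log(countByRule(sample));
```

Normalizing first keeps the consumer correct when a directory happens to contain exactly one dispatch, since the report then arrives as a bare result object.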
package/examples/study-swarm-self.dispatch.md ADDED
@@ -0,0 +1,46 @@
+ <!-- study-swarm vX.Y.Z · protocol-sha256:<vendored> · a worked, lint-clean reference dispatch -->
+ # Study-swarm dispatch: study-swarm-self
+
+ > A complete, **lint-clean** example dispatch — study-swarm applied to its own
+ > central design decision. Run `study-swarm lint examples/study-swarm-self.dispatch.md`
+ > (it passes), then read it as a model for what a filled-in dispatch looks like end to end.
+
+ ## Step 1 — Load-bearing questions
+
+ <!-- Each is load-bearing: two real designs hinge on the answer, and the honest prior is "I think", not "evidence says". -->
+
+ 1. When an LLM makes a substantial design call, can the *same* model reliably verify its own citations, or does the verifier have to be a separate model?
+ 2. Is confirming a cited paper *exists* enough, or must "the source supports this claim" be checked as a separate axis?
+ 3. Does adding *more* verifiers improve coverage, or does the diversity of the verifiers matter more than their count?
+
+ ## Step 2 — Research dispatch
+
+ <!-- One research agent per question, in parallel; each returned paper titles + authors + years + URLs + a one-sentence finding, web-retrieval required (no recall-only citations). -->
+
+ Three parallel agents, scoped to empirical evidence (not opinion), word-capped, "specificity over breadth — 6–8 well-sourced findings beat 20 vague gestures." Their citations (below) were then resolved against arXiv/Crossref before any informed the design.
+
+ ## Step 3 — Research grounding
+
+ 1. **LLMs struggle to self-correct without external feedback, and can degrade after self-correction.** Huang et al. 2023 (arXiv:2310.01798). Implication: the verifier cannot be the generator itself — an external check is required (answers Q1).
+ 2. **Autoregressive LLMs cannot self-verify; pair the generator with an external model-based verifier.** Kambhampati et al. 2024 (arXiv:2402.01817). Implication: the architecture is generator + separate verifier, not self-critique (answers Q1).
+ 3. **An LLM judge's self-recognition correlates *linearly* with its self-preference bias.** Panickssery, Bowman & Feng 2024 (arXiv:2404.13076). Implication: the verifier must be a *different model family*, since partial blinding of a same-family judge does not remove the bias (answers Q1).
+ 4. **18–55% of LLM-generated citations are fabricated, and many real ones carry bibliographic errors.** Walters & Wilder 2023 (doi:10.1038/s41598-023-41032-5). Implication: existence must be established by *retrieval* (resolve the arXiv/DOI), never by the model's recall (answers Q2).
+ 5. **Cited links resolve >94% of the time, yet only 39–77% of the content actually supports the claim.** Onweller et al. 2026 (arXiv:2605.06635). Implication: groundedness is a distinct axis from existence — "the link resolves" is not "the paper says this" (answers Q2).
+ 6. **Decorrelated verifiers (pairwise ρ ∈ [0.05, 0.25]) beat any single one via submodular coverage.** Rajan 2025 (arXiv:2511.16708). Implication: spend the budget on *lens diversity* (a retrieval oracle + ≥2 different families), not on more copies of one judge (answers Q3).
+
+ ## Step 4 — External verification
+
+ <!-- This dispatch's own citations were gated this way before Step 5 was written. -->
+
+ - [x] every citation resolved by retrieval (arXiv/DOI), not model memory — arXiv API + OpenAlex + Crossref
+ - [x] every finding matches what its source actually claims (groundedness) — checked against each abstract
+ - [x] >= 3 decorrelated lenses (retrieval oracle + >= 2 different model families) — oracle + Mistral + IBM Granite, reasoning-stripped
+
+ Result: all six citations VERIFIED (existence + attribution + groundedness). Two blind traps seeded into a sibling set — a misattribution and a fabricated paper — were caught by the *union* of the two families, not either alone.
+
+ ## Step 5 — Architecture
+
+ - The verifier is a **different model family** from the synthesizer, run reasoning-stripped. (findings 1, 2, 3)
+ - Verification is **two-stage per citation**: a retrieval oracle confirms existence, then a groundedness lens confirms the source supports the claim. (findings 4, 5)
+ - The verifier is an **ensemble of decorrelated lenses** (retrieval oracle + ≥2 different families), because diversity — not count — drives coverage. (finding 6)
+ - On a non-clean verdict the finding **halts** (fabricated → dropped; misattributed → corrected once; unavailable → escalate), never silently proceeds. (findings 1, 4)
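The halting rule in the architecture list above can be sketched as a small decision function. This is an illustration under stated assumptions, not the package's implementation. The verdict labels, the function names `combineVerdicts` and `nextAction`, the precedence among non-clean verdicts, and the choice to escalate after a second misattribution are all hypothetical; the dispatch only states the three outcomes (dropped, corrected once, escalate).

```javascript
// Illustrative only: none of these names come from the study-swarm package.

// Combine per-lens verdicts for one citation. A problem flagged by any lens
// outranks passes from the others (the "union" of catches), and an
// unavailable lens is never treated as a pass.
// Assumption: precedence is fabricated > misattributed > unavailable.
function combineVerdicts(lensVerdicts) {
  if (lensVerdicts.length === 0) return 'unavailable';
  for (const bad of ['fabricated', 'misattributed', 'unavailable']) {
    if (lensVerdicts.includes(bad)) return bad;
  }
  return 'verified';
}

// Map a combined verdict to the action named in the architecture list.
// Assumption: a finding that is still misattributed after its one
// correction is escalated rather than corrected again.
function nextAction(verdict, alreadyCorrected = false) {
  switch (verdict) {
    case 'verified': return 'proceed';
    case 'fabricated': return 'drop';
    case 'misattributed':
      return alreadyCorrected ? 'escalate' : 'correct-and-reverify';
    case 'unavailable': return 'escalate';
    default: throw new Error(`unknown verdict: ${verdict}`);
  }
}

// One lens passes the citation, the other catches a misattribution.
console.log(nextAction(combineVerdicts(['verified', 'misattributed'])));
// prints "correct-and-reverify"
```

Taking the union of catches mirrors the Step 4 result, where the seeded traps were caught by the two families together rather than by either alone.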
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@dogfood-lab/study-swarm",
-   "version": "0.6.0",
+   "version": "1.0.0",
    "description": "Ground design decisions in cited research, then verify every citation with a different model family before it becomes canon — a research-grounded design protocol, with a thin CLI.",
    "keywords": [
      "methodology",
@@ -34,6 +34,7 @@
    },
    "files": [
      "bin/",
+     "examples/",
      "README.md",
      "README.ja.md",
      "README.zh.md",