@dogfood-lab/study-swarm 0.6.0 → 1.0.0
- package/CHANGELOG.md +40 -0
- package/PROTOCOL.md +30 -2
- package/README.es.md +68 -31
- package/README.fr.md +69 -32
- package/README.hi.md +67 -30
- package/README.it.md +71 -34
- package/README.ja.md +73 -36
- package/README.md +41 -4
- package/README.pt-BR.md +73 -36
- package/README.zh.md +74 -37
- package/SECURITY.md +6 -6
- package/bin/study-swarm.mjs +176 -48
- package/examples/study-swarm-ci.yml +28 -0
- package/examples/study-swarm-self.dispatch.md +46 -0
- package/package.json +2 -1
package/README.zh.md
CHANGED
|
@@ -13,77 +13,114 @@
|
|
|
13
13
|
<img src="https://img.shields.io/badge/cited%20research-verified-1f6feb" alt="Cited research, verified">
|
|
14
14
|
</p>
|
|
15
15
|
|
|
16
|
-
|
|
16
|
+
**Ground design decisions in cited research, then verify every citation with a *different* model family before it becomes canon.**
|
|
17
17
|
|
|
18
|
-
`study-swarm
|
|
18
|
+
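`study-swarm` is a protocol, not a tool. When you use an LLM to make a substantial design decision (a new product layer, an architecture choice, or a "should we trust this model" call), improvising from first principles produces stale designs, and citing papers from memory produces designs that rest on sources that do not exist or do not say what you think they say. `study-swarm` replaces both: dispatch parallel research agents, demand specific cited findings, and let a citation inform the design only after it has passed an **external verifier from a different model family**.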
|
|
19
19
|
|
|
20
|
-
|
|
20
|
+
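It also applies to itself. The protocol prescribes verifier-gated safeguards for the systems it helps design, so it runs the same gate on itself. **No model grades its own homework, including the one running the protocol.**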
|
|
21
21
|
|
|
22
|
-
##
|
|
22
|
+
## The protocol in five steps
|
|
23
23
|
|
|
24
|
-
1. **Identify** 3-5
|
|
25
|
-
2.
|
|
26
|
-
3.
|
|
27
|
-
4.
|
|
28
|
-
5.
|
|
24
|
+
1. **Identify** 3-5 load-bearing design questions, ones where empirical evidence could change the answer.
|
|
25
|
+
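2. **Dispatch** one research agent per question, in parallel. Each agent must return paper titles + authors + years + URLs + a one-sentence summary, favoring specificity over breadth ("6-8 well-sourced findings beat 20 vague gestures").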
|
|
26
|
+
3. **Synthesize** the findings into a *Research grounding* section: `N. **<finding>.** <Authors> <year> (<arXiv/DOI>). <design implication>.`
|
|
27
|
+
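4. **Verify externally**: a *different model family*, run reasoning-stripped, checks every citation in two stages. A **retrieval oracle** confirms the paper exists (never the model's memory), then a **groundedness** filter confirms the finding matches its source. **Halt** on a fabricated or misattributed citation; **halt and escalate** if the verifier or the retrieval oracle is unavailable (never treat "could not reach it" as "the citation is fine").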
|
|
28
|
+
5. **Connect** every architecture choice to a numbered finding. A citation with no design implication is noise.
|
|
29
29
|
|
|
30
|
-
|
|
30
|
+
The full executable detail (halt table, sourcing standard, integration rules) lives in **[PROTOCOL.md](PROTOCOL.md)**.
|
|
31
31
|
|
|
32
|
-
##
|
|
32
|
+
## Why a *different* model family, reasoning-stripped?
|
|
33
33
|
|
|
34
|
-
|
|
34
|
+
Because the failure modes are known, not hypothetical:
|
|
35
35
|
|
|
36
|
-
-
|
|
37
|
-
-
|
|
38
|
-
-
|
|
39
|
-
- **Hide the generator's reasoning.** Khalifa
|
|
40
|
-
- **Diversity beats count.** Rajan, 2025
|
|
36
|
+
- **LLMs cannot reliably verify their own outputs.** Huang et al., 2023 ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798)); Kambhampati et al., 2024 ([arXiv:2402.01817](https://arxiv.org/abs/2402.01817), LLM-Modulo); Stechly et al., 2024 ([arXiv:2402.08115](https://arxiv.org/abs/2402.08115)): an external verifier helps; self-critique does not.
|
|
37
|
+
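- **Same-family judges favor themselves.** Panickssery, Bowman and Feng, 2024 ([arXiv:2404.13076](https://arxiv.org/abs/2404.13076)): self-recognition correlates linearly with self-preference, so partial blinding does not help. Verga et al., 2024 ([arXiv:2404.18796](https://arxiv.org/abs/2404.18796), PoLL): a panel of judges spanning different families is less biased, at roughly 1/7 the cost.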
|
|
38
|
+
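- **Citations are where LLMs lie.** Walters and Wilder, 2023 ([doi:10.1038/s41598-023-41032-5](https://doi.org/10.1038/s41598-023-41032-5)): 55% of GPT-3.5's citations and 18% of GPT-4's were fabricated. Onweller et al., 2026 ([arXiv:2605.06635](https://arxiv.org/abs/2605.06635)): links resolve more than 94% of the time, but only 39-77% of the cited content actually supports the claim. Existence must therefore be checked by **retrieval, not recall**.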
|
|
39
|
+
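- **Hide the generator's reasoning.** Khalifa et al., 2026 ([arXiv:2601.14691](https://arxiv.org/abs/2601.14691), on deceiving the judge): manipulating the chain of thought alone can raise a judge's false-positive rate by up to 90% while the actions stay unchanged. Turpin et al., 2023 ([arXiv:2305.04388](https://arxiv.org/abs/2305.04388)): chain of thought is post-hoc rationalization. The verifier sees only the bare citation claim, never the "why I included this".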
|
|
40
|
+
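- **Diversity beats count.** Rajan, 2025 ([arXiv:2511.16708](https://arxiv.org/abs/2511.16708)): four verifiers with pairwise correlation ρ ∈ [0.05, 0.25] beat any single verifier through submodular coverage. Kim et al., 2025 ([arXiv:2506.07962](https://arxiv.org/abs/2506.07962)): LLM errors are *correlated*, so the variable that matters is lens diversity, not raw count.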
|
|
41
41
|
|
|
42
42
|
## Does it actually work? (proof)
|
|
43
43
|
|
|
44
|
-
|
|
44
|
+
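As a test, the protocol was run against its own citations. Two unrelated non-Claude families, **Mistral** (`mistral-small:24b`) and **IBM Granite** (`granite4.1:30b`), checked a set of citations reasoning-stripped, with two blind traps seeded in: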
|
|
45
45
|
|
|
46
|
-
|
|
|
46
|
+
| Seeded trap | Mistral | IBM Granite | Ground truth |
|
|
47
47
|
|---|---|---|---|
|
|
48
|
-
| Chain-of-thought prompting attributed to "Nakamura & Olsen" | missed |
|
|
49
|
-
|
|
|
48
|
+
| Chain-of-thought prompting attributed to "Nakamura & Olsen" | missed | **caught** (misattributed → actually Wei et al., 2022, arXiv:2201.11903) | misattributed |
|
|
49
|
+
| A fabricated "98% of errors eliminated, no oracle needed" paper | **caught** (fabricated) | **caught** (fabricated) | fabricated |
|
|
50
50
|
|
|
51
|
-
|
|
51
|
+
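Neither family caught both traps alone, but their **union caught 2/2.** A single judge would have missed the misattribution. The retrieval oracle also caught two *real* misattributions in our own design documents (wrong authors cited) that no parametric LLM could flag, and it correctly confirmed a genuine 2026 paper that both LLMs had wrongly flagged as fabricated because it was published after their training data. That last point is why the Step 4 existence check **must** use a retrieval oracle, not an LLM.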
|
|
52
52
|
|
|
53
|
-
|
|
53
|
+
This run is the protocol in miniature: **unrelated lenses plus a retrieval oracle for existence beat any one clever judge.**
|
|
54
54
|
|
|
55
|
-
##
|
|
55
|
+
## How it works
|
|
56
56
|
|
|
57
|
-
|
|
57
|
+
You can run the protocol by hand: any different model, plus resolving the arXiv/DOI yourself, satisfies Step 4. Two companion tools make it a single command:
|
|
58
58
|
|
|
59
|
-
- **[prism-verify](https://github.com/mcp-tool-shop-org/prism-verify)**
|
|
60
|
-
- **[role-os](https://github.com/mcp-tool-shop-org/role-os)**
|
|
59
|
+
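- **[prism-verify](https://github.com/mcp-tool-shop-org/prism-verify)**: the runtime verifier, with different-model routing, reasoning-stripped multi-lens arbitration, a deterministic retrieval-based existence check (arXiv → Crossref), and signed receipts.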
|
|
60
|
+
- **[role-os](https://github.com/mcp-tool-shop-org/role-os)**: provides `roleos verify-citations <dispatch>`, which extracts the citations from a dispatch and passes them to prism for verification.
|
|
61
61
|
|
|
62
|
-
|
|
62
|
+
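The wire format is the dispatch itself: findings written as `N. **finding.** Authors year (arXiv|DOI). implication.`, each **carrying a resolvable identifier**, which is exactly what `roleos verify-citations` extracts and verifies. A dispatch that passes `lint` flows through cleanly; if a citation is malformed, the tool flags it as unresolved. `study-swarm lint` checks this contract locally, so Step 3 and Step 4 agree on what counts as a citation.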
|
|
63
|
+
|
|
64
|
+
## Command-line interface (CLI)
|
|
63
65
|
|
|
64
66
|
```bash
|
|
65
67
|
npm i -g @dogfood-lab/study-swarm # or run ad-hoc: npx @dogfood-lab/study-swarm <command>
|
|
66
68
|
```
|
|
67
69
|
|
|
68
|
-
| Command |
|
|
70
|
+
| Command | What it does |
|
|
69
71
|
|---|---|
|
|
70
|
-
| `study-swarm protocol` |
|
|
71
|
-
| `study-swarm new <slug>` |
|
|
72
|
-
| `study-swarm lint <
|
|
72
|
+
| `study-swarm protocol` | Prints the full protocol: the five steps, the halt table, and the sourcing standard. |
|
|
73
|
+
| `study-swarm new <slug>` | Creates a `<slug>.dispatch.md` file containing a five-step scaffold to fill in. |
|
|
74
|
+
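| `study-swarm lint [--json] <path…>` | Checks that a dispatch's *Research grounding* meets the sourcing standard: every finding needs an author, a year, and a resolvable identifier (arXiv / DOI / URL); "studies show…" gestures are rejected. Exits `1` on any violation, which blocks CI. `<path>` may be a file, a directory (linted recursively for every `*.dispatch.md`), or `-` for stdin; `--json` emits a machine-readable report. |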
|
|
75
|
+
|
|
76
|
+
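`lint` is deterministic (it calls no model), so it is safe to run in CI. It enforces the **Step 3 sourcing standard** locally; the model-based **Step 4** verification still goes through [`roleos verify-citations`](https://github.com/mcp-tool-shop-org/role-os) → prism.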
|
|
73
77
|
|
|
74
|
-
|
|
78
|
+
A typical flow:
|
|
79
|
+
|
|
80
|
+
```bash
|
|
81
|
+
study-swarm new my-decision # creates my-decision.dispatch.md
|
|
82
|
+
# …fill in the questions, run the research dispatch, write the findings…
|
|
83
|
+
study-swarm lint my-decision.dispatch.md # enforce the sourcing standard (Step 3)
|
|
84
|
+
roleos verify-citations my-decision.dispatch.md # model-based Step 4 (different family, via prism)
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
A complete, `lint`-clean dispatch (study-swarm applied to its own design) ships in [`examples/study-swarm-self.dispatch.md`](examples/study-swarm-self.dispatch.md) as a reference example.
|
|
88
|
+
|
|
89
|
+
### Verifying in CI
|
|
90
|
+
|
|
91
|
+
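`lint` accepts a file, a directory (linted recursively for every `*.dispatch.md`), or `-` for stdin, and `--json` emits a machine-readable report. Add it to your repository to check the sourcing of every dispatch on each PR (a copy-paste sample also ships in [`examples/study-swarm-ci.yml`](examples/study-swarm-ci.yml)):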
|
|
92
|
+
|
|
93
|
+
```yaml
|
|
94
|
+
# .github/workflows/dispatches.yml
|
|
95
|
+
name: study-swarm lint
|
|
96
|
+
on:
|
|
97
|
+
pull_request:
|
|
98
|
+
paths: ['**/*.dispatch.md', '.github/workflows/dispatches.yml']
|
|
99
|
+
workflow_dispatch:
|
|
100
|
+
concurrency:
|
|
101
|
+
group: ${{ github.workflow }}-${{ github.ref }}
|
|
102
|
+
cancel-in-progress: true
|
|
103
|
+
jobs:
|
|
104
|
+
lint:
|
|
105
|
+
runs-on: ubuntu-latest
|
|
106
|
+
steps:
|
|
107
|
+
- uses: actions/checkout@v4
|
|
108
|
+
- uses: actions/setup-node@v4
|
|
109
|
+
with: { node-version: '20' }
|
|
110
|
+
- run: npx @dogfood-lab/study-swarm@latest lint dispatches/
|
|
111
|
+
```
|
|
75
112
|
|
|
76
|
-
##
|
|
113
|
+
## Why it works, in one line
|
|
77
114
|
|
|
78
|
-
|
|
115
|
+
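**Current**: the field moves fast; demanding specific, dated research keeps a design from being 18 months behind. **Functional**: the evidence shows what *fails*, not only what works (explanations can increase over-reliance on a *wrong* AI: Bansal et al., 2021, [arXiv:2006.14779](https://arxiv.org/abs/2006.14779)). **Safe**: verifier-gated scope is the architecture the evidence supports, and the protocol enforces it on its own output. Sourcing is not an academic game; it is the chain of evidence.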
|
|
79
116
|
|
|
80
117
|
## Security
|
|
81
118
|
|
|
82
|
-
`study-swarm`
|
|
119
|
+
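`study-swarm` ships a **thin, zero-dependency CLI** (`study-swarm`) alongside the methodology. It makes **no network or model calls and collects no telemetry**; there are no secrets or credentials in the source. At runtime it only reads the files you pass to `lint` and writes a single `<slug>.dispatch.md` file in the current directory (it refuses to overwrite and never writes outside the working directory). The model-based verification the methodology describes (Step 4) is performed by the companion tools, not by this package. See [SECURITY.md](SECURITY.md).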
|
|
83
120
|
|
|
84
121
|
## Status
|
|
85
122
|
|
|
86
|
-
|
|
123
|
+
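A working protocol, externally verified by its own mechanism: different models checked its citations (see the proof above). This repository is the public reference; [PROTOCOL.md](PROTOCOL.md) is the executable form. It is part of the [dogfood-lab](https://github.com/dogfood-lab) family: methods and examples for building in the AI era.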
|
|
87
124
|
|
|
88
125
|
MIT licensed.
|
|
89
126
|
|
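The halt rules of Step 4 can be read as a small fail-closed decision table. The sketch below is one hedged reading of them in JavaScript. The verdict labels, the `ACTIONS` table, and the `nextAction` function are hypothetical names, not part of the study-swarm CLI (which makes no model calls), and the per-verdict actions follow the summary in the package's example dispatch.

```javascript
// Hypothetical verdict labels; the real verifier's output format is not shown in this diff.
const ACTIONS = {
  verified: 'proceed',
  fabricated: 'halt: drop the finding',
  misattributed: 'halt: correct the attribution once',
  unavailable: 'halt and escalate',
};

// Fail closed: an unrecognized verdict is treated like an unreachable verifier,
// so "could not reach it" is never read as "the citation is fine".
function nextAction(verdict) {
  const known = Object.prototype.hasOwnProperty.call(ACTIONS, verdict);
  return known ? ACTIONS[verdict] : ACTIONS.unavailable;
}

console.log(nextAction('fabricated'));
console.log(nextAction('timeout'));
```

Defaulting to "halt and escalate" is an assumption of this sketch; it was chosen so that a new or misspelled verdict can never let a citation through silently.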
package/SECURITY.md
CHANGED
|
@@ -1,15 +1,15 @@
|
|
|
1
1
|
# Security Policy
|
|
2
2
|
|
|
3
|
-
`study-swarm` is a **
|
|
3
|
+
`study-swarm` is the study-swarm methodology (Markdown) plus a **thin, zero-dependency command-line tool**, published as the npm package `@dogfood-lab/study-swarm`. The CLI ships in the package (`bin/study-swarm.mjs`), so installing it exposes a `study-swarm` executable. It has **no runtime dependencies** and makes **no network or model calls** — the model-based verification the methodology describes (Step 4) is run by separate tools, not by this package.
|
|
4
4
|
|
|
5
5
|
## Threat model
|
|
6
6
|
|
|
7
|
-
- **What it
|
|
8
|
-
- **What it does NOT
|
|
9
|
-
- **
|
|
10
|
-
- **Permissions required:**
|
|
7
|
+
- **What it runs:** a small Node CLI (Node >= 18). `protocol`, `version`, and `help` only print text. `lint <file>` **reads** the file you name. `new <slug>` **writes** exactly one file — `<slug>.dispatch.md` — in the current working directory, and refuses to overwrite an existing file. The slug is sanitized to a single filename (path separators are replaced with `-`, pure-dots slugs rejected), so `new` cannot write outside the current directory.
|
|
8
|
+
- **What it does NOT do:** no network access, no model calls, no telemetry, no filesystem access beyond the two cases above, no use of credentials or environment beyond what Node needs to run.
|
|
9
|
+
- **Secrets/credentials:** none in source or output.
|
|
10
|
+
- **Permissions required:** filesystem read for `lint`; one-file write (in the working directory) for `new`. Nothing else.
|
|
11
11
|
|
|
12
|
-
The methodology *describes* a workflow that uses web retrieval and model-based verification
|
|
12
|
+
The methodology *describes* a workflow that uses web retrieval and model-based verification; those are performed by the sibling tools ([prism-verify](https://github.com/mcp-tool-shop-org/prism-verify), [role-os](https://github.com/mcp-tool-shop-org/role-os)), not by this package.
|
|
13
13
|
|
|
14
14
|
## Supported versions
|
|
15
15
|
|
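The threat model's claim that `new` cannot write outside the current directory rests on slug sanitization. A minimal sketch of that rule follows, reusing the two regular expressions from `cmdNew` in `bin/study-swarm.mjs`. The function name `sanitizeSlug` and the `null` return value are illustrative assumptions; the CLI itself exits with code 2 on an invalid slug.

```javascript
// Sketch of the slug rule: strip any trailing ".dispatch.md" (even if repeated),
// then replace every character that is not a word char, dot, or hyphen with "-".
function sanitizeSlug(slug) {
  const stem = String(slug).replace(/(\.dispatch\.md)+$/i, '');
  const safe = stem.replace(/[^\w.\-]/g, '-');
  // Empty or pure-dots slugs ("", ".", "..") are rejected.
  if (!safe || /^\.+$/.test(safe)) return null;
  return `${safe}.dispatch.md`;
}

console.log(sanitizeSlug('my-decision'));      // my-decision.dispatch.md
console.log(sanitizeSlug('../../etc/passwd')); // ..-..-etc-passwd.dispatch.md
console.log(sanitizeSlug('..'));               // null
```

Because path separators become `-`, the result is always a single filename, so a traversal attempt degrades into an oddly named file in the working directory.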
package/bin/study-swarm.mjs
CHANGED
|
@@ -1,13 +1,15 @@
|
|
|
1
1
|
#!/usr/bin/env node
|
|
2
2
|
// study-swarm — thin CLI for the research-grounded design protocol.
|
|
3
3
|
// Zero runtime dependencies. Commands: protocol | new | lint | help | version.
|
|
4
|
-
import { readFileSync, writeFileSync, existsSync } from 'node:fs';
|
|
4
|
+
import { readFileSync, writeFileSync, existsSync, statSync, readdirSync } from 'node:fs';
|
|
5
5
|
import { fileURLToPath } from 'node:url';
|
|
6
|
-
import { dirname, resolve } from 'node:path';
|
|
6
|
+
import { dirname, resolve, join } from 'node:path';
|
|
7
|
+
import { createHash } from 'node:crypto';
|
|
7
8
|
|
|
8
9
|
const __dirname = dirname(fileURLToPath(import.meta.url));
|
|
9
10
|
const PKG = JSON.parse(readFileSync(resolve(__dirname, '../package.json'), 'utf8'));
|
|
10
11
|
const VERSION = PKG.version;
|
|
12
|
+
const PROTOCOL_PATH = resolve(__dirname, '../PROTOCOL.md');
|
|
11
13
|
|
|
12
14
|
const HELP = `study-swarm v${VERSION} — ground design decisions in cited research, then verify.
|
|
13
15
|
|
|
@@ -15,17 +17,24 @@ USAGE
|
|
|
15
17
|
study-swarm <command> [args]
|
|
16
18
|
|
|
17
19
|
COMMANDS
|
|
18
|
-
protocol
|
|
19
|
-
new <slug>
|
|
20
|
-
lint <
|
|
21
|
-
|
|
22
|
-
|
|
20
|
+
protocol Print the locked protocol (the five steps + halt rules).
|
|
21
|
+
new <slug> Scaffold a dispatch file <slug>.dispatch.md to fill in.
|
|
22
|
+
lint [--json] <path...> Check dispatches' citations against the sourcing standard.
|
|
23
|
+
A <path> may be a file, a directory (linted recursively for
|
|
24
|
+
*.dispatch.md), or "-" to read one dispatch from stdin.
|
|
25
|
+
help Show this help.
|
|
26
|
+
version Print the version.
|
|
23
27
|
|
|
24
28
|
EXIT CODES
|
|
25
29
|
0 ok / lint clean
|
|
26
30
|
1 lint found sourcing violations
|
|
27
31
|
2 usage or runtime error
|
|
28
32
|
|
|
33
|
+
NOTE
|
|
34
|
+
lint checks citation FORM (Step 3: author + year + a resolvable arXiv/DOI/URL,
|
|
35
|
+
no "studies show…" gestures) — it does not judge whether a source is legitimate
|
|
36
|
+
or actually supports the claim. That is Step 4, below.
|
|
37
|
+
|
|
29
38
|
Run a dispatch's model-based verification with: roleos verify-citations <file>
|
|
30
39
|
Docs: https://dogfood-lab.github.io/study-swarm/
|
|
31
40
|
`;
|
|
@@ -35,19 +44,29 @@ function fail(code, msg) {
|
|
|
35
44
|
process.exit(code);
|
|
36
45
|
}
|
|
37
46
|
|
|
47
|
+
// Short hash of the vendored PROTOCOL.md, so a scaffolded dispatch records the exact
|
|
48
|
+
// methodology version it was authored against (the package vendors PROTOCOL.md for this).
|
|
49
|
+
function protocolHash() {
|
|
50
|
+
try { return createHash('sha256').update(readFileSync(PROTOCOL_PATH)).digest('hex').slice(0, 16); }
|
|
51
|
+
catch { return 'unknown'; }
|
|
52
|
+
}
|
|
53
|
+
|
|
38
54
|
function cmdProtocol() {
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
55
|
+
if (!existsSync(PROTOCOL_PATH)) fail(2, 'PROTOCOL.md not found in package');
|
|
56
|
+
try { process.stdout.write(readFileSync(PROTOCOL_PATH, 'utf8')); }
|
|
57
|
+
catch (err) { fail(2, `cannot read PROTOCOL.md in package: ${err && err.code ? err.code : err.message}`); }
|
|
42
58
|
}
|
|
43
59
|
|
|
44
|
-
const template = (slug) =>
|
|
60
|
+
const template = (slug, stamp) => `<!-- ${stamp} -->
|
|
61
|
+
# Study-swarm dispatch: ${slug}
|
|
45
62
|
|
|
46
63
|
> Fill in each section. Verify citations (Step 4) BEFORE connecting findings to the design (Step 5).
|
|
47
64
|
> Lint the sourcing with: study-swarm lint ${slug}.dispatch.md
|
|
48
65
|
|
|
49
66
|
## Step 1 — Load-bearing questions
|
|
50
|
-
<!-- 3-5 questions where empirical evidence would change the answer. Fewer is fine if the decision is substantial.
|
|
67
|
+
<!-- 3-5 questions where empirical evidence would change the answer. Fewer is fine if the decision is substantial.
|
|
68
|
+
A question is load-bearing if you can picture two designs hinging on the answer and the honest current
|
|
69
|
+
answer is "I think…", not "evidence says…". Don't manufacture questions to hit a count. -->
|
|
51
70
|
1.
|
|
52
71
|
2.
|
|
53
72
|
3.
|
|
@@ -57,7 +76,8 @@ const template = (slug) => `# Study-swarm dispatch: ${slug}
|
|
|
57
76
|
|
|
58
77
|
## Step 3 — Research grounding
|
|
59
78
|
<!-- One entry per finding (this is what 'lint' checks):
|
|
60
|
-
N. **<finding>.** <Authors> <year> (<arXiv:NNNN.NNNNN | DOI>). <design implication>.
|
|
79
|
+
N. **<finding>.** <Authors> <year> (<arXiv:NNNN.NNNNN | DOI>). <design implication>.
|
|
80
|
+
e.g.: 1. **Contrastive explanations with a predicted human foil improve independent decisions.** Buçinca et al. 2024 (arXiv:2410.04253). Implication: every recommendation carries a "you might think X; I chose Y because…" frame. -->
|
|
61
81
|
1. **<finding>.** <Authors> <year> (arXiv:____.____). <implication>.
|
|
62
82
|
|
|
63
83
|
## Step 4 — External verification
|
|
@@ -73,60 +93,168 @@ const template = (slug) => `# Study-swarm dispatch: ${slug}
|
|
|
73
93
|
|
|
74
94
|
function cmdNew(slug) {
|
|
75
95
|
if (!slug) fail(2, 'usage: study-swarm new <slug>');
|
|
76
|
-
|
|
96
|
+
// Reduce the slug to a single safe filename: strip any trailing .dispatch.md (even if
|
|
97
|
+
// repeated), then collapse anything that isn't a word char, dot, or hyphen to '-'. Path
|
|
98
|
+
// separators ('/' and '\') are NOT permitted — `new` writes ONE file in the current
|
|
99
|
+
// directory and must never traverse out of it. A pure-dots slug ('.', '..') is rejected.
|
|
100
|
+
const stem = String(slug).replace(/(\.dispatch\.md)+$/i, '');
|
|
101
|
+
const safe = stem.replace(/[^\w.\-]/g, '-');
|
|
102
|
+
if (!safe || /^\.+$/.test(safe)) {
|
|
103
|
+
fail(2, `invalid slug "${slug}" — use letters, digits, '.', or '-' (the file stays in the current directory)`);
|
|
104
|
+
}
|
|
77
105
|
const out = `${safe}.dispatch.md`;
|
|
78
106
|
if (existsSync(out)) fail(2, `refusing to overwrite existing ${out}`);
|
|
79
|
-
|
|
80
|
-
|
|
107
|
+
// Provenance stamp: pins the methodology version a dispatch was authored against.
|
|
108
|
+
const stamp = `study-swarm v${VERSION} · protocol-sha256:${protocolHash()} · created:${new Date().toISOString().slice(0, 10)}`;
|
|
109
|
+
writeFileSync(out, template(safe, stamp), 'utf8');
|
|
110
|
+
const note = safe === stem ? '' : ` (slug sanitized to "${safe}")`;
|
|
111
|
+
process.stdout.write(`Created ${out}${note}\nFill it in, then: study-swarm lint ${out}\n`);
|
|
81
112
|
}
|
|
82
113
|
|
|
83
|
-
|
|
84
|
-
if (!file) fail(2, 'usage: study-swarm lint <file>');
|
|
85
|
-
if (!existsSync(file)) fail(2, `file not found: ${file}`);
|
|
86
|
-
const lines = readFileSync(file, 'utf8').split(/\r?\n/);
|
|
114
|
+
// --- lint core ------------------------------------------------------------
|
|
87
115
|
|
|
88
|
-
|
|
89
|
-
|
|
116
|
+
const YEAR = /\b(19|20)\d{2}\b/;
|
|
117
|
+
const ID = /(arxiv:\s*\d{4}\.\d{4,5}|10\.\d{4,9}\/\S+|https?:\/\/\S+)/i;
|
|
118
|
+
const PLACEHOLDER = /arXiv:_{2,}|<finding>|<authors>|<year>|<implication>/i;
|
|
119
|
+
const BANNED = /\b(studies show|research suggests|it'?s well[- ]established|well[- ]established that)\b/i;
|
|
120
|
+
// An author cite: a capitalized name (Unicode-aware, so "Buçinca" counts), optionally
|
|
121
|
+
// followed by "et al.", "&", "and", or further surnames, immediately before the year.
|
|
122
|
+
// Accepts "Huang et al. 2023", "Walters & Wilder 2023", "Panickssery, Bowman & Feng 2024";
|
|
123
|
+
// flags an author-less finding like "**Foo.** 2024 (arXiv:…)".
|
|
124
|
+
const AUTHOR = /\p{Lu}[\p{L}.'’-]+(?:\s*,?\s*(?:&|and|et al\.?|\p{Lu}[\p{L}.'’-]+))*\s+\(?(?:19|20)\d{2}/u;
|
|
125
|
+
|
|
126
|
+
// Check one dispatch's text. Returns a structured result; never exits.
|
|
127
|
+
function lintText(label, raw) {
|
|
128
|
+
const lines = raw.split(/\r?\n/);
|
|
129
|
+
const problems = []; // { finding, line, rule, message }
|
|
130
|
+
const add = (rule, message, line = null, finding = null) => problems.push({ finding, line, rule, message });
|
|
131
|
+
|
|
132
|
+
// Find the "Research grounding" heading whose TEXT ends with that phrase (last wins), so a
|
|
133
|
+
// title that merely mentions "research grounding" above the real section can't shadow it.
|
|
134
|
+
let start = -1;
|
|
135
|
+
for (let i = 0; i < lines.length; i++) {
|
|
136
|
+
const h = lines[i].match(/^#{1,6}\s+(.*?)\s*$/);
|
|
137
|
+
if (h && /research grounding$/i.test(h[1])) start = i;
|
|
138
|
+
}
|
|
139
|
+
if (start === -1) {
|
|
140
|
+
add('no-section', 'no "Research grounding" section found — every dispatch needs one (Step 3).');
|
|
141
|
+
return { file: label, ok: false, findingCount: 0, problems, findings: [] };
|
|
142
|
+
}
|
|
90
143
|
let end = lines.length;
|
|
91
144
|
for (let i = start + 1; i < lines.length; i++) {
|
|
92
145
|
if (/^#{1,6}\s/.test(lines[i])) { end = i; break; }
|
|
93
146
|
}
|
|
94
147
|
const section = lines.slice(start + 1, end);
|
|
95
148
|
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
const
|
|
99
|
-
const BANNED = /\b(studies show|research suggests|it'?s well[- ]established|well[- ]established that)\b/i;
|
|
100
|
-
|
|
101
|
-
// Each numbered list item (with its continuation lines) is one finding.
|
|
102
|
-
const findings = [];
|
|
149
|
+
// Split into findings (numbered items + continuation lines), ignoring fenced code blocks
|
|
150
|
+
// so a "1." inside a ``` example isn't mistaken for a finding. Track each finding's line.
|
|
151
|
+
const findings = []; // { text, line }
|
|
103
152
|
let cur = null;
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
|
|
108
|
-
|
|
153
|
+
let inFence = false;
|
|
154
|
+
section.forEach((l, idx) => {
|
|
155
|
+
if (/^\s*(```|~~~)/.test(l)) { inFence = !inFence; return; }
|
|
156
|
+
if (inFence) return;
|
|
157
|
+
if (/^\s*\d+\.\s/.test(l)) { if (cur) findings.push(cur); cur = { text: l, line: start + 2 + idx }; }
|
|
158
|
+
else if (cur && l.trim()) cur.text += ' ' + l.trim();
|
|
159
|
+
});
|
|
160
|
+
if (cur) findings.push(cur);
|
|
161
|
+
|
|
162
|
+
if (findings.length === 0) add('no-findings', 'Research grounding has no numbered findings.');
|
|
109
163
|
|
|
110
|
-
const
|
|
111
|
-
if (findings.length === 0) problems.push('Research grounding has no numbered findings.');
|
|
164
|
+
const parsed = [];
|
|
112
165
|
findings.forEach((f, i) => {
|
|
113
166
|
const n = i + 1;
|
|
114
|
-
if (PLACEHOLDER.test(f))
|
|
115
|
-
|
|
116
|
-
|
|
167
|
+
if (PLACEHOLDER.test(f.text)) add('placeholder', `finding ${n}: still has template placeholders — fill it in.`, f.line, n);
|
|
168
|
+
// Strip identifiers before the year check so an arXiv id's YYMM prefix
|
|
169
|
+
// (e.g. the 2010 in an id like arXiv:2010.NNNNN) can't masquerade as a publication year.
|
|
170
|
+
const fNoIds = f.text.replace(/arxiv:\s*\d{4}\.\d{4,5}/gi, '').replace(/10\.\d{4,9}\/\S+/g, '');
|
|
171
|
+
if (!YEAR.test(fNoIds)) add('missing-year', `finding ${n}: missing a year (spell it out, e.g. "2024" — an arXiv id alone is not a year).`, f.line, n);
|
|
172
|
+
if (!AUTHOR.test(f.text)) add('missing-author', `finding ${n}: missing an author before the year (e.g. "Huang et al. 2023").`, f.line, n);
|
|
173
|
+
const idm = f.text.match(ID);
|
|
174
|
+
if (!idm) add('missing-id', `finding ${n}: missing an identifier (arXiv:NNNN.NNNNN, DOI, or URL).`, f.line, n);
|
|
175
|
+
const ym = fNoIds.match(YEAR);
|
|
176
|
+
const ident = idm ? idm[0].replace(/\s+/g, '').replace(/[).,;]+$/, '') : null;
|
|
177
|
+
parsed.push({ finding: n, year: ym ? ym[0] : null, identifier: ident });
|
|
117
178
|
});
|
|
118
|
-
|
|
119
|
-
|
|
120
|
-
|
|
179
|
+
|
|
180
|
+
// Banned gesture anywhere in the section (outside fences): a finding STATES its result,
|
|
181
|
+
// it never "studies show…" — a co-located citation doesn't redeem it.
|
|
182
|
+
let fence = false;
|
|
183
|
+
section.forEach((l, idx) => {
|
|
184
|
+
if (/^\s*(```|~~~)/.test(l)) { fence = !fence; return; }
|
|
185
|
+
if (!fence && BANNED.test(l)) {
|
|
186
|
+
add('banned-gesture', `line ${start + 2 + idx}: name the study (author + year + identifier), don't gesture: "${l.trim().slice(0, 56)}"`, start + 2 + idx);
|
|
121
187
|
}
|
|
122
188
|
});
|
|
123
189
|
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
|
|
127
|
-
|
|
190
|
+
return { file: label, ok: problems.length === 0, findingCount: findings.length, problems, findings: parsed };
|
|
191
|
+
}
|
|
192
|
+
|
|
193
|
+
// Recursively collect *.dispatch.md files under a directory (skips node_modules/.git).
|
|
194
|
+
function walkDispatches(dir) {
|
|
195
|
+
const out = [];
|
|
196
|
+
for (const entry of readdirSync(dir, { withFileTypes: true })) {
|
|
197
|
+
if (entry.name === 'node_modules' || entry.name === '.git') continue;
|
|
198
|
+
const full = join(dir, entry.name);
|
|
199
|
+
if (entry.isDirectory()) out.push(...walkDispatches(full));
|
|
200
|
+
else if (/\.dispatch\.md$/i.test(entry.name)) out.push(full);
|
|
201
|
+
}
|
|
202
|
+
return out.sort();
|
|
203
|
+
}
|
|
204
|
+
|
|
205
|
+
function readTarget(p) {
|
|
206
|
+
try { return { label: p, raw: readFileSync(p, 'utf8') }; }
|
|
207
|
+
catch (err) { fail(2, `cannot read ${p}: ${err && err.code ? err.code : err.message}`); }
|
|
208
|
+
}
|
|
209
|
+
|
|
210
|
+
function cmdLint(args) {
|
|
211
|
+
const json = args.includes('--json');
|
|
212
|
+
const paths = args.filter((a) => a !== '--json');
|
|
213
|
+
if (paths.length === 0) fail(2, 'usage: study-swarm lint [--json] <file|dir|-> [more...]');
|
|
214
|
+
|
|
215
|
+
const targets = [];
|
|
216
|
+
for (const p of paths) {
|
|
217
|
+
if (p === '-') {
|
|
218
|
+
let raw;
|
|
219
|
+
try { raw = readFileSync(0, 'utf8'); }
|
|
220
|
+
catch (err) { fail(2, `cannot read stdin: ${err && err.code ? err.code : err.message}`); }
|
|
221
|
+
targets.push({ label: '<stdin>', raw });
|
|
222
|
+
continue;
|
|
223
|
+
}
|
|
224
|
+
if (!existsSync(p)) fail(2, `path not found: ${p}`);
|
|
225
|
+
if (statSync(p).isDirectory()) {
|
|
226
|
+
const files = walkDispatches(p);
|
|
227
|
+
if (files.length === 0) fail(2, `no .dispatch.md files found under ${p}`);
|
|
228
|
+
for (const f of files) targets.push(readTarget(f));
|
|
229
|
+
} else {
|
|
230
|
+
targets.push(readTarget(p));
|
|
231
|
+
}
|
|
232
|
+
}
|
|
233
|
+
|
|
234
|
+
const results = targets.map((t) => lintText(t.label, t.raw));
|
|
235
|
+
const anyFail = results.some((r) => !r.ok);
|
|
236
|
+
|
|
237
|
+
if (json) {
|
|
238
|
+
const payload = results.length === 1 ? results[0] : { ok: !anyFail, files: results };
|
|
239
|
+
process.stdout.write(JSON.stringify(payload) + '\n');
|
|
240
|
+
process.exit(anyFail ? 1 : 0);
|
|
241
|
+
}
|
|
242
|
+
|
|
243
|
+
for (const r of results) {
|
|
244
|
+
if (r.ok) {
|
|
245
|
+
process.stdout.write(`ok ${r.file}: ${r.findingCount} finding(s), all sourced.\n`);
|
|
246
|
+
} else {
|
|
247
|
+
process.stderr.write(`x ${r.file}: ${r.problems.length} sourcing issue(s)\n`);
|
|
248
|
+
for (const pr of r.problems) process.stderr.write(` - ${pr.message}\n`);
|
|
249
|
+
}
|
|
250
|
+
}
|
|
251
|
+
if (!anyFail) {
|
|
252
|
+
process.stdout.write(
|
|
253
|
+
`\nStep 3 (sourcing FORM) is satisfied — this does NOT confirm the citations exist or support the claim.\n` +
|
|
254
|
+
`Run Step 4 (existence + groundedness, a different model family): roleos verify-citations <file>\n`,
|
|
255
|
+
);
|
|
128
256
|
}
|
|
129
|
-
process.
|
|
257
|
+
process.exit(anyFail ? 1 : 0);
|
|
130
258
|
}
|
|
131
259
|
|
|
132
260
|
function main(argv) {
|
|
@@ -134,7 +262,7 @@ function main(argv) {
|
|
|
134
262
|
switch (cmd) {
|
|
135
263
|
case 'protocol': return cmdProtocol();
|
|
136
264
|
case 'new': return cmdNew(rest[0]);
|
|
137
|
-
case 'lint': return cmdLint(rest
|
|
265
|
+
case 'lint': return cmdLint(rest);
|
|
138
266
|
case 'version': case '--version': case '-v':
|
|
139
267
|
return void process.stdout.write(VERSION + '\n');
|
|
140
268
|
case 'help': case '--help': case '-h': case undefined:
|
|
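The `--json` output of `cmdLint` has two shapes: a single result object when one target is linted, and `{ ok, files: [...] }` when there are several. A hedged sketch of a consumer that handles both is shown here. `summarizeReport` and the sample data are hypothetical; only the field names (`file`, `ok`, `findingCount`, `problems`, `rule`, `message`) are taken from the code above.

```javascript
// Sketch: fold either --json shape into one summary object.
function summarizeReport(report) {
  // Several targets arrive as { ok, files: [...] }; one target is a bare result.
  const results = Array.isArray(report.files) ? report.files : [report];
  const failing = results.filter((r) => !r.ok);
  return {
    ok: failing.length === 0,
    checked: results.length,
    findings: results.reduce((sum, r) => sum + r.findingCount, 0),
    messages: failing.flatMap((r) =>
      r.problems.map((p) => `${r.file}: [${p.rule}] ${p.message}`)),
  };
}

// Hypothetical sample report, made up for illustration.
const sample = {
  ok: false,
  files: [
    { file: 'a.dispatch.md', ok: true, findingCount: 2, problems: [], findings: [] },
    {
      file: 'b.dispatch.md', ok: false, findingCount: 1, findings: [],
      problems: [{ finding: 1, line: 12, rule: 'missing-year', message: 'finding 1: missing a year' }],
    },
  ],
};

console.log(summarizeReport(sample));
```

In a real pipeline the report would come from parsing the CLI's stdout with `JSON.parse`; the exit code (1 on violations) remains the authoritative CI signal.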
package/examples/study-swarm-ci.yml
@@ -0,0 +1,28 @@
|
|
|
1
|
+
# Copy this into YOUR repo at .github/workflows/dispatches.yml to gate the sourcing
|
|
2
|
+
# of every study-swarm dispatch on each pull request. It is a SAMPLE — it is not an
|
|
3
|
+
# active workflow in the study-swarm repo itself.
|
|
4
|
+
name: study-swarm lint
|
|
5
|
+
|
|
6
|
+
on:
|
|
7
|
+
pull_request:
|
|
8
|
+
paths:
|
|
9
|
+
- '**/*.dispatch.md'
|
|
10
|
+
- '.github/workflows/dispatches.yml'
|
|
11
|
+
workflow_dispatch:
|
|
12
|
+
|
|
13
|
+
concurrency:
|
|
14
|
+
group: ${{ github.workflow }}-${{ github.ref }}
|
|
15
|
+
cancel-in-progress: true
|
|
16
|
+
|
|
17
|
+
jobs:
|
|
18
|
+
lint:
|
|
19
|
+
runs-on: ubuntu-latest
|
|
20
|
+
timeout-minutes: 5
|
|
21
|
+
steps:
|
|
22
|
+
- uses: actions/checkout@v4
|
|
23
|
+
- uses: actions/setup-node@v4
|
|
24
|
+
with:
|
|
25
|
+
node-version: '20'
|
|
26
|
+
# Lint every dispatch under dispatches/ (a file, a dir, or '-' for stdin all work).
|
|
27
|
+
# Exit 1 on any sourcing violation fails the check. Add --json for machine-readable output.
|
|
28
|
+
- run: npx @dogfood-lab/study-swarm@latest lint dispatches/
|
|
package/examples/study-swarm-self.dispatch.md
@@ -0,0 +1,46 @@
|
|
|
1
|
+
<!-- study-swarm vX.Y.Z · protocol-sha256:<vendored> · a worked, lint-clean reference dispatch -->
|
|
2
|
+
# Study-swarm dispatch: study-swarm-self
|
|
3
|
+
|
|
4
|
+
> A complete, **lint-clean** example dispatch — study-swarm applied to its own
|
|
5
|
+
> central design decision. Run `study-swarm lint examples/study-swarm-self.dispatch.md`
|
|
6
|
+
> (it passes), then read it as a model for what a filled-in dispatch looks like end to end.
|
|
7
|
+
|
|
8
|
+
## Step 1 — Load-bearing questions
|
|
9
|
+
|
|
10
|
+
<!-- Each is load-bearing: two real designs hinge on the answer, and the honest prior is "I think", not "evidence says". -->
|
|
11
|
+
|
|
12
|
+
1. When an LLM makes a substantial design call, can the *same* model reliably verify its own citations, or does the verifier have to be a separate model?
|
|
13
|
+
2. Is confirming a cited paper *exists* enough, or must "the source supports this claim" be checked as a separate axis?
|
|
14
|
+
3. Does adding *more* verifiers improve coverage, or does the diversity of the verifiers matter more than their count?
|
|
15
|
+
|
|
16
|
+
## Step 2 — Research dispatch
|
|
17
|
+
|
|
18
|
+
<!-- One research agent per question, in parallel; each returned paper titles + authors + years + URLs + a one-sentence finding, web-retrieval required (no recall-only citations). -->
|
|
19
|
+
|
|
20
|
+
Three parallel agents, scoped to empirical evidence (not opinion), word-capped, "specificity over breadth — 6–8 well-sourced findings beat 20 vague gestures." Their citations (below) were then resolved against arXiv/Crossref before any informed the design.
|
|
21
|
+
|
|
22
|
+
## Step 3 — Research grounding
|
|
23
|
+
|
|
24
|
+
1. **LLMs struggle to self-correct without external feedback, and can degrade after self-correction.** Huang et al. 2023 (arXiv:2310.01798). Implication: the verifier cannot be the generator itself — an external check is required (answers Q1).
|
|
25
|
+
2. **Autoregressive LLMs cannot self-verify; pair the generator with an external model-based verifier.** Kambhampati et al. 2024 (arXiv:2402.01817). Implication: the architecture is generator + separate verifier, not self-critique (answers Q1).
|
|
26
|
+
3. **An LLM judge's self-recognition correlates *linearly* with its self-preference bias.** Panickssery, Bowman & Feng 2024 (arXiv:2404.13076). Implication: the verifier must be a *different model family*, since partial blinding of a same-family judge does not remove the bias (answers Q1).
|
|
27
|
+
4. **18–55% of LLM-generated citations are fabricated, and many real ones carry bibliographic errors.** Walters & Wilder 2023 (doi:10.1038/s41598-023-41032-5). Implication: existence must be established by *retrieval* (resolve the arXiv/DOI), never by the model's recall (answers Q2).
|
|
28
|
+
5. **Cited links resolve >94% of the time, yet only 39–77% of the content actually supports the claim.** Onweller et al. 2026 (arXiv:2605.06635). Implication: groundedness is a distinct axis from existence — "the link resolves" is not "the paper says this" (answers Q2).
|
|
29
|
+
6. **Decorrelated verifiers (pairwise ρ ∈ [0.05, 0.25]) beat any single one via submodular coverage.** Rajan 2025 (arXiv:2511.16708). Implication: spend the budget on *lens diversity* (a retrieval oracle + ≥2 different families), not on more copies of one judge (answers Q3).
|
|
30
|
+
|
|
31
|
+
## Step 4 — External verification
|
|
32
|
+
|
|
33
|
+
<!-- This dispatch's own citations were gated this way before Step 5 was written. -->
|
|
34
|
+
|
|
35
|
+
- [x] every citation resolved by retrieval (arXiv/DOI), not model memory — arXiv API + OpenAlex + Crossref
|
|
36
|
+
- [x] every finding matches what its source actually claims (groundedness) — checked against each abstract
|
|
37
|
+
- [x] >= 3 decorrelated lenses (retrieval oracle + >= 2 different model families) — oracle + Mistral + IBM Granite, reasoning-stripped
|
|
38
|
+
|
|
39
|
+
Result: all six citations VERIFIED (existence + attribution + groundedness). Two blind traps seeded into a sibling set — a misattribution and a fabricated paper — were caught by the *union* of the two families, not either alone.
|
|
40
|
+
|
|
41
|
+
## Step 5 — Architecture
|
|
42
|
+
|
|
43
|
+
- The verifier is a **different model family** from the synthesizer, run reasoning-stripped. (findings 1, 2, 3)
|
|
44
|
+
- Verification is **two-stage per citation**: a retrieval oracle confirms existence, then a groundedness lens confirms the source supports the claim. (findings 4, 5)
|
|
45
|
+
- The verifier is an **ensemble of decorrelated lenses** (retrieval oracle + ≥2 different families), because diversity — not count — drives coverage. (finding 6)
|
|
46
|
+
- On a non-clean verdict the finding **halts** (fabricated → dropped; misattributed → corrected once; unavailable → escalate), never silently proceeds. (findings 1, 4)
|
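To see why the findings in this example dispatch pass `lint`, it helps to trace how a year and an identifier are pulled out of one finding. The sketch below reuses the `YEAR` and `ID` patterns from `bin/study-swarm.mjs`; the wrapper `parseFinding` and the sample strings are hypothetical, and the author check is left out for brevity.

```javascript
const YEAR = /\b(19|20)\d{2}\b/;
const ID = /(arxiv:\s*\d{4}\.\d{4,5}|10\.\d{4,9}\/\S+|https?:\/\/\S+)/i;

// Sketch: extract { year, identifier } from one finding's text.
function parseFinding(text) {
  // Remove identifiers first so digits inside an arXiv id or DOI
  // are never mistaken for the publication year.
  const noIds = text
    .replace(/arxiv:\s*\d{4}\.\d{4,5}/gi, '')
    .replace(/10\.\d{4,9}\/\S+/g, '');
  const ym = noIds.match(YEAR);
  const idm = text.match(ID);
  return {
    year: ym ? ym[0] : null,
    // Drop whitespace and trailing punctuation captured with the identifier.
    identifier: idm ? idm[0].replace(/\s+/g, '').replace(/[).,;]+$/, '') : null,
  };
}

// A finding in the dispatch format, and a made-up one that has no year.
console.log(parseFinding('1. **X.** Huang et al. 2023 (arXiv:2310.01798). Implication: Y.'));
console.log(parseFinding('2. **X.** Someone et al. (arXiv:2010.00000). Implication: Y.'));
```

In the second sample the only year-like digits sit inside the arXiv id, so the year comes back as `null`, which is the case the linter reports as a missing year.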
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@dogfood-lab/study-swarm",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "1.0.0",
|
|
4
4
|
"description": "Ground design decisions in cited research, then verify every citation with a different model family before it becomes canon — a research-grounded design protocol, with a thin CLI.",
|
|
5
5
|
"keywords": [
|
|
6
6
|
"methodology",
|
|
@@ -34,6 +34,7 @@
|
|
|
34
34
|
},
|
|
35
35
|
"files": [
|
|
36
36
|
"bin/",
|
|
37
|
+
"examples/",
|
|
37
38
|
"README.md",
|
|
38
39
|
"README.ja.md",
|
|
39
40
|
"README.zh.md",
|