@mcptoolshop/research-os 0.3.3 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -7,7 +7,7 @@
  </p>
 
  <p align="center">
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.3.3"><img src="https://img.shields.io/badge/version-0.3.3-blue" alt="version 0.3.3"></a>
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,8 +149,46 @@ This is the structural alternative to *search → summarize → pretty report*.
 
  `research-os` is a local-first CLI. It reads and writes files within the research-pack directory you point it at, and (when using `gather`) issues outbound HTTP requests to fetch source URLs you provide. It does not: run a server, accept inbound connections, store credentials, or send telemetry. No secrets are written to pack artifacts. See [SECURITY.md](SECURITY.md) for the vulnerability reporting policy.
 
+ ## Reviewer calibration
+
+ v0.5.0 makes reviewer calibration durable. A reviewer profile is not trusted because
+ it ran once; it earns a status through structured seeded-failure receipts and
+ multi-run aggregation.
+
+ **No profile is currently admitted as `trusted_baseline`.** The canonical receipts
+ in the repo show `hermes-two-pass=failed`, `mistral-nemo-two-pass=conditional_pass`,
+ `hermes-single-pass=comparison_only`. This is intentional: trust is earned through
+ repeated seeded-failure evidence, not assumed.
+
+ Calibration receipts live at `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`.
+ Each receipt records PASS/FAIL against seven bars, four status labels
+ (`trusted_baseline`, `conditional_pass`, `failed`, `comparison_only`), and
+ honestly discloses what the fixture cannot test (`needs_contradiction_mapping`
+ is unreachable from `seeded-v1`). See [CHANGELOG.md](CHANGELOG.md).
+
+ ```bash
+ # Single-run calibration (quick local check)
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
+
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
+
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
+ ```
+
+ When `--runs <n>` is used, per-run receipts are written to `<profile>/runs/run-NNN.json`
+ and an aggregate receipt (with median-based bars and recurring-failure detection) is written
+ to `<profile>/seeded-v1.{json,md}`. The aggregate receipt carries `receipt_kind: 'aggregate'`
+ to discriminate from single-run receipts. Single-run mode (`--runs 1` or omitted) preserves
+ the existing direct-write behavior.
+
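Reviewer note: as a rough illustration of the multi-run aggregation described in the added README text, the sketch below folds per-run metric values into a `median` / `min` / `max` / `values` summary, the field names used by `AggregateMetricSchema` in this release's type declarations. The function name `aggregateMetric` and the even-count median rule (mean of the two middle values) are assumptions made for illustration, not the package's confirmed implementation.

```typescript
// Hypothetical helper: summarizes one metric across calibration runs.
// Output field names mirror AggregateMetricSchema; the logic itself is an assumption.
interface AggregateMetric {
  median: number;
  min: number;
  max: number;
  values: number[];
}

function aggregateMetric(values: number[]): AggregateMetric {
  if (values.length === 0) {
    throw new Error("at least one run is required");
  }
  // Sort a copy so the per-run order in `values` is preserved.
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd count: middle element. Even count: mean of the two middle elements (assumed rule).
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return { median, min: sorted[0], max: sorted[sorted.length - 1], values };
}

// Three illustrative runs of a recall ratio (made-up numbers, not receipt data).
const recall = aggregateMetric([0.8, 0.6, 0.7]);
console.log(recall.median, recall.min, recall.max);
```

A median keeps one outlier run from deciding a bar on its own, which fits the README's point that a profile is not trusted because it ran once.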
  ## Status
 
+ **v0.5.0** — published to npm as `@mcptoolshop/research-os@0.5.0`, 2026-05-10. v0.5.0 makes reviewer calibration durable. A reviewer profile is not trusted because it ran once; it earns a status through structured seeded-failure receipts and multi-run aggregation. Ships: structured calibration receipt schema (`seeded-v1.{json,md}`, Zod-validated, four status labels); multi-run harness (`--runs <n>`, per-run isolation, median-based PASS/FAIL bars, recurring-failure demotion); architecture-aware decision-vocab bar; pack-relative receipt lookup in `review-promote`. **No trusted baseline admitted:** `hermes-two-pass=failed` (aggregate, 3 runs), `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. research-os can now refuse to trust a reviewer profile when repeated seeded failures do not support trust. **No gate, freeze, or synthesis-law changes. All four frozen packs verify-pack byte-identically.** 671/671 vitest passing. See [CHANGELOG.md](CHANGELOG.md).
+
+ **v0.4.0** — published to npm as `@mcptoolshop/research-os@0.4.0`, 2026-05-10. v0.4.0 makes source identity durable. Deterministic source-type rules handle the repeatable majority, override ledgers preserve operator corrections across re-gather, and `source-card audit` replaces scratch-script drift checks with a first-class CLI surface. Ships: centralized source-type classifier (Component B — `classifySourceType`, 11 canonical vendors, `source-type-rules.json`); source-card override ledger (Component A — `source-card-overrides.jsonl`, `validate` + `list` subcommands); and source-card audit CLI (Component D — `research-os source-card audit --pack <dir>`, 7 finding kinds, JSON + Markdown artifacts, `--apply --from` apply path). F-46 cosmetic fix: pack manifests now stamp the live binary version rather than the version frozen into `research.yaml` at pack-init. **No gate, freeze, or synthesis-law changes. All four existing frozen packs verify-pack byte-identically.** 620/620 vitest passing. See [CHANGELOG.md](CHANGELOG.md) and the [source-card audit handbook page](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/).
+
  **v0.3.3** — published to npm as `@mcptoolshop/research-os@0.3.3`, 2026-05-10. Ships gate-semantics clarity earned by Pack-3 (Godot export/runtime durability, Experiment 3 pack #3 of 3). Gate output now carries section-scoped publisher + primary counts alongside pack-wide counts (F-43); `no_source_cluster_monopoly` reworded from WARN to informational diagnostic (F-41). **Pass/fail behavior unchanged; existing frozen packs verify-pack byte-identically.** 570/570 vitest passing. See [CHANGELOG.md](CHANGELOG.md) and [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md).
 
  **v0.3.2** — published to npm as `@mcptoolshop/research-os@0.3.2`, 2026-05-09. Ships normalized accepted-claim accounting for `pack publish` admission. The strict equality check between `claim-reviews.jsonl` and `pack-audit.json::accepted_claims` is replaced with an effective-set comparison — accepted claims are unique `claim_id`s whose latest canonical review decision is `accepted_for_synthesis` (latest-decision-wins per `claim_id`). Frozen packs whose legacy audit count differs from the effective set now admit with a warning rather than refusing; the legacy audit file is preserved verbatim (Law 15) while the archive manifest reflects the normalized count. Refusal stays hard for phantom claim_ids, incompatible duplicate decisions, and non-synthesis-eligible gates. Earned by Experiment 3 XRPL pack Session K — pack publish refused on a real closure-ledger seam disagreement (Section 07 had 24 raw `accepted_for_synthesis` rows but only 19 unique `claim_id`s due to overlapping reviewer windows). 558/558 vitest passing. See [CHANGELOG.md](CHANGELOG.md) and [`docs/pack-publish.md`](docs/pack-publish.md).
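Reviewer note: the effective-set rule in the v0.3.2 entry can be sketched as follows. Only `claim_id` and the `accepted_for_synthesis` decision value come from the README text; the `ReviewRow` shape, the `decision` field name, the `rejected` value, the function name `effectiveAcceptedClaims`, and the assumption that later rows are later decisions are all hypothetical. The sketch also leaves out the hard-refusal cases (phantom `claim_id`s, incompatible duplicate decisions).

```typescript
// Hypothetical row shape for one line of claim-reviews.jsonl.
interface ReviewRow {
  claim_id: string;
  decision: string;
}

// Latest-decision-wins: a later row overwrites an earlier row for the same claim_id
// (assumes file order is chronological).
function effectiveAcceptedClaims(rows: ReviewRow[]): Set<string> {
  const latest = new Map<string, string>();
  for (const row of rows) {
    latest.set(row.claim_id, row.decision);
  }
  const accepted = new Set<string>();
  latest.forEach((decision, claimId) => {
    if (decision === "accepted_for_synthesis") {
      accepted.add(claimId);
    }
  });
  return accepted;
}

const reviewRows: ReviewRow[] = [
  { claim_id: "c1", decision: "accepted_for_synthesis" },
  // Duplicate acceptance, as could come from overlapping reviewer windows.
  { claim_id: "c1", decision: "accepted_for_synthesis" },
  { claim_id: "c2", decision: "accepted_for_synthesis" },
  // Hypothetical later decision that replaces the earlier acceptance of c2.
  { claim_id: "c2", decision: "rejected" },
];
const acceptedClaims = effectiveAcceptedClaims(reviewRows);
console.log(reviewRows.length, acceptedClaims.size);
```

Counting raw rows and counting unique accepted `claim_id`s give different numbers here, which is the same kind of gap as the 24 rows versus 19 unique ids reported for Section 07.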
package/README.pt-BR.md CHANGED
@@ -7,7 +7,7 @@
  </p>
 
  <p align="center">
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.3.3"><img src="https://img.shields.io/badge/version-0.3.3-blue" alt="version 0.3.3"></a>
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,8 +149,33 @@ Esta é a alternativa estrutural para *pesquisar → resumir → gerar relatóri
 
  `research-os` is a command-line tool that runs locally. It reads and writes files inside the research-pack directory you specify and, when you use the `gather` command, makes HTTP requests to fetch the source URLs you provide. It does not: run a server, accept inbound connections, store credentials, or send telemetry data. No secrets are written to the pack files. See [SECURITY.md](SECURITY.md) for the vulnerability reporting policy.
 
+ ## Reviewer calibration
+
+ v0.5.0 makes reviewer calibration more robust. A reviewer profile is not considered trusted just because it ran once; it earns a status through structured seeded-failure receipts and multi-run aggregation.
+
+ **Currently, no profile is admitted as a "trusted baseline".** The canonical receipts in the repository show `hermes-two-pass=failed`, `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. This is intentional: trust is earned through repeated seeded-failure evidence, not presumed.
+
+ Calibration receipts are located at `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`. Each receipt records PASS/FAIL against seven criteria, four status labels (`trusted_baseline`, `conditional_pass`, `failed`, `comparison_only`), and honestly discloses what the test cannot verify (`needs_contradiction_mapping` is unreachable from `seeded-v1`). See [CHANGELOG.md](CHANGELOG.md).
+
+ ```bash
+ # Single-run calibration (quick local check)
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
+
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
+
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
+ ```
+
+ When `--runs <n>` is used, each run's receipt is written to `<profile>/runs/run-NNN.json` and an aggregate receipt (with median-based criteria and recurring-failure detection) is written to `<profile>/seeded-v1.{json,md}`. The aggregate receipt contains `receipt_kind: 'aggregate'` to distinguish it from single-run receipts. Single-run mode (`--runs 1` or omitted) preserves the existing direct-write behavior.
+
  ## Status
 
+ **v0.5.0** — published to npm as `@mcptoolshop/research-os@0.5.0`, May 10, 2026. v0.5.0 makes reviewer calibration more robust. A reviewer profile is not considered trusted just because it ran once; it earns a status through structured seeded-failure receipts and multi-run aggregation. Includes: structured calibration receipt schema (`seeded-v1.{json,md}`, Zod-validated, four status labels); multi-run harness (`--runs <n>`, per-run isolation, median-based PASS/FAIL criteria, recurring-failure detection); architecture-based evaluation criterion; pack-relative receipt lookup in `review-promote`. **No trusted baseline admitted:** `hermes-two-pass=failed` (aggregate, 3 runs), `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. research-os can now refuse to trust a reviewer profile when repeated seeded failures do not support trust. **No changes to gates, freezes, or synthesis laws. All four frozen packs verify byte-identically.** 671/671 vitest tests passing. See [CHANGELOG.md](CHANGELOG.md).
+
+ **v0.4.0** — published to npm as `@mcptoolshop/research-os@0.4.0`, May 10, 2026. v0.4.0 makes source identity durable. Deterministic source-type rules handle the repeatable majority, override ledgers preserve operator corrections across re-gather, and the `source-card audit` command replaces script-based drift checks with a full command-line interface (CLI). Includes: a centralized source-type classifier (Component B — `classifySourceType`, 11 canonical vendors, `source-type-rules.json`); a source-card override ledger (Component A — `source-card-overrides.jsonl`, `validate` and `list` subcommands); and a source-card audit CLI (Component D — `research-os source-card audit --pack <dir>`, 7 finding kinds, JSON + Markdown artifacts, `--apply --from` options for the apply path). F-46 cosmetic fix: manifest files now record the running binary version instead of the version pinned in `research.yaml` at pack initialization. **No changes to gate, freeze, or synthesis rules. All four existing packs pass byte-for-byte verification.** 620/620 vitest tests passing. See [CHANGELOG.md](CHANGELOG.md) and the source-card audit handbook page: [https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/).
+
  **v0.3.3** — published to npm as `@mcptoolshop/research-os@0.3.3`, May 10, 2026. Includes gate-semantics clarity improvements earned with Pack-3 (Godot export/runtime durability, Experiment 3, pack #3 of 3). Gate output now includes section-scoped counts alongside the pack-wide counts (F-43); the `no_source_cluster_monopoly` message was changed from a warning to an informational diagnostic (F-41). **Pass/fail behavior is unchanged; existing frozen packs verify byte-identically.** 570/570 vitest tests passing. See [CHANGELOG.md](CHANGELOG.md) and [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md).
 
  **v0.3.2** — published to npm as `@mcptoolshop/research-os@0.3.2`, May 9, 2026. Includes normalized accepted-claim accounting for pack publish admission. The strict equality check between `claim-reviews.jsonl` and `pack-audit.json::accepted_claims` was replaced with an effective-set comparison — accepted claims are the unique `claim_id`s whose latest canonical review decision is `accepted_for_synthesis` (the latest decision wins for each `claim_id`). Frozen packs whose legacy audit count differs from the effective set are now admitted with a warning instead of being rejected; the legacy audit file is preserved in full (Law 15), while the archive manifest reflects the normalized count. Rejection remains strict for nonexistent `claim_id`s, incompatible duplicate decisions, and gates not eligible for synthesis. Earned through Experiment 3 XRPL pack Session K — pack publication was rejected due to a real disagreement in the closure ledger (Section 07 contained 24 raw `accepted_for_synthesis` rows but only 19 unique `claim_id`s due to overlapping reviewer windows). 558/558 vitest tests passing. See [CHANGELOG.md](CHANGELOG.md) and [`docs/pack-publish.md`](docs/pack-publish.md).
package/README.zh.md CHANGED
@@ -7,7 +7,7 @@
  </p>
 
  <p align="center">
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.3.3"><img src="https://img.shields.io/badge/version-0.3.3-blue" alt="version 0.3.3"></a>
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,8 +149,33 @@ discover
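 
  `research-os` is a local-first command-line tool. It reads and writes files within the "research pack" directory you specify and, when you use the `gather` command, sends outbound HTTP requests to fetch the source URLs you provide. It does not: run a server, accept inbound connections, store credentials, or send telemetry data. No sensitive information is written to pack files. See [SECURITY.md](SECURITY.md) for the vulnerability reporting policy.
 
+ ## Reviewer calibration
+
+ v0.5.0 makes reviewer calibration more reliable. A reviewer profile is not trusted because it ran only once; it earns a trust status through structured seeded-failure receipts and multi-run aggregation.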
+
+ **No profile is currently admitted as `trusted_baseline`.** The canonical receipts in the repository show `hermes-two-pass=failed`, `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. This is intentional: trust is earned through repeated seeded-failure results, not granted by default.
+
+ Calibration receipts are located at `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`. Each receipt records PASS/FAIL results against seven bars, four status labels (`trusted_baseline`, `conditional_pass`, `failed`, `comparison_only`), and honestly discloses what the fixture cannot test (`needs_contradiction_mapping` is unreachable from `seeded-v1`). See [CHANGELOG.md](CHANGELOG.md).
+
+ ```bash
+ # Single-run calibration (quick local check)
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
+
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
+
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
+ ```
+
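+ When the `--runs <n>` option is used, each run's receipt is written to `<profile>/runs/run-NNN.json`, and an aggregate receipt (with median-based PASS/FAIL results and recurring-failure detection) is written to `<profile>/seeded-v1.{json,md}`. The aggregate receipt contains `receipt_kind: 'aggregate'` to distinguish it from single-run receipts. Single-run mode (`--runs 1` or omitted) preserves the existing direct-write behavior.
+
  ## Status
 
+ **v0.5.0** — published to npm as `@mcptoolshop/research-os@0.5.0`, 2026-05-10. v0.5.0 makes reviewer calibration more reliable. A reviewer profile is not trusted because it ran only once; it earns a trust status through structured seeded-failure receipts and multi-run aggregation. Includes: structured calibration receipt schema (`seeded-v1.{json,md}`, Zod-validated, four status labels); multi-run harness (`--runs <n>`, per-run isolation, median-based PASS/FAIL results, recurring-failure demotion); architecture-aware decision vocabulary; pack-relative receipt lookup in `review-promote`. **No trusted baseline:** `hermes-two-pass=failed` (aggregate, 3 runs), `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. research-os can now refuse to trust a reviewer profile when repeated seeded-failure results do not support trust. **No changes to gates, freezes, or synthesis rules. All four existing frozen packs verify byte-identically.** 671/671 vitest tests passing. See [CHANGELOG.md](CHANGELOG.md).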
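+
+ **v0.4.0** — published to npm as `@mcptoolshop/research-os@0.4.0`, 2026-05-10. v0.4.0 makes source identity more reliable. Deterministic source-type rules handle the repeatable majority, override ledgers preserve operator corrections, and `source-card audit` replaces scratch-script drift checks with a first-class command-line interface. Includes: a centralized source-type classifier (Component B — `classifySourceType`, 11 canonical vendors, `source-type-rules.json`); a source-card override ledger (Component A — `source-card-overrides.jsonl`, `validate` + `list` subcommands); and a source-card audit CLI (Component D — `research-os source-card audit --pack <dir>`, 7 finding kinds, JSON + Markdown formats, `--apply --from` for the apply path). F-46: a minor fix; pack manifests now record the actual binary version rather than the version frozen into `research.yaml` at pack initialization. **No changes to gates, freezes, or synthesis rules. All four existing frozen packs verify byte-identically.** 620/620 vitest tests passing. See [CHANGELOG.md](CHANGELOG.md) and the [source-card audit handbook page](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/).
+
  **v0.3.3** — published to npm as `@mcptoolshop/research-os@0.3.3`, May 10, 2026. This release improves the semantic clarity of the gate mechanism, a result earned by Pack-3 (Godot export/runtime stability, pack 3 of Experiment 3). Gate output now includes section-scoped publisher and primary counts in addition to the pack-wide counts (F-43); the `no_source_cluster_monopoly` warning was changed to an informational diagnostic (F-41). **Pass/fail behavior is unchanged; existing frozen packs verify byte-identically.** 570/570 vitest tests passing. See [CHANGELOG.md](CHANGELOG.md) and [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md).
 
  **v0.3.2** — published to npm as `@mcptoolshop/research-os@0.3.2`, May 9, 2026. This release normalizes accepted-claim accounting for the pack publish flow. The strict equality check between `claim-reviews.jsonl` and `pack-audit.json::accepted_claims` was replaced with a set comparison — accepted claims are the unique `claim_id`s whose latest canonical review decision is `accepted_for_synthesis` (latest decision wins per `claim_id`). Frozen packs whose legacy audit count differs from the set comparison result now produce a warning instead of a refusal; the original audit file is preserved in full (Law 15), while the archive manifest reflects the normalized count. Phantom `claim_id`s, incompatible duplicate decisions, and gates that are not synthesis-eligible are still refused. Earned by Experiment 3 XRPL pack Session K — pack publish was refused because of a real disagreement in the closure ledger (Section 07 had 24 raw `accepted_for_synthesis` rows but only 19 unique `claim_id`s due to overlapping review windows). 558/558 vitest tests passing. See [CHANGELOG.md](CHANGELOG.md) and [`docs/pack-publish.md`](docs/pack-publish.md).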
@@ -0,0 +1,509 @@
+ import { z } from 'zod';
+
+ declare const AggregateMetricSchema: z.ZodObject<{
+     median: z.ZodNumber;
+     min: z.ZodNumber;
+     max: z.ZodNumber;
+     values: z.ZodArray<z.ZodNumber, "many">;
+ }, "strip", z.ZodTypeAny, {
+     median: number;
+     min: number;
+     max: number;
+     values: number[];
+ }, {
+     median: number;
+     min: number;
+     max: number;
+     values: number[];
+ }>;
+ declare const PerCategoryAggregateEntrySchema: z.ZodObject<{
+     median_ratio: z.ZodNumber;
+     min_ratio: z.ZodNumber;
+     max_ratio: z.ZodNumber;
+     total: z.ZodNumber;
+     per_run_ratios: z.ZodArray<z.ZodNumber, "many">;
+ }, "strip", z.ZodTypeAny, {
+     median_ratio: number;
+     min_ratio: number;
+     max_ratio: number;
+     total: number;
+     per_run_ratios: number[];
+ }, {
+     median_ratio: number;
+     min_ratio: number;
+     max_ratio: number;
+     total: number;
+     per_run_ratios: number[];
+ }>;
+ declare const PerCategoryAggregateSchema: z.ZodRecord<z.ZodString, z.ZodObject<{
+     median_ratio: z.ZodNumber;
+     min_ratio: z.ZodNumber;
+     max_ratio: z.ZodNumber;
+     total: z.ZodNumber;
+     per_run_ratios: z.ZodArray<z.ZodNumber, "many">;
+ }, "strip", z.ZodTypeAny, {
+     median_ratio: number;
+     min_ratio: number;
+     max_ratio: number;
+     total: number;
+     per_run_ratios: number[];
+ }, {
+     median_ratio: number;
+     min_ratio: number;
+     max_ratio: number;
+     total: number;
+     per_run_ratios: number[];
+ }>>;
+ declare const AggregatePassFailSchema: z.ZodObject<{
+     fp_ceiling: z.ZodEnum<["PASS", "FAIL"]>;
+     any_flag_recall_floor: z.ZodEnum<["PASS", "FAIL"]>;
+     per_category_any_flag_floor: z.ZodEnum<["PASS", "FAIL"]>;
+     strict_recall_floor: z.ZodEnum<["PASS", "FAIL"]>;
+     decision_vocab_completeness: z.ZodEnum<["PASS", "FAIL"]>;
+     latency_soft: z.ZodEnum<["PASS", "WARN"]>;
+     latency_hard: z.ZodEnum<["PASS", "FAIL"]>;
+     empty_or_malformed: z.ZodEnum<["PASS", "FAIL"]>;
+     overall: z.ZodEnum<["PASS", "FAIL"]>;
+ }, "strip", z.ZodTypeAny, {
+     fp_ceiling: "PASS" | "FAIL";
+     any_flag_recall_floor: "PASS" | "FAIL";
+     per_category_any_flag_floor: "PASS" | "FAIL";
+     strict_recall_floor: "PASS" | "FAIL";
+     decision_vocab_completeness: "PASS" | "FAIL";
+     latency_soft: "PASS" | "WARN";
+     latency_hard: "PASS" | "FAIL";
+     empty_or_malformed: "PASS" | "FAIL";
+     overall: "PASS" | "FAIL";
+ }, {
+     fp_ceiling: "PASS" | "FAIL";
+     any_flag_recall_floor: "PASS" | "FAIL";
+     per_category_any_flag_floor: "PASS" | "FAIL";
+     strict_recall_floor: "PASS" | "FAIL";
+     decision_vocab_completeness: "PASS" | "FAIL";
+     latency_soft: "PASS" | "WARN";
+     latency_hard: "PASS" | "FAIL";
+     empty_or_malformed: "PASS" | "FAIL";
+     overall: "PASS" | "FAIL";
+ }>;
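Reviewer note: a small sketch of how a consumer might read the `pass_fail` block typed by `AggregatePassFailSchema`. The `PassFail` interface restates the inferred output shape by hand, and `failedBars` is a hypothetical helper, not an export of the package. It lists the bars marked `FAIL` and never reports `latency_soft`, which is an assumption based on that bar's `PASS | WARN` enum.

```typescript
type Verdict = "PASS" | "FAIL";

// Hand-written mirror of the inferred pass_fail shape (an illustration, not an import).
interface PassFail {
  fp_ceiling: Verdict;
  any_flag_recall_floor: Verdict;
  per_category_any_flag_floor: Verdict;
  strict_recall_floor: Verdict;
  decision_vocab_completeness: Verdict;
  latency_soft: "PASS" | "WARN";
  latency_hard: Verdict;
  empty_or_malformed: Verdict;
  overall: Verdict;
}

// Hypothetical helper: names of the individual bars that failed, excluding `overall`.
function failedBars(pf: PassFail): string[] {
  const bars = Object.keys(pf) as (keyof PassFail)[];
  return bars.filter((bar) => bar !== "overall" && pf[bar] === "FAIL");
}

// Made-up verdicts for illustration; these are not taken from any real receipt.
const examplePassFail: PassFail = {
  fp_ceiling: "PASS",
  any_flag_recall_floor: "PASS",
  per_category_any_flag_floor: "PASS",
  strict_recall_floor: "FAIL",
  decision_vocab_completeness: "FAIL",
  latency_soft: "WARN",
  latency_hard: "PASS",
  empty_or_malformed: "PASS",
  overall: "FAIL",
};
console.log(failedBars(examplePassFail));
```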
+ declare const AggregateDecisionVocabBarSchema: z.ZodObject<{
+     architecture: z.ZodEnum<["single-pass", "two-pass"]>;
+     required: z.ZodNumber;
+     median_produced: z.ZodNumber;
+     passed: z.ZodBoolean;
+ }, "strip", z.ZodTypeAny, {
+     required: number;
+     architecture: "single-pass" | "two-pass";
+     median_produced: number;
+     passed: boolean;
+ }, {
+     required: number;
+     architecture: "single-pass" | "two-pass";
+     median_produced: number;
+     passed: boolean;
+ }>;
+ declare const AggregateCalibrationReceiptSchema: z.ZodObject<{
+     schema_version: z.ZodLiteral<1>;
+     receipt_kind: z.ZodLiteral<"aggregate">;
+     profile_name: z.ZodString;
+     status: z.ZodEnum<["trusted_baseline", "conditional_pass", "failed", "comparison_only"]>;
+     model: z.ZodString;
+     architecture: z.ZodEnum<["single-pass", "two-pass"]>;
+     fixture: z.ZodString;
+     fixture_total_claims: z.ZodNumber;
+     fixture_good_claims: z.ZodNumber;
+     fixture_bad_claims: z.ZodNumber;
+     runs_count: z.ZodNumber;
+     run_files: z.ZodArray<z.ZodString, "many">;
+     aggregated_at: z.ZodString;
+     research_os_version: z.ZodString;
+     good_fp_count: z.ZodObject<{
+         median: z.ZodNumber;
+         min: z.ZodNumber;
+         max: z.ZodNumber;
+         values: z.ZodArray<z.ZodNumber, "many">;
+     }, "strip", z.ZodTypeAny, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }>;
+     any_flag_recall_ratio: z.ZodObject<{
+         median: z.ZodNumber;
+         min: z.ZodNumber;
+         max: z.ZodNumber;
+         values: z.ZodArray<z.ZodNumber, "many">;
+     }, "strip", z.ZodTypeAny, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }>;
+     strict_recall_ratio: z.ZodObject<{
+         median: z.ZodNumber;
+         min: z.ZodNumber;
+         max: z.ZodNumber;
+         values: z.ZodArray<z.ZodNumber, "many">;
+     }, "strip", z.ZodTypeAny, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }>;
+     decisions_produced_count: z.ZodObject<{
+         median: z.ZodNumber;
+         min: z.ZodNumber;
+         max: z.ZodNumber;
+         values: z.ZodArray<z.ZodNumber, "many">;
+     }, "strip", z.ZodTypeAny, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }>;
+     runtime_ms: z.ZodObject<{
+         median: z.ZodNumber;
+         min: z.ZodNumber;
+         max: z.ZodNumber;
+         values: z.ZodArray<z.ZodNumber, "many">;
+     }, "strip", z.ZodTypeAny, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }>;
+     empty_or_malformed_responses: z.ZodObject<{
+         median: z.ZodNumber;
+         min: z.ZodNumber;
+         max: z.ZodNumber;
+         values: z.ZodArray<z.ZodNumber, "many">;
+     }, "strip", z.ZodTypeAny, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }>;
+     per_category_any_flag: z.ZodRecord<z.ZodString, z.ZodObject<{
+         median_ratio: z.ZodNumber;
+         min_ratio: z.ZodNumber;
+         max_ratio: z.ZodNumber;
+         total: z.ZodNumber;
+         per_run_ratios: z.ZodArray<z.ZodNumber, "many">;
+     }, "strip", z.ZodTypeAny, {
+         median_ratio: number;
+         min_ratio: number;
+         max_ratio: number;
+         total: number;
+         per_run_ratios: number[];
+     }, {
+         median_ratio: number;
+         min_ratio: number;
+         max_ratio: number;
+         total: number;
+         per_run_ratios: number[];
+     }>>;
+     per_category_strict: z.ZodRecord<z.ZodString, z.ZodObject<{
+         median_ratio: z.ZodNumber;
+         min_ratio: z.ZodNumber;
+         max_ratio: z.ZodNumber;
+         total: z.ZodNumber;
+         per_run_ratios: z.ZodArray<z.ZodNumber, "many">;
+     }, "strip", z.ZodTypeAny, {
+         median_ratio: number;
+         min_ratio: number;
+         max_ratio: number;
+         total: number;
+         per_run_ratios: number[];
+     }, {
+         median_ratio: number;
+         min_ratio: number;
+         max_ratio: number;
+         total: number;
+         per_run_ratios: number[];
+     }>>;
+     decision_vocabulary: z.ZodRecord<z.ZodString, z.ZodObject<{
+         median: z.ZodNumber;
+         min: z.ZodNumber;
+         max: z.ZodNumber;
+         values: z.ZodArray<z.ZodNumber, "many">;
+     }, "strip", z.ZodTypeAny, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }>>;
+     decision_vocab_bar: z.ZodObject<{
+         architecture: z.ZodEnum<["single-pass", "two-pass"]>;
+         required: z.ZodNumber;
+         median_produced: z.ZodNumber;
+         passed: z.ZodBoolean;
+     }, "strip", z.ZodTypeAny, {
+         required: number;
+         architecture: "single-pass" | "two-pass";
+         median_produced: number;
+         passed: boolean;
+     }, {
+         required: number;
+         architecture: "single-pass" | "two-pass";
+         median_produced: number;
+         passed: boolean;
+     }>;
+     unreachable_decisions: z.ZodArray<z.ZodString, "many">;
+     pass_fail: z.ZodObject<{
+         fp_ceiling: z.ZodEnum<["PASS", "FAIL"]>;
+         any_flag_recall_floor: z.ZodEnum<["PASS", "FAIL"]>;
+         per_category_any_flag_floor: z.ZodEnum<["PASS", "FAIL"]>;
+         strict_recall_floor: z.ZodEnum<["PASS", "FAIL"]>;
+         decision_vocab_completeness: z.ZodEnum<["PASS", "FAIL"]>;
+         latency_soft: z.ZodEnum<["PASS", "WARN"]>;
+         latency_hard: z.ZodEnum<["PASS", "FAIL"]>;
+         empty_or_malformed: z.ZodEnum<["PASS", "FAIL"]>;
+         overall: z.ZodEnum<["PASS", "FAIL"]>;
+     }, "strip", z.ZodTypeAny, {
+         fp_ceiling: "PASS" | "FAIL";
+         any_flag_recall_floor: "PASS" | "FAIL";
+         per_category_any_flag_floor: "PASS" | "FAIL";
+         strict_recall_floor: "PASS" | "FAIL";
+         decision_vocab_completeness: "PASS" | "FAIL";
+         latency_soft: "PASS" | "WARN";
+         latency_hard: "PASS" | "FAIL";
+         empty_or_malformed: "PASS" | "FAIL";
+         overall: "PASS" | "FAIL";
+     }, {
+         fp_ceiling: "PASS" | "FAIL";
+         any_flag_recall_floor: "PASS" | "FAIL";
+         per_category_any_flag_floor: "PASS" | "FAIL";
+         strict_recall_floor: "PASS" | "FAIL";
+         decision_vocab_completeness: "PASS" | "FAIL";
+         latency_soft: "PASS" | "WARN";
+         latency_hard: "PASS" | "FAIL";
+         empty_or_malformed: "PASS" | "FAIL";
+         overall: "PASS" | "FAIL";
+     }>;
+     recurring_bar_failures: z.ZodArray<z.ZodString, "many">;
+     notes: z.ZodArray<z.ZodString, "many">;
+ }, "strip", z.ZodTypeAny, {
+     status: "trusted_baseline" | "conditional_pass" | "failed" | "comparison_only";
+     architecture: "single-pass" | "two-pass";
+     schema_version: 1;
+     receipt_kind: "aggregate";
+     profile_name: string;
+     model: string;
+     fixture: string;
+     fixture_total_claims: number;
+     fixture_good_claims: number;
+     fixture_bad_claims: number;
+     runs_count: number;
+     run_files: string[];
+     aggregated_at: string;
+     research_os_version: string;
+     good_fp_count: {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     };
+     any_flag_recall_ratio: {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     };
+     strict_recall_ratio: {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     };
+     decisions_produced_count: {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     };
+     runtime_ms: {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     };
+     empty_or_malformed_responses: {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     };
+     per_category_any_flag: Record<string, {
+         median_ratio: number;
+         min_ratio: number;
+         max_ratio: number;
+         total: number;
+         per_run_ratios: number[];
+     }>;
+     per_category_strict: Record<string, {
+         median_ratio: number;
+         min_ratio: number;
+         max_ratio: number;
+         total: number;
+         per_run_ratios: number[];
+     }>;
+     decision_vocabulary: Record<string, {
+         median: number;
+         min: number;
+         max: number;
+         values: number[];
+     }>;
+     decision_vocab_bar: {
+         required: number;
+         architecture: "single-pass" | "two-pass";
+         median_produced: number;
+         passed: boolean;
+     };
+     unreachable_decisions: string[];
+     pass_fail: {
+         fp_ceiling: "PASS" | "FAIL";
+         any_flag_recall_floor: "PASS" | "FAIL";
+         per_category_any_flag_floor: "PASS" | "FAIL";
+         strict_recall_floor: "PASS" | "FAIL";
+         decision_vocab_completeness: "PASS" | "FAIL";
+         latency_soft: "PASS" | "WARN";
+         latency_hard: "PASS" | "FAIL";
+         empty_or_malformed: "PASS" | "FAIL";
+         overall: "PASS" | "FAIL";
+     };
+     recurring_bar_failures: string[];
+     notes: string[];
+ }, {
+     status: "trusted_baseline" | "conditional_pass" | "failed" | "comparison_only";
+     architecture: "single-pass" | "two-pass";
+     schema_version: 1;
+     receipt_kind: "aggregate";
+     profile_name: string;
+     model: string;
+     fixture: string;
+     fixture_total_claims: number;
+     fixture_good_claims: number;
+     fixture_bad_claims: number;
+     runs_count: number;
+     run_files: string[];
+     aggregated_at: string;
+     research_os_version: string;
+     good_fp_count: {
+ median: number;
427
+ min: number;
428
+ max: number;
429
+ values: number[];
430
+ };
431
+ any_flag_recall_ratio: {
432
+ median: number;
433
+ min: number;
434
+ max: number;
435
+ values: number[];
436
+ };
437
+ strict_recall_ratio: {
438
+ median: number;
439
+ min: number;
440
+ max: number;
441
+ values: number[];
442
+ };
443
+ decisions_produced_count: {
444
+ median: number;
445
+ min: number;
446
+ max: number;
447
+ values: number[];
448
+ };
449
+ runtime_ms: {
450
+ median: number;
451
+ min: number;
452
+ max: number;
453
+ values: number[];
454
+ };
455
+ empty_or_malformed_responses: {
456
+ median: number;
457
+ min: number;
458
+ max: number;
459
+ values: number[];
460
+ };
461
+ per_category_any_flag: Record<string, {
462
+ median_ratio: number;
463
+ min_ratio: number;
464
+ max_ratio: number;
465
+ total: number;
466
+ per_run_ratios: number[];
467
+ }>;
468
+ per_category_strict: Record<string, {
469
+ median_ratio: number;
470
+ min_ratio: number;
471
+ max_ratio: number;
472
+ total: number;
473
+ per_run_ratios: number[];
474
+ }>;
475
+ decision_vocabulary: Record<string, {
476
+ median: number;
477
+ min: number;
478
+ max: number;
479
+ values: number[];
480
+ }>;
481
+ decision_vocab_bar: {
482
+ required: number;
483
+ architecture: "single-pass" | "two-pass";
484
+ median_produced: number;
485
+ passed: boolean;
486
+ };
487
+ unreachable_decisions: string[];
488
+ pass_fail: {
489
+ fp_ceiling: "PASS" | "FAIL";
490
+ any_flag_recall_floor: "PASS" | "FAIL";
491
+ per_category_any_flag_floor: "PASS" | "FAIL";
492
+ strict_recall_floor: "PASS" | "FAIL";
493
+ decision_vocab_completeness: "PASS" | "FAIL";
494
+ latency_soft: "PASS" | "WARN";
495
+ latency_hard: "PASS" | "FAIL";
496
+ empty_or_malformed: "PASS" | "FAIL";
497
+ overall: "PASS" | "FAIL";
498
+ };
499
+ recurring_bar_failures: string[];
500
+ notes: string[];
501
+ }>;
502
+ type AggregateMetric = z.infer<typeof AggregateMetricSchema>;
503
+ type PerCategoryAggregateEntry = z.infer<typeof PerCategoryAggregateEntrySchema>;
504
+ type PerCategoryAggregate = z.infer<typeof PerCategoryAggregateSchema>;
505
+ type AggregatePassFail = z.infer<typeof AggregatePassFailSchema>;
506
+ type AggregateDecisionVocabBar = z.infer<typeof AggregateDecisionVocabBarSchema>;
507
+ type AggregateCalibrationReceipt = z.infer<typeof AggregateCalibrationReceiptSchema>;
508
+
509
+ export { type AggregateCalibrationReceipt, AggregateCalibrationReceiptSchema, type AggregateDecisionVocabBar, AggregateDecisionVocabBarSchema, type AggregateMetric, AggregateMetricSchema, type AggregatePassFail, AggregatePassFailSchema, type PerCategoryAggregate, type PerCategoryAggregateEntry, PerCategoryAggregateEntrySchema, PerCategoryAggregateSchema };
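
Several receipt fields above (`good_fp_count`, `any_flag_recall_ratio`, `runtime_ms`, and others) share one shape: `{ median, min, max, values }`. The diff shows only the declared types, not how the package fills them in, so the following is a minimal sketch of how per-run values could be reduced to that shape. The `AggregateMetric` interface here is a hand-written local mirror of the inferred type, `summarizeRuns` is a hypothetical helper name, and taking the mean of the two middle values for an even number of runs is an assumption rather than confirmed package behavior.

```typescript
// Local mirror of the { median, min, max, values } shape from the diff.
// Written by hand for illustration; not imported from the package.
interface AggregateMetric {
  median: number;
  min: number;
  max: number;
  values: number[];
}

// Hypothetical helper: reduce one number per calibration run to an
// AggregateMetric. Assumption: for an even run count, the median is the
// mean of the two middle values.
function summarizeRuns(values: number[]): AggregateMetric {
  if (values.length === 0) {
    throw new Error("summarizeRuns needs at least one run value");
  }
  // Sort a copy numerically so the caller's run order is preserved in `values`.
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  return {
    median,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    values: [...values],
  };
}

// Example: three runs of a reviewer profile, runtime in milliseconds.
const runtime = summarizeRuns([4200, 3900, 5100]);
console.log(runtime.median, runtime.min, runtime.max); // 4200 3900 5100
```

Keeping the raw `values` array next to the summary statistics means a reader of the receipt can recompute the median or spot a single outlier run without going back to the individual run files.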