@mcptoolshop/research-os 0.4.0 → 0.5.0

package/CHANGELOG.md CHANGED
@@ -2,6 +2,140 @@
2
2
 
3
3
  All notable changes to `research-os` are documented here.
4
4
 
5
+ ## [0.5.0] — 2026-05-10 — reviewer calibration as durable trust contract
6
+
7
+ ### F-50 stabilization (Session 5)
8
+
9
+ - **Multi-run calibration:** `--runs <n>` flag on the calibration harness. Per-run
10
+ receipts persist under `<profile>/runs/run-NNN.json`; aggregate receipts at
11
+ `<profile>/seeded-v1.{json,md}` use median-based PASS/FAIL rules.
12
+ - **Median aggregation rules:** FP ceiling (median ≤1 AND max ≤2), any-flag
13
+ recall (median ≥65%), per-category any-flag (median ≥50% per category with
14
+ total ≥2), strict recall (median ≥20%), decision vocab (architecture-aware
15
+ median ≥3 for two-pass / ≥4 for single-pass), latency hard (every run ≤20 min,
16
+ enforced via max), empty/malformed (every run =0, enforced via max).
17
+ - **Recurring-failure demotion:** a profile that passes the median rules but FAILed the
18
+ same bar in ≥⌈N/2⌉ individual runs is demoted from `trusted_baseline` to
19
+ `conditional_pass`. This prevents one lucky median from masking a systemic bar weakness.
20
+ - **Single-run mode preserved:** harness without `--runs` (or `--runs 1`) writes
21
+ the existing single-run receipt directly. `comparison_only` profiles stay single-run.
22
+ - **New source files:** `src/calibration/aggregate-receipt-schema.ts` (aggregate Zod schema
23
+ with `receipt_kind: 'aggregate'` discriminator) and `src/calibration/aggregate.ts`
24
+ (pure helpers: `median`, `aggregateMetric`, `aggregatePerCategoryRecall`,
25
+ `aggregateDecisionVocabulary`, `computeAggregatePassFail`, `computeRecurringBarFailures`,
26
+ `computeAggregateStatusLabel`, `aggregateReceipts`, `buildAggregateReceiptMarkdown`).
27
+
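The median rules and the recurring-failure demotion listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/calibration/aggregate.ts`: the `RunMetrics` shape, the `runPasses` and `evaluate` functions, and the bar names `strict_recall_floor`, `latency_hard`, and `empty_malformed` are hypothetical, and the per-run thresholds are assumed to match the median thresholds. The per-category any-flag bar is left out to keep the sketch short.

```typescript
// Hypothetical per-run metrics; field names are illustrative, not the real receipt schema.
interface RunMetrics {
  falsePositives: number;
  anyFlagRecall: number;     // percent
  strictRecall: number;      // percent
  decisionsProduced: number; // distinct decisions out of 6
  latencyMinutes: number;
  emptyOrMalformed: number;
}

type Architecture = "single-pass" | "two-pass";
type Bar =
  | "fp_ceiling"
  | "any_flag_recall_floor"
  | "strict_recall_floor"          // hypothetical name
  | "decision_vocab_completeness"
  | "latency_hard"                 // hypothetical name
  | "empty_malformed";             // hypothetical name

// Median of a non-empty list (mean of the two middle values for even lengths).
function median(values: number[]): number {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

const vocabFloor = (arch: Architecture): number => (arch === "two-pass" ? 3 : 4);

// Per-run verdict for each bar (assumption: same thresholds as the median rules).
function runPasses(m: RunMetrics, arch: Architecture): Record<Bar, boolean> {
  return {
    fp_ceiling: m.falsePositives <= 1,
    any_flag_recall_floor: m.anyFlagRecall >= 65,
    strict_recall_floor: m.strictRecall >= 20,
    decision_vocab_completeness: m.decisionsProduced >= vocabFloor(arch),
    latency_hard: m.latencyMinutes <= 20,
    empty_malformed: m.emptyOrMalformed === 0,
  };
}

function evaluate(runs: RunMetrics[], arch: Architecture) {
  const col = (pick: (m: RunMetrics) => number): number[] => runs.map(pick);
  const fp = col((m) => m.falsePositives);
  const aggregate: Record<Bar, boolean> = {
    fp_ceiling: median(fp) <= 1 && Math.max(...fp) <= 2,
    any_flag_recall_floor: median(col((m) => m.anyFlagRecall)) >= 65,
    strict_recall_floor: median(col((m) => m.strictRecall)) >= 20,
    decision_vocab_completeness: median(col((m) => m.decisionsProduced)) >= vocabFloor(arch),
    latency_hard: Math.max(...col((m) => m.latencyMinutes)) <= 20,      // every run
    empty_malformed: Math.max(...col((m) => m.emptyOrMalformed)) === 0, // every run
  };
  const bars = Object.keys(aggregate) as Bar[];
  const recurringThreshold = Math.ceil(runs.length / 2);
  // A bar is "recurring" when it failed in at least ceil(N/2) individual runs.
  const recurring = bars.filter(
    (bar) => runs.filter((r) => !runPasses(r, arch)[bar]).length >= recurringThreshold,
  );
  const aggregatePass = bars.every((bar) => aggregate[bar]);
  return { aggregatePass, recurring, demoted: aggregatePass && recurring.length > 0 };
}

// Made-up data: four runs that differ only in any-flag recall.
const base = { falsePositives: 0, strictRecall: 30, decisionsProduced: 3, latencyMinutes: 10, emptyOrMalformed: 0 };
const runs: RunMetrics[] = [60, 60, 70, 80].map((anyFlagRecall) => ({ ...base, anyFlagRecall }));
const result = evaluate(runs, "two-pass");
console.log(result);
```

With these invented numbers the median any-flag recall is exactly 65%, so the aggregate passes, yet the bar failed in two of four runs. That is the case the demotion rule targets.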
28
+ ### Frictions closed (Session 5)
29
+
30
+ - **F-50** — Per-category any-flag floor was statistically unreliable at N=2–3
31
+ seeds per category (one missed claim drops a category from 67% to 33%). Median
32
+ aggregation across 3 runs absorbs single-run variance without lowering the bar.
33
+
34
+ ### Canonical receipt statuses (Session 5)
35
+
36
+ - `hermes-two-pass` (aggregate, 3 runs): **`failed`** — escalation. Run 1 PASS (FP=0,
37
+ any-flag=85%, decisions=3/6); runs 2–3 FAIL (any-flag 62%/46%, decisions=2/6).
38
+ Recurring failures: `any_flag_recall_floor`, `per_category_any_flag_floor`,
39
+ `decision_vocab_completeness`. Thesis NOT proven at N=3. Advisor decides path.
40
+ - `mistral-nemo-two-pass` (aggregate, 3 runs): **`conditional_pass`** — aggregate PASS
41
+ (median FP=1, max=2 at ceiling; median any-flag=69%; no recurring failures).
42
+ Run 1 FAIL (FP=2/5); runs 2–3 PASS.
43
+ - `hermes-single-pass` (single-run): **`comparison_only`** — auto-assigned.
44
+
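As an arithmetic check on the `hermes-two-pass` entry above, the sketch below takes the per-run any-flag recall and decision counts quoted there and applies the median floors from the aggregation rules (any-flag ≥65%, two-pass decisions ≥3). The `median` helper is an illustrative stand-in, not the shipped implementation.

```typescript
// Median of a non-empty list (mean of the two middle values for even lengths).
function median(values: number[]): number {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Per-run figures quoted in the hermes-two-pass entry (runs 1, 2, 3).
const anyFlagRecall = [85, 62, 46]; // percent
const decisions = [3, 2, 2];        // distinct decisions out of 6

const anyFlagMedian = median(anyFlagRecall); // 62
const decisionsMedian = median(decisions);   // 2

const anyFlagVerdict = anyFlagMedian >= 65 ? "PASS" : "FAIL";
const decisionsVerdict = decisionsMedian >= 3 ? "PASS" : "FAIL";

console.log(`any-flag median ${anyFlagMedian}%: ${anyFlagVerdict}`);      // FAIL
console.log(`decisions median ${decisionsMedian}/6: ${decisionsVerdict}`); // FAIL
```

Both bars also failed in two of the three individual runs, which meets the ⌈3/2⌉ = 2 threshold and matches their listing as recurring failures.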
45
+ v0.5.0 makes reviewer calibration durable. A reviewer profile is not trusted because
46
+ it ran once; it earns a status through structured seeded-failure receipts and
47
+ multi-run aggregation.
48
+
49
+ **Product guardrail:** research-os can now refuse to trust a reviewer profile when
50
+ repeated seeded failures do not support trust.
51
+
52
+ **No trusted baseline admitted.** The three canonical receipts shipped with v0.5.0:
53
+
54
+ | Profile | Status | Notes |
55
+ |---|---|---|
56
+ | `hermes-two-pass` | `failed` | Aggregate, 3 runs. Recurring failures: any-flag recall, per-category floor, decision vocab. |
57
+ | `mistral-nemo-two-pass` | `conditional_pass` | Aggregate, 3 runs. FP at ceiling (median=1/max=2); no recurring failures. |
58
+ | `hermes-single-pass` | `comparison_only` | Auto-assigned; architectural comparison only. |
59
+
60
+ `trusted_baseline` is earned, not assumed. Single-run receipts remain available for quick
61
+ local checks; aggregate receipts (3+ runs, median-based bars) are the trust artifact.
62
+
63
+ ### Added
64
+
65
+ - **`profile?: string`** optional field on `ClaimReviewSchema`. Per-claim review
66
+ records can now carry the profile name that produced them. Existing records
67
+ without `profile` parse cleanly.
68
+ - **Structured calibration receipts** at `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`.
69
+ Zod-validated JSON + operator-readable Markdown sibling. `schema_version: 1`.
70
+ - **`research-os` source `--profile <name>` flag** on `scripts/reviewer-calibration.mjs`.
71
+ Drives output path and persists profile name on claim-review records.
72
+ - **Architecture-aware decision-vocabulary bar:** single-pass ≥ 4/6 decisions;
73
+ two-pass ≥ 3/6 decisions. Two-pass acknowledges that `narrow_critic` severity
74
+ escalation collapses the `needs_human_review` path.
75
+ - **Four status labels:** `trusted_baseline`, `conditional_pass`, `failed`,
76
+ `comparison_only`. `trusted_baseline` requires the canonical Hermes two-pass
77
+ profile + all bars pass + zero false positives.
78
+ - **`review-promote` receipt lookup is now pack-relative** — reads from
79
+ `<pack>/calibration/reviewer-profiles/<profile>/seeded-v1.json`, not
80
+ `process.cwd()`. Operators running `review-promote --pack <non-cwd-pack>`
81
+ now correctly resolve the receipt from the specified pack.
82
+ - **`review-promote` fails visibly on invalid receipts.** Previously a malformed
83
+ receipt was silently skipped; now any schema mismatch or JSON parse failure
84
+ throws `Invalid calibration receipt at <path>: <reason>`. Missing receipts
85
+ remain a no-op.
86
+ - **Three canonical receipts shipped:** `hermes-two-pass`, `mistral-nemo-two-pass`,
87
+ `hermes-single-pass` — all under `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`.
88
+ These are the v0.5.0 proof artifacts. See Session 4 status note below.
89
+
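The two `review-promote` changes in the list above (pack-relative lookup and visible failure on invalid receipts) might look roughly like this. It is a sketch under stated assumptions: `loadReceipt` and the two-field `Receipt` shape are hypothetical, and a hand-written check stands in for the real Zod validation. Only the path layout and the error message format come from the changelog.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical minimal receipt shape; the shipped schema is richer.
interface Receipt {
  schema_version: 1;
  status: string;
}

function loadReceipt(packDir: string, profile: string): Receipt | undefined {
  // Resolve against the pack directory, never against process.cwd().
  const path = join(packDir, "calibration", "reviewer-profiles", profile, "seeded-v1.json");
  if (!existsSync(path)) return undefined; // missing receipt stays a no-op

  let parsed: unknown;
  try {
    parsed = JSON.parse(readFileSync(path, "utf8"));
  } catch (err) {
    throw new Error(`Invalid calibration receipt at ${path}: ${(err as Error).message}`);
  }
  const candidate = parsed as Partial<Receipt> | null;
  if (
    typeof candidate !== "object" ||
    candidate === null ||
    candidate.schema_version !== 1 ||
    typeof candidate.status !== "string"
  ) {
    throw new Error(`Invalid calibration receipt at ${path}: schema mismatch`);
  }
  return candidate as Receipt;
}

// Usage against a throwaway pack directory.
const pack = mkdtempSync(join(tmpdir(), "pack-"));
const profileDir = join(pack, "calibration", "reviewer-profiles", "demo-profile");
mkdirSync(profileDir, { recursive: true });
writeFileSync(join(profileDir, "seeded-v1.json"), JSON.stringify({ schema_version: 1, status: "failed" }));

const found = loadReceipt(pack, "demo-profile");      // parsed receipt
const missing = loadReceipt(pack, "no-such-profile"); // undefined

writeFileSync(join(profileDir, "seeded-v1.json"), "{ not json");
let invalidMessage = "";
try {
  loadReceipt(pack, "demo-profile");
} catch (err) {
  invalidMessage = (err as Error).message;
}
console.log(found?.status, missing, invalidMessage);
```

Treating a missing file and a malformed file differently is the point: an absent receipt means no calibration was run, while a malformed one suggests something went wrong and should not be skipped silently.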
90
+ ### Frictions closed
91
+
92
+ - **F-48** — Structured calibration receipt persistence. The harness previously
93
+ wrote raw artifacts but no comparable receipt; recall metrics lived in
94
+ `console.log` only.
95
+ - **F-49** — Decision-vocabulary bar was miscalibrated against two-pass
96
+ architecture. Bar is now architecture-aware.
97
+
98
+ ### Compatibility
99
+
100
+ - All 4 frozen packs verify byte-identical against v0.3.3 baselines.
101
+ - `ClaimReviewSchema.profile` is optional (Zod `.optional()`, no `.default()`).
102
+ Existing pack records parse cleanly. Frozen pack receipts unchanged.
103
+ - No gate-law, freeze-law, or synthesis-law changes.
104
+
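The compatibility note about `ClaimReviewSchema.profile` being optional with no default can be illustrated without any dependencies. The sketch below is a hand-written stand-in for the Zod schema, not the schema itself: `parseClaimReview` and the `claim_id` field are hypothetical, and only `profile?: string` comes from the changelog.

```typescript
// Hypothetical, trimmed-down claim-review record.
interface ClaimReview {
  claim_id: string;
  profile?: string;
}

function parseClaimReview(input: unknown): ClaimReview {
  if (typeof input !== "object" || input === null) {
    throw new Error("expected an object");
  }
  const record = input as Record<string, unknown>;
  const claimId = record["claim_id"];
  const profile = record["profile"];
  if (typeof claimId !== "string") {
    throw new Error("claim_id must be a string");
  }
  if (profile !== undefined && typeof profile !== "string") {
    throw new Error("profile must be a string when present");
  }
  // No default is injected: an absent profile stays absent in the parsed record.
  return profile === undefined ? { claim_id: claimId } : { claim_id: claimId, profile };
}

const legacy = parseClaimReview({ claim_id: "c-001" });
const tagged = parseClaimReview({ claim_id: "c-002", profile: "hermes-two-pass" });

console.log("profile" in legacy, tagged.profile); // false hermes-two-pass
```

Because no default is filled in, a record written before v0.5.0 serializes back to the same JSON after parsing, which is in keeping with the statement that frozen pack receipts are unchanged.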
105
+ ### Out of scope
106
+
107
+ - `seeded-v1` cannot test `needs_contradiction_mapping` (no `unmapped_contradiction`
108
+ seeded). The receipt's `unreachable_decisions` array discloses this honestly.
109
+ Fixture expansion deferred to v0.6.
110
+ - `phi3:14b` calibration deferred to a later experiment.
111
+
112
+ ### Session 4 single-run evidence (context for F-50 investigation)
113
+
114
+ The three canonical receipts initially committed in Session 4 reflect honest single-run
115
+ evidence. `hermes-single-pass` produced `comparison_only` (auto-assigned). `hermes-two-pass`
116
+ and `mistral-nemo-two-pass` both produced `failed` across two single-run attempts each,
117
+ with high per-run variance:
118
+
119
+ - `hermes-two-pass` run 1: `per_category_any_flag_floor FAIL` (valid_but_low_value 1/3);
120
+ run 2: `decision_vocab_completeness FAIL` (2/6 decisions produced).
121
+ - `mistral-nemo-two-pass` run 1: `per_category_any_flag_floor FAIL` (unsupported_claim 1/3);
122
+ run 2: `fp_ceiling FAIL` (2/5 FP) + `per_category_any_flag_floor FAIL`.
123
+
124
+ This per-run nondeterminism (1–2 claim variance per category at N=2–3 seeds) was the
125
+ root cause of F-50. Session 5 resolved it via multi-run median aggregation. See
126
+ "Canonical receipt statuses (Session 5)" above for the aggregate outcome. The initial
127
+ Session 4 receipts were overwritten by the Session 5 aggregate re-runs.
128
+
129
+ ### Test surface
130
+
131
+ - 620 → 646 tests in Session 3 (+26).
132
+ - Session 4 adds 5 regression tests (646 → 651): pack-relative lookup,
133
+ cwd-irrelevance, invalid-receipt fail (×2), missing-receipt no-op.
134
+ - Session 5 adds 20 aggregate-helper tests (651 → 671): median, aggregateMetric,
135
+ per-category aggregation, decision-vocab aggregation, PASS/FAIL bars,
136
+ recurring-failure demotion, status-label logic, aggregateReceipts round-trip.
137
+ - **Cumulative (Experiments 4+5):** 570 → 671 tests (+101).
138
+
5
139
  ## [0.4.0] — 2026-05-10 — source-truth discipline
6
140
 
7
141
  v0.4.0 makes source identity durable. Deterministic source-type rules
package/README.es.md CHANGED
@@ -7,7 +7,7 @@
7
7
  </p>
8
8
 
9
9
  <p align="center">
10
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.4.0"><img src="https://img.shields.io/badge/version-0.4.0-blue" alt="version 0.4.0"></a>
10
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
11
11
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
12
12
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
13
13
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,9 +149,32 @@ Esta es la alternativa estructural a *búsqueda → resumen → informe detallad
149
149
 
150
150
  `research-os` es una herramienta de línea de comandos que funciona principalmente de forma local. Lee y escribe archivos dentro del directorio del paquete de investigación que se le indica, y (cuando se utiliza `gather`) realiza solicitudes HTTP salientes para obtener las URL de origen que se proporcionan. No ejecuta un servidor, no acepta conexiones entrantes, no almacena credenciales ni envía datos de telemetría. No se escriben secretos en los artefactos del paquete. Consulte [SECURITY.md](SECURITY.md) para obtener la política de notificación de vulnerabilidades.
151
151
 
152
+ ## Calibración de revisores
153
+
154
+ La versión v0.5.0 hace que la calibración de revisores sea más robusta. Un perfil de revisor no se considera confiable simplemente porque se ejecutó una vez; obtiene un estado a través de informes estructurados de fallos simulados y agregación de múltiples ejecuciones.
155
+
156
+ **Actualmente, ningún perfil se considera como una "línea de base confiable".** Los informes canónicos en el repositorio muestran `hermes-two-pass=failed`, `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. Esto es intencional: la confianza se gana a través de evidencia repetida de fallos simulados, no se asume.
157
+
158
+ Los informes de calibración se encuentran en `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`. Cada informe registra el resultado PASS/FAIL frente a siete umbrales, asigna una de cuatro etiquetas de estado (`trusted_baseline`, `conditional_pass`, `failed`, `comparison_only`), y declara honestamente qué no puede probar el conjunto de casos sembrados (`needs_contradiction_mapping` no es alcanzable desde `seeded-v1`). Consulte [CHANGELOG.md](CHANGELOG.md).
159
+
160
+ ```bash
161
+ # Single-run calibration (quick local check)
162
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
163
+
164
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
165
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
166
+
167
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
168
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
169
+ ```
170
+
171
+ Cuando se utiliza `--runs <n>`, los informes de cada ejecución se escriben en `<profile>/runs/run-NNN.json` y se escribe un informe agregado (con umbrales basados en la mediana y detección de fallos recurrentes) en `<profile>/seeded-v1.{json,md}`. El informe agregado incluye `receipt_kind: 'aggregate'` para distinguirlo de los informes de ejecución individual. El modo de ejecución individual (`--runs 1` u omitido) conserva el comportamiento de escritura directa existente.
172
+
152
173
  ## Estado
153
174
 
154
- **v0.4.0** — Publicada en npm como `@mcptoolshop/research-os@0.4.0`, el 10 de mayo de 2026. La versión 0.4.0 garantiza la durabilidad de la identidad de la fuente. Las reglas deterministas del tipo de fuente gestionan la mayoría repetible, los registros de anulación preservan las correcciones del operador a través de la re-compilación, y la herramienta de auditoría de "tarjeta de fuente" reemplaza las comprobaciones de deriva de los scripts temporales con una interfaz de línea de comandos (CLI) de primera clase. Incluye: un clasificador centralizado de tipos de fuente (Componente B `classifySourceType`, 11 proveedores canónicos, `source-type-rules.json`); un registro de anulación de "tarjeta de fuente" (Componente A `source-card-overrides.jsonl`, subcomandos `validate` y `list`); y una herramienta de auditoría de "tarjeta de fuente" para la línea de comandos (Componente D `research-os source-card audit --pack <dir>`, 7 tipos de hallazgos, artefactos JSON y Markdown, opciones `--apply --from` para aplicar la ruta). Corrección cosmética F-46: los archivos de manifiesto ahora indican la versión binaria actual en lugar de la versión congelada en `research.yaml` durante la inicialización de la compilación. **No se realizaron cambios en los mecanismos de validación, congelación ni en las reglas de síntesis. Los cuatro paquetes congelados existentes verifican la identidad de los bytes de forma idéntica.** 620/620 pruebas de vitest superadas. Consulte el archivo [CHANGELOG.md](CHANGELOG.md) y la página del manual de la herramienta de auditoría de "tarjeta de fuente" ([https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/)).
175
+ **v0.5.0** — publicado en npm como `@mcptoolshop/research-os@0.5.0`, 10 de mayo de 2026. La versión v0.5.0 hace que la calibración de revisores sea más robusta. Un perfil de revisor no se considera confiable simplemente porque se ejecutó una vez; obtiene un estado a través de informes estructurados de fallos simulados y agregación de múltiples ejecuciones. Incluye: esquema de informe de calibración estructurado (`seeded-v1.{json,md}`, validado con Zod, cuatro etiquetas de estado); entorno de ejecución para múltiples ejecuciones (`--runs <n>`, aislamiento por ejecución, umbrales PASS/FAIL basados en la mediana, degradación por fallos recurrentes); umbral de vocabulario de decisiones adaptado a la arquitectura; búsqueda de informes relativa al paquete en `review-promote`. **No se admite ninguna línea de base confiable:** `hermes-two-pass=failed` (agregado, 3 ejecuciones), `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. research-os ahora puede negarse a confiar en un perfil de revisor cuando los fallos simulados repetidos no respaldan la confianza. **No se realizan cambios en las puertas, la congelación o las leyes de síntesis. Los cuatro paquetes congelados existentes verifican la integridad de los bytes.** 671/671 pruebas de vitest superadas. Consulte [CHANGELOG.md](CHANGELOG.md).
176
+
177
+ **v0.4.0** — publicado en npm como `@mcptoolshop/research-os@0.4.0`, 10 de mayo de 2026. La versión v0.4.0 hace que la identidad de la fuente sea más robusta. Las reglas deterministas del tipo de fuente gestionan la mayoría repetible, los registros de anulación preservan las correcciones del operador a través de la recolección, y la "auditoría de la tarjeta de origen" reemplaza las comprobaciones de deriva de scripts con una interfaz de línea de comandos de primera clase. Incluye: clasificador centralizado de tipo de fuente (Componente B — `classifySourceType`, 11 proveedores canónicos, `source-type-rules.json`); registro de anulación de la tarjeta de origen (Componente A — `source-card-overrides.jsonl`, subcomandos `validate` + `list`); y CLI de auditoría de la tarjeta de origen (Componente D — `research-os source-card audit --pack <dir>`, 7 tipos de hallazgos, artefactos JSON + Markdown, opciones `--apply --from` para aplicar la ruta). Corrección cosmética F-46: los manifiestos de los paquetes ahora imprimen la versión binaria en vivo en lugar de la versión congelada en `research.yaml` durante la inicialización del paquete. **No se realizan cambios en las puertas, la congelación o las leyes de síntesis. Los cuatro paquetes congelados existentes verifican la integridad de los bytes.** 620/620 pruebas de vitest superadas. Consulte [CHANGELOG.md](CHANGELOG.md) y la [página del manual de auditoría de la tarjeta de origen](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/).
155
178
 
156
179
  **v0.3.3** — Publicada en npm como `@mcptoolshop/research-os@0.3.3`, 10 de mayo de 2026. Incluye mejoras en la claridad de la semántica de las "gates" obtenidas gracias a Pack-3 (durabilidad de la exportación/ejecución de Godot, Experimento 3, paquete #3 de 3). La salida de la "gate" ahora incluye el publicador y los conteos específicos de la sección, junto con los conteos generales del paquete (F-43); se ha reformulado `no_source_cluster_monopoly` de una advertencia a un diagnóstico informativo (F-41). **El comportamiento de aprobación/rechazo no ha cambiado; los paquetes congelados existentes se verifican byte a byte.** 570/570 pruebas de vitest superadas. Consulte [CHANGELOG.md](CHANGELOG.md) y [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md).
157
180
 
package/README.fr.md CHANGED
@@ -7,7 +7,7 @@
7
7
  </p>
8
8
 
9
9
  <p align="center">
10
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.4.0"><img src="https://img.shields.io/badge/version-0.4.0-blue" alt="version 0.4.0"></a>
10
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
11
11
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
12
12
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
13
13
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,8 +149,31 @@ Ceci est une alternative structurée à *recherche → résumé → rapport dét
149
149
 
150
150
  `research-os` est une interface en ligne de commande qui fonctionne localement. Elle lit et écrit des fichiers dans le répertoire de l'ensemble de données que vous lui spécifiez, et (lorsque vous utilisez la commande `gather`), elle effectue des requêtes HTTP sortantes pour récupérer les URL de sources que vous fournissez. Elle ne fait pas fonctionner de serveur, n'accepte pas de connexions entrantes, ne stocke pas d'identifiants et n'envoie pas de données de télémétrie. Aucun secret n'est écrit dans les fichiers de l'ensemble de données. Consultez le fichier [SECURITY.md](SECURITY.md) pour connaître la politique de signalement des vulnérabilités.
151
151
 
152
+ ## Calibrage des évaluateurs
153
+
154
+ La version 0.5.0 rend le calibrage des évaluateurs plus fiable. Un profil d'évaluateur n'est pas considéré comme fiable simplement parce qu'il a été exécuté une fois ; il acquiert un statut grâce à des rapports structurés de défaillances simulées et à une agrégation de plusieurs exécutions.
155
+
156
+ **Aucun profil n'est actuellement admis comme étant une "baseline de confiance".** Les rapports canoniques dans le dépôt indiquent `hermes-two-pass=failed`, `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. Ceci est intentionnel : la confiance est acquise grâce à des preuves répétées de défaillances simulées, et non supposée.
157
+
158
+ Les rapports de calibrage se trouvent dans `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`. Chaque rapport enregistre les résultats PASS/FAIL pour sept critères, quatre étiquettes de statut (`trusted_baseline`, `conditional_pass`, `failed`, `comparison_only`), et indique honnêtement ce que le test ne peut pas vérifier (`needs_contradiction_mapping` est inaccessible depuis `seeded-v1`). Consultez [CHANGELOG.md](CHANGELOG.md).
159
+
160
+ ```bash
161
+ # Single-run calibration (quick local check)
162
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
163
+
164
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
165
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
166
+
167
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
168
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
169
+ ```
170
+
171
+ Lorsque l'option `--runs <n>` est utilisée, les rapports pour chaque exécution sont écrits dans `<profile>/runs/run-NNN.json`, et un rapport agrégé (avec des critères basés sur la médiane et la détection des défaillances récurrentes) est écrit dans `<profile>/seeded-v1.{json,md}`. Le rapport agrégé contient `receipt_kind: 'aggregate'` pour le distinguer des rapports d'une seule exécution. Le mode d'exécution unique (`--runs 1` ou omis) conserve le comportement d'écriture directe existant.
172
+
152
173
  ## Statut
153
174
 
175
+ **v0.5.0** — publié sur npm en tant que `@mcptoolshop/research-os@0.5.0`, le 10 mai 2026. La version 0.5.0 rend le calibrage des évaluateurs plus fiable. Un profil d'évaluateur n'est pas considéré comme fiable simplement parce qu'il a été exécuté une fois ; il acquiert un statut grâce à des rapports structurés de défaillances simulées et à une agrégation de plusieurs exécutions. Inclut : schéma de rapport de calibrage structuré (`seeded-v1.{json,md}`, validé par Zod, quatre étiquettes de statut) ; environnement d'exécution multi-exécutions (`--runs <n>`, isolation par exécution, critères PASS/FAIL basés sur la médiane, dégradation en cas de défaillances récurrentes) ; critère de vocabulaire de décision conscient de l'architecture ; recherche de rapports relative au paquet dans `review-promote`. **Aucune "baseline de confiance" admise :** `hermes-two-pass=failed` (agrégé, 3 exécutions), `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. research-os peut désormais refuser de faire confiance à un profil d'évaluateur lorsque les défaillances simulées répétées ne justifient pas la confiance. **Aucun changement concernant les "gates", les "freezes" ou les "lois de synthèse". Les quatre paquets figés vérifient l'identité des octets.** 671/671 tests vitest réussis. Consultez [CHANGELOG.md](CHANGELOG.md).
176
+
154
177
  **v0.4.0** — publié sur npm sous le nom `@mcptoolshop/research-os@0.4.0`, le 10 mai 2026. La version 0.4.0 assure la pérennité de l'identité de la source. Les règles de type de source déterministes gèrent la majorité reproductible, les registres de remplacement préservent les corrections de l'opérateur lors des nouvelles agrégations, et la commande `source-card audit` remplace les vérifications de dérive des scripts temporaires par une interface CLI de première classe. Comprend : un classificateur de type de source centralisé (Composant B — `classifySourceType`, 11 fournisseurs canoniques, `source-type-rules.json`); un registre de remplacement de carte source (Composant A — `source-card-overrides.jsonl`, commandes `validate` et `list`); et une CLI d'audit de carte source (Composant D — `research-os source-card audit --pack <dir>`, 7 types de résultats, artefacts JSON et Markdown, options `--apply --from` pour spécifier le chemin). Correction cosmétique F-46 : les manifestes de pack indiquent désormais la version binaire en cours d'utilisation plutôt que la version figée dans `research.yaml` lors de l'initialisation du pack. **Aucun changement concernant les mécanismes de contrôle, de blocage ou les lois de synthèse. Les quatre packs figés existants sont vérifiés de manière identique au niveau des octets.** 620/620 tests vitest réussis. Consultez le fichier [CHANGELOG.md](CHANGELOG.md) et la page du manuel d'audit de carte source : [https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/).
155
178
 
156
179
  **v0.3.3** — publié sur npm sous le nom `@mcptoolshop/research-os@0.3.3`, le 10 mai 2026. Comprend une clarification de la sémantique des contrôles, obtenue grâce au Pack-3 (durabilité de l'exportation/de l'exécution Godot, Expérience 3, pack n° 3 sur 3). La sortie du contrôle indique désormais les nombres de publications et de comptes primaires spécifiques à chaque section, en plus des nombres globaux du pack (F-43) ; le message `no_source_cluster_monopoly` a été modifié de "AVERTISSEMENT" à un diagnostic informatif (F-41). **Le comportement de réussite/échec n'a pas changé ; les packs figés existants sont vérifiés de manière identique au niveau des octets.** 570/570 tests vitest réussis. Consultez le fichier [CHANGELOG.md](CHANGELOG.md) et le fichier [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md).
package/README.hi.md CHANGED
@@ -7,7 +7,7 @@
7
7
  </p>
8
8
 
9
9
  <p align="center">
10
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.4.0"><img src="https://img.shields.io/badge/version-0.4.0-blue" alt="version 0.4.0"></a>
10
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
11
11
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
12
12
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
13
13
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,8 +149,46 @@ discover
149
149
 
150
150
  `research-os` एक स्थानीय-प्रथम कमांड-लाइन इंटरफेस है। यह उन शोध-पैकेज निर्देशिकाओं में फ़ाइलों को पढ़ता और लिखता है जिन्हें आप निर्दिष्ट करते हैं, और (जब `gather` का उपयोग किया जाता है) स्रोत यूआरएल प्राप्त करने के लिए बाहरी HTTP अनुरोध भेजता है जिन्हें आप प्रदान करते हैं। यह निम्नलिखित नहीं करता है: कोई सर्वर नहीं चलाता, इनकमिंग कनेक्शन स्वीकार नहीं करता, क्रेडेंशियल संग्रहीत नहीं करता, या टेलीमेट्री नहीं भेजता। किसी भी गुप्त जानकारी को पैकेज फ़ाइलों में नहीं लिखा जाता है। भेद्यता रिपोर्टिंग नीति के लिए [SECURITY.md](SECURITY.md) देखें।
151
151
 
152
+ ## समीक्षक कैलिब्रेशन
153
+
154
+ v0.5.0 समीक्षक कैलिब्रेशन को अधिक टिकाऊ बनाता है। किसी समीक्षक प्रोफाइल पर इसलिए भरोसा नहीं किया जाता क्योंकि
155
+ यह केवल एक बार चलाया गया था; यह संरचित, पूर्वनिर्धारित विफलता रिपोर्टों और
156
+ एकाधिक रनों के संयोजन के माध्यम से एक स्थिति प्राप्त करता है।
157
+
158
+ **वर्तमान में कोई भी प्रोफाइल `trusted_baseline` के रूप में स्वीकार नहीं किया गया है।** रिपॉजिटरी में मौजूद मानक रिपोर्टें
159
+ `hermes-two-pass=failed`, `mistral-nemo-two-pass=conditional_pass`,
160
+ `hermes-single-pass=comparison_only` दिखाती हैं। यह जानबूझकर किया गया है: विश्वास
161
+ बार-बार होने वाले पूर्वनिर्धारित विफलता के प्रमाणों के माध्यम से अर्जित किया जाता है, न कि केवल अनुमान लगाकर।
162
+
163
+ कैलिब्रेशन रिपोर्ट `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}` पर उपलब्ध हैं।
164
+ प्रत्येक रिपोर्ट में सात मानदंडों के विरुद्ध PASS/FAIL दर्ज किया जाता है, चार स्थिति लेबल
165
+ (`trusted_baseline`, `conditional_pass`, `failed`, `comparison_only`), और
166
+ यह ईमानदारी से बताता है कि फिक्स्चर किन चीज़ों का परीक्षण नहीं कर सकता (`needs_contradiction_mapping`
167
+ `seeded-v1` से दुर्गम है)। [CHANGELOG.md](CHANGELOG.md) देखें।
168
+
169
+ ```bash
170
+ # Single-run calibration (quick local check)
171
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
172
+
173
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
174
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
175
+
176
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
177
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
178
+ ```
179
+
180
+ जब `--runs <n>` का उपयोग किया जाता है, तो प्रत्येक रन के लिए रिपोर्ट `<profile>/runs/run-NNN.json` में लिखी जाती हैं
181
+ और एक एकत्रित रिपोर्ट (माध्यिका-आधारित मानदंडों और आवर्ती विफलता का पता लगाने के साथ)
182
+ `<profile>/seeded-v1.{json,md}` में लिखी जाती है। एकत्रित रिपोर्ट में `receipt_kind: 'aggregate'` होता है
183
+ जो इसे एकल-रन रिपोर्टों से अलग करता है। एकल-रन मोड (`--runs 1` या छोड़ा गया)
184
+ मौजूदा सीधे लिखने के व्यवहार को बनाए रखता है।
185
+
152
186
  ## स्थिति
153
187
 
188
+ **v0.5.0** — npm पर `@mcptoolshop/research-os@0.5.0` के रूप में प्रकाशित, 2026-05-10। v0.5.0 समीक्षक कैलिब्रेशन को अधिक टिकाऊ बनाता है। किसी समीक्षक प्रोफाइल पर इसलिए भरोसा नहीं किया जाता क्योंकि
189
+ यह केवल एक बार चलाया गया था; यह संरचित, पूर्वनिर्धारित विफलता रिपोर्टों और
190
+ एकाधिक रनों के संयोजन के माध्यम से एक स्थिति प्राप्त करता है। इसमें शामिल हैं: संरचित कैलिब्रेशन रिपोर्ट स्कीमा (`seeded-v1.{json,md}`, Zod द्वारा सत्यापित, चार स्थिति लेबल); बहु-रन प्रणाली (`--runs <n>`, प्रति-रन अलगाव, माध्यिका-आधारित PASS/FAIL मानदंड, आवर्ती विफलता पर पदावनति); आर्किटेक्चर-जागरूक निर्णय शब्दावली मानदंड; `review-promote` में पैकेज-सापेक्ष रिपोर्ट खोज। **कोई भी विश्वसनीय आधारभूत स्वीकार नहीं किया गया:** `hermes-two-pass=failed` (एकत्रित, 3 रन), `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`। research-os अब एक समीक्षक प्रोफाइल पर भरोसा करने से इनकार कर सकता है जब बार-बार होने वाली पूर्वनिर्धारित विफलताएं विश्वास का समर्थन नहीं करती हैं। **कोई गेट, फ्रीज या संश्लेषण-कानून परिवर्तन नहीं। सभी चार स्थिर पैकेजों को बाइट-समान रूप से सत्यापित किया गया है।** 671/671 vitest पास हो गया। [CHANGELOG.md](CHANGELOG.md) देखें।
191
+
154
192
  **v0.4.0** — npm पर `@mcptoolshop/research-os@0.4.0` के रूप में प्रकाशित, 2026-05-10। v0.4.0 स्रोत की पहचान को स्थायी बनाता है। नियतात्मक स्रोत-प्रकार नियम दोहराए जा सकने वाले अधिकांश मामलों को संभालते हैं, ओवरराइड लेजर ऑपरेटर द्वारा किए गए सुधारों को पुनः-संग्रहण के दौरान संरक्षित करते हैं, और `source-card audit` एक समर्पित कमांड-लाइन इंटरफेस के साथ "स्क्रैच-स्क्रिप्ट" विचलन जांचों को बदलता है। इसमें शामिल हैं: केंद्रीकृत स्रोत-प्रकार वर्गीकरण (घटक B — `classifySourceType`, 11 मानक विक्रेता, `source-type-rules.json`); स्रोत-कार्ड ओवरराइड लेजर (घटक A — `source-card-overrides.jsonl`, `validate` + `list` उप-कमांड); और स्रोत-कार्ड ऑडिट कमांड-लाइन इंटरफेस (घटक D — `research-os source-card audit --pack <dir>`, 7 प्रकार की खोज, JSON + Markdown प्रारूप, `--apply --from` पथ लागू करें)। F-46: एक सौंदर्य संबंधी सुधार: अब "पैक" मैनिफेस्ट में `research.yaml` में "पैक" शुरू करने के समय "फ्रीज" किए गए संस्करण के बजाय, लाइव बाइनरी संस्करण अंकित किया जाता है। **कोई "गेट", "फ्रीज" या "सिंथेसिस-लॉ" परिवर्तन नहीं। सभी चार मौजूदा "फ्रीज" किए गए "पैक" "वेरिफाई-पैक" बाइट-समान रूप से पास होते हैं।** 620/620 "विटेस्ट" पास। [CHANGELOG.md](CHANGELOG.md) और [स्रोत-कार्ड ऑडिट हैंडबुक पृष्ठ](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/) देखें।
155
193
 
156
194
  **v0.3.3** — npm पर `@mcptoolshop/research-os@0.3.3` के रूप में प्रकाशित, 2026-05-10। "पैक-3" (Godot निर्यात/रनटाइम स्थिरता, प्रयोग 3 का "पैक" #3) द्वारा अर्जित "गेट" अर्थ स्पष्टता शामिल है। "गेट" आउटपुट अब "पैक" के स्तर के साथ-साथ अनुभाग-स्तरीय प्रकाशक और प्राथमिक गणनाएं भी दिखाता है (F-43); `no_source_cluster_monopoly` को चेतावनी से बदलकर सूचनात्मक निदान में बदल दिया गया है (F-41)। **पास/फेल व्यवहार अपरिवर्तित है; मौजूदा "फ्रीज" किए गए "पैक" "वेरिफाई-पैक" बाइट-समान रूप से पास होते हैं।** 570/570 "विटेस्ट" पास। [CHANGELOG.md](CHANGELOG.md) और [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md) देखें।
package/README.it.md CHANGED
@@ -7,7 +7,7 @@
7
7
  </p>
8
8
 
9
9
  <p align="center">
10
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.4.0"><img src="https://img.shields.io/badge/version-0.4.0-blue" alt="version 0.4.0"></a>
10
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
11
11
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
12
12
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
13
13
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,9 +149,32 @@ Questa è l'alternativa strutturale a *ricerca → riepilogo → report dettagli
149
149
 
150
150
  `research-os` è un'interfaccia a riga di comando (CLI) locale. Legge e scrive file all'interno della directory del pacchetto di ricerca a cui la si indica e, quando si utilizza la funzione "gather", effettua richieste HTTP in uscita per recuperare gli URL delle fonti fornite. Non esegue un server, non accetta connessioni in entrata, non memorizza credenziali né invia dati di telemetria. Nessun segreto viene scritto negli artefatti del pacchetto. Consultare [SECURITY.md](SECURITY.md) per la politica di segnalazione delle vulnerabilità.
151
151
 
152
+ ## Calibrazione dei revisori
153
+
154
+ La versione 0.5.0 rende la calibrazione dei revisori più affidabile. Un profilo di revisore non è considerato affidabile solo perché è stato eseguito una volta; acquisisce uno stato attraverso ricevute strutturate che segnalano errori simulati e aggregazioni di esecuzioni multiple.
155
+
156
+ **Nessun profilo è attualmente considerato come "baseline affidabile".** Le ricevute di riferimento nel repository mostrano `hermes-two-pass=failed`, `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. Questo è intenzionale: l'affidabilità si guadagna attraverso prove ripetute di errori simulati, non viene data per scontata.
157
+
158
+ Le ricevute di calibrazione si trovano in `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`. Ogni ricevuta registra i risultati PASS/FAIL rispetto a sette criteri, quattro etichette di stato (`trusted_baseline`, `conditional_pass`, `failed`, `comparison_only`), e indica onestamente cosa il test non può verificare (`needs_contradiction_mapping` non è raggiungibile da `seeded-v1`). Consultare [CHANGELOG.md](CHANGELOG.md).
159
+
160
+ ```bash
161
+ # Single-run calibration (quick local check)
162
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
163
+
164
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
165
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
166
+
167
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
168
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
169
+ ```
170
+
171
+ Quando si utilizza l'opzione `--runs <n>`, le ricevute per ogni esecuzione vengono scritte in `<profile>/runs/run-NNN.json` e una ricevuta aggregata (con barre basate sulla mediana e rilevamento di errori ricorrenti) viene scritta in `<profile>/seeded-v1.{json,md}`. La ricevuta aggregata contiene `receipt_kind: 'aggregate'` per distinguerla dalle ricevute di singola esecuzione. La modalità di singola esecuzione (`--runs 1` o omessa) mantiene il comportamento esistente di scrittura diretta.
172
+
152
173
  ## Stato
153
174
 
154
- **v0.4.0** — Pubblicata su npm come `@mcptoolshop/research-os@0.4.0` il 10 maggio 2026. La versione 0.4.0 garantisce la persistenza dell'identità della sorgente. Le regole deterministiche per il tipo di sorgente gestiscono la ripetibilità, i registri di sovrascrittura preservano le correzioni dell'operatore durante le nuove aggregazioni, e il comando `source-card audit` sostituisce i controlli di deriva degli script con un'interfaccia CLI dedicata. Include: un classificatore centralizzato per il tipo di sorgente (Componente B — `classifySourceType`, 11 fornitori standard, `source-type-rules.json`); un registro di sovrascrittura per le sorgenti (Componente A `source-card-overrides.jsonl`, comandi secondari `validate` e `list`); e un'interfaccia CLI per l'audit delle sorgenti (Componente D `research-os source-card audit --pack <dir>`, 7 tipi di anomalie, artefatti JSON e Markdown, opzioni `--apply --from` per l'applicazione del percorso). Correzione estetica F-46: i manifest dei pacchetti ora indicano la versione binaria effettiva, anziché la versione "congelata" in `research.yaml` durante la creazione del pacchetto. **Nessuna modifica alle funzionalità di controllo, "freeze" o alle regole di sintesi. Tutti e quattro i pacchetti esistenti vengono verificati byte per byte.** 620 test superati su 620 con vitest. Consultare il file [CHANGELOG.md](CHANGELOG.md) e la pagina del manuale sull'audit delle sorgenti: [https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/).
175
+ **v0.5.0** — pubblicata su npm come `@mcptoolshop/research-os@0.5.0`, 10 maggio 2026. La versione 0.5.0 rende la calibrazione dei revisori più affidabile. Un profilo di revisore non è considerato affidabile solo perché è stato eseguito una volta; acquisisce uno stato attraverso ricevute strutturate che segnalano errori simulati e aggregazioni di esecuzioni multiple. Include: schema di ricevuta di calibrazione strutturato (`seeded-v1.{json,md}`, convalidato da Zod, quattro etichette di stato); meccanismo di esecuzione multi-run (`--runs <n>`, isolamento per esecuzione, barre PASS/FAIL basate sulla mediana, declassamento per errori ricorrenti); barra di vocabolario decisionale consapevole dell'architettura; ricerca di ricevute relativa al pacchetto in `review-promote`. **Nessuna baseline affidabile accettata:** `hermes-two-pass=failed` (aggregata, 3 esecuzioni), `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. research-os può ora rifiutare di considerare affidabile un profilo di revisore quando ripetuti errori simulati non supportano l'affidabilità. **Nessuna modifica alle gate, al freeze o alle leggi di sintesi. Tutti e quattro i pacchetti congelati superano verify-pack in modo identico byte per byte.** 671/671 test vitest superati. Consultare [CHANGELOG.md](CHANGELOG.md).
176
+
177
+ **v0.4.0** — pubblicata su npm come `@mcptoolshop/research-os@0.4.0`, 10 maggio 2026. La versione 0.4.0 rende l'identità della sorgente più affidabile. Le regole deterministiche del tipo di sorgente gestiscono la maggioranza ripetibile, i ledger di override preservano le correzioni dell'operatore durante il ri-raccolta, e l'audit della "source-card" sostituisce i controlli di deriva degli script con un'interfaccia CLI dedicata. Include: classificatore centralizzato del tipo di sorgente (Componente B — `classifySourceType`, 11 fornitori canonici, `source-type-rules.json`); ledger di override della source-card (Componente A — `source-card-overrides.jsonl`, comandi `validate` e `list`); e CLI di audit della source-card (Componente D — `research-os source-card audit --pack <dir>`, 7 tipi di rilevamento, artefatti JSON + Markdown, opzioni `--apply --from` per l'applicazione). Correzione cosmetica F-46: i manifest dei pacchetti ora stampano la versione binaria corrente anziché la versione congelata in `research.yaml` durante l'inizializzazione del pacchetto. **Nessuna modifica alle gate, al freeze o alle leggi di sintesi. Tutti e quattro i pacchetti esistenti verificano l'integrità dei byte.** 620/620 test vitest superati. Consultare [CHANGELOG.md](CHANGELOG.md) e la [pagina del manuale dell'audit della source-card](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/).
155
178
 
156
179
  **v0.3.3** — Pubblicata su npm come `@mcptoolshop/research-os@0.3.3` il 10 maggio 2026. Include miglioramenti nella chiarezza delle semantiche delle "gate", ottenuti grazie al Pack-3 (durabilità dell'esportazione/runtime di Godot, Esperimento 3, pacchetto n. 3 su 3). L'output della "gate" ora include il publisher e i conteggi specifici della sezione, oltre ai conteggi globali del pacchetto (F-43); la dicitura di `no_source_cluster_monopoly` è stata modificata da AVVISO a diagnostica informativa (F-41). **Il comportamento di successo/fallimento rimane invariato; i pacchetti esistenti vengono verificati byte per byte.** 570 test vitest su 570 superati. Consultare [CHANGELOG.md](CHANGELOG.md) e [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md).
157
180
 
package/README.ja.md CHANGED
@@ -7,7 +7,7 @@
7
7
  </p>
8
8
 
9
9
  <p align="center">
10
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.4.0"><img src="https://img.shields.io/badge/version-0.4.0-blue" alt="version 0.4.0"></a>
10
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
11
11
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
12
12
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
13
13
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,8 +149,31 @@ discover
149
149
 
150
150
  `research-os` は、ローカル環境で動作するCLIです。このツールは、指定された研究パッケージのディレクトリ内のファイルを読み書きし、`gather` コマンドを使用する場合、提供されたソースコードのURLから情報を取得するために、HTTPリクエストを送信します。このツールは、サーバーを起動したり、外部からの接続を受け付けたり、認証情報を保存したり、テレメトリデータを送信したりすることはありません。また、機密情報はパッケージのファイルに書き込まれません。脆弱性に関する報告については、[SECURITY.md](SECURITY.md) を参照してください。
151
151
 
152
+ ## レビュー担当者の評価調整機能
153
+
154
+ v0.5.0では、レビュー担当者の評価調整機能がより安定しました。レビュー担当者のプロファイルは、単に一度実行されたというだけで信頼されるわけではありません。構造化されたテストケースの失敗結果と、複数回の実行結果の集計によって、信頼度合いが評価されます。
155
+
156
+ **現在、どのプロファイルも`trusted_baseline`として認められていません。** リポジトリ内の標準的なテスト結果では、`hermes-two-pass`が`failed`、`mistral-nemo-two-pass`が`conditional_pass`、`hermes-single-pass`が`comparison_only`となっています。これは意図的なものです。信頼は、単なる仮定ではなく、繰り返されるテストケースの失敗結果によって獲得されます。
157
+
158
+ 評価結果は、`calibration/reviewer-profiles/<プロファイル名>/seeded-v1.{json,md}`に保存されています。各評価結果は、7つの項目に対する合否、4つのステータスラベル(`trusted_baseline`、`conditional_pass`、`failed`、`comparison_only`)、およびテストが実行できない状況を正直に報告します(`needs_contradiction_mapping`は`seeded-v1`からは到達できません)。詳細は[CHANGELOG.md](CHANGELOG.md)を参照してください。
159
+
160
+ ```bash
161
+ # Single-run calibration (quick local check)
162
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
163
+
164
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
165
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
166
+
167
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
168
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
169
+ ```
170
+
171
+ `--runs <n>`オプションを使用すると、各実行結果が`<プロファイル名>/runs/run-NNN.json`に保存され、集計結果(中央値に基づいた合否判定と、繰り返し発生するエラーの検出を含む)が`<プロファイル名>/seeded-v1.{json,md}`に保存されます。集計結果には、`receipt_kind: 'aggregate'`という情報が含まれており、これにより単一実行の結果と区別されます。単一実行モード(`--runs 1`または省略)では、既存の直接書き込み動作が維持されます。
172
+
152
173
  ## ステータス
153
174
 
175
+ **v0.5.0** — npmで`@mcptoolshop/research-os@0.5.0`として公開。2026年5月10日。v0.5.0では、レビュー担当者の評価調整機能がより安定しました。レビュー担当者のプロファイルは、単に一度実行されたというだけで信頼されるわけではありません。構造化されたテストケースの失敗結果と、複数回の実行結果の集計によって、信頼度合いが評価されます。変更点:構造化された評価結果スキーマ(`seeded-v1.{json,md}`、Zodによる検証、4つのステータスラベル)、複数実行機能(`--runs <n>`、各実行の分離、中央値に基づいた合否判定、繰り返し発生するエラーの検出)、アーキテクチャを考慮した判定語彙、`review-promote`におけるパッケージ相対パスでの評価結果参照。**信頼できるベースラインは存在しません:** `hermes-two-pass=failed`(集計、3回の実行)、`mistral-nemo-two-pass=conditional_pass`、`hermes-single-pass=comparison_only`。research-osは、繰り返し発生するテストケースの失敗結果が信頼を裏付けない場合、レビュー担当者のプロファイルを信頼しないようにすることができます。**ゲート、フリーズ、または合成ルールに関する変更はありません。すべての4つのフリーズされたパッケージが、バイト単位で完全に同一であることを確認しました。** 671/671のvitestが成功しました。詳細は[CHANGELOG.md](CHANGELOG.md)を参照してください。
176
+
154
177
  **v0.4.0** — npmに`@mcptoolshop/research-os@0.4.0`として公開。2026年5月10日。v0.4.0では、ソースの同一性を維持できるようになりました。決定論的なソースタイプルールにより、再現可能な多数が処理され、オーバーライドされたレジャーにより、再収集時のオペレーターによる修正が保持され、`source-card audit`コマンドが、従来のスクリプトのずれチェックを置き換え、より使いやすいCLIインターフェースを提供します。同梱内容:集中型のソースタイプ分類器(コンポーネントB — `classifySourceType`、11種類のベンダー、`source-type-rules.json`)、ソースカードのオーバーライドレジャー(コンポーネントA — `source-card-overrides.jsonl`、`validate`および`list`サブコマンド)、およびソースカード監査CLI(コンポーネントD — `research-os source-card audit --pack <dir>`、7種類の検出結果、JSONおよびMarkdown形式のレポート、`--apply --from`による適用パス)。F-46:見た目の修正。パッケージのマニフェストには、`research.yaml`に固定されたバージョンではなく、実行中のバイナリのバージョンが記録されるようになりました。**ゲート、フリーズ、または合成に関する変更はありません。既存のすべてのパッケージは、バイト単位で同一であることを検証済みです。** 620/620のvitestテストが合格しました。詳細は[CHANGELOG.md](CHANGELOG.md)および[ソースカード監査に関するハンドブック](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/)を参照してください。
155
178
 
156
179
  **v0.3.3** — npmに`@mcptoolshop/research-os@0.3.3`として公開。2026年5月10日。Pack-3(Godotのエクスポート/ランタイムの安定性、実験3のパッケージ#3)によって得られた、ゲートのセマンティクスに関する明確化が含まれています。ゲートの出力には、セクションごとのパブリッシャー数と主要なカウントに加えて、パッケージ全体のカウントも表示されます(F-43)。`no_source_cluster_monopoly`は、警告から情報診断に変更されました(F-41)。**合格/不合格の動作は変更されていません。既存のパッケージは、バイト単位で同一であることを検証済みです。** 570/570のvitestテストが合格しました。詳細は[CHANGELOG.md](CHANGELOG.md)および[`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md)を参照してください。
package/README.md CHANGED
@@ -7,7 +7,7 @@
7
7
  </p>
8
8
 
9
9
  <p align="center">
10
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.4.0"><img src="https://img.shields.io/badge/version-0.4.0-blue" alt="version 0.4.0"></a>
10
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
11
11
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
12
12
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
13
13
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,8 +149,44 @@ This is the structural alternative to *search → summarize → pretty report*.
149
149
 
150
150
  `research-os` is a local-first CLI. It reads and writes files within the research-pack directory you point it at, and (when using `gather`) issues outbound HTTP requests to fetch source URLs you provide. It does not: run a server, accept inbound connections, store credentials, or send telemetry. No secrets are written to pack artifacts. See [SECURITY.md](SECURITY.md) for the vulnerability reporting policy.
151
151
 
152
+ ## Reviewer calibration
153
+
154
+ v0.5.0 makes reviewer calibration durable. A reviewer profile is not trusted because
155
+ it ran once; it earns a status through structured seeded-failure receipts and
156
+ multi-run aggregation.
157
+
158
+ **No profile is currently admitted as `trusted_baseline`.** The canonical receipts
159
+ in the repo show `hermes-two-pass=failed`, `mistral-nemo-two-pass=conditional_pass`,
160
+ `hermes-single-pass=comparison_only`. This is intentional: trust is earned through
161
+ repeated seeded-failure evidence, not assumed.
162
+
163
+ Calibration receipts live at `calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`.
164
+ Each receipt records PASS/FAIL against seven bars, carries one of four status labels
165
+ (`trusted_baseline`, `conditional_pass`, `failed`, `comparison_only`), and
166
+ honestly discloses what the fixture cannot test (`needs_contradiction_mapping`
167
+ is unreachable from `seeded-v1`). See [CHANGELOG.md](CHANGELOG.md).
168
+
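The four status labels and the receipt location described above could be modelled roughly as follows. This is a minimal sketch, not the package's actual schema (which is Zod-validated): the helper names `isStatusLabel` and `receiptPaths` are hypothetical, and only the label strings and the path pattern come from the README text.

```typescript
// Hypothetical sketch: only the four label strings and the path pattern are
// taken from the README; the helper names are illustrative assumptions.
const STATUS_LABELS = [
  'trusted_baseline',
  'conditional_pass',
  'failed',
  'comparison_only',
] as const;

type StatusLabel = (typeof STATUS_LABELS)[number];

// Runtime guard: true only for one of the four documented labels.
function isStatusLabel(value: unknown): value is StatusLabel {
  return (
    typeof value === 'string' &&
    (STATUS_LABELS as readonly string[]).indexOf(value) !== -1
  );
}

// Builds the documented receipt paths for a profile name.
function receiptPaths(profile: string): { json: string; md: string } {
  const base = `calibration/reviewer-profiles/${profile}/seeded-v1`;
  return { json: `${base}.json`, md: `${base}.md` };
}
```

A caller would typically parse the JSON receipt, check its status field with a guard like `isStatusLabel`, and treat anything other than `trusted_baseline` as not yet trusted.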
169
+ ```bash
170
+ # Single-run calibration (quick local check)
171
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
172
+
173
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
174
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
175
+
176
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
177
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
178
+ ```
179
+
180
+ When `--runs <n>` is used, per-run receipts are written to `<profile>/runs/run-NNN.json`
181
+ and an aggregate receipt (with median-based bars and recurring-failure detection) is written
182
+ to `<profile>/seeded-v1.{json,md}`. The aggregate receipt carries `receipt_kind: 'aggregate'`
183
+ to distinguish it from single-run receipts. Single-run mode (`--runs 1` or omitted) preserves
184
+ the existing direct-write behavior.
185
+
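The median-based bars and recurring-failure demotion can be sketched as below. This is an illustration under stated assumptions, not the code in `src/calibration/aggregate.ts`. It covers only two of the seven bars, using the thresholds given in the changelog: FP ceiling (median ≤1 and max ≤2) and any-flag recall (median ≥65%). The `RunSummary` fields, the `aggregate` function, and the mapping of a failed median rule to `failed` are hypothetical.

```typescript
// Illustrative only: field names and the final status mapping are assumptions.
interface RunSummary {
  falsePositives: number;   // false-positive count in this run
  anyFlagRecallPct: number; // any-flag recall as a percentage (0-100)
  failedBars: string[];     // bars this individual run failed
}

type AggregateStatus = 'trusted_baseline' | 'conditional_pass' | 'failed';

function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of an empty list is undefined');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Bars that failed in at least ceil(N/2) individual runs.
function recurringFailures(runs: RunSummary[]): string[] {
  const threshold = Math.ceil(runs.length / 2);
  const counts: Record<string, number> = {};
  for (const run of runs) {
    const seen: Record<string, boolean> = {}; // count each bar once per run
    for (const bar of run.failedBars) {
      if (!seen[bar]) {
        seen[bar] = true;
        counts[bar] = (counts[bar] || 0) + 1;
      }
    }
  }
  return Object.keys(counts).filter((bar) => counts[bar] >= threshold);
}

function aggregate(runs: RunSummary[]) {
  const fps = runs.map((r) => r.falsePositives);
  const fpOk = median(fps) <= 1 && Math.max(...fps) <= 2;
  const recallOk = median(runs.map((r) => r.anyFlagRecallPct)) >= 65;
  const recurring = recurringFailures(runs);

  let status: AggregateStatus;
  if (!fpOk || !recallOk) status = 'failed';
  else if (recurring.length > 0) status = 'conditional_pass'; // demoted
  else status = 'trusted_baseline';

  return { receipt_kind: 'aggregate' as const, runs: runs.length, status, recurring };
}
```

The demotion check runs only after the median rules pass, so a profile whose median clears the bar can still be held at `conditional_pass` when half or more of its runs failed that same bar individually.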
152
186
  ## Status
153
187
 
188
+ **v0.5.0** — published to npm as `@mcptoolshop/research-os@0.5.0`, 2026-05-10. v0.5.0 makes reviewer calibration durable. A reviewer profile is not trusted because it ran once; it earns a status through structured seeded-failure receipts and multi-run aggregation. Ships: structured calibration receipt schema (`seeded-v1.{json,md}`, Zod-validated, four status labels); multi-run harness (`--runs <n>`, per-run isolation, median-based PASS/FAIL bars, recurring-failure demotion); architecture-aware decision-vocab bar; pack-relative receipt lookup in `review-promote`. **No trusted baseline admitted:** `hermes-two-pass=failed` (aggregate, 3 runs), `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. research-os can now refuse to trust a reviewer profile when repeated seeded failures do not support trust. **No gate, freeze, or synthesis-law changes. All four frozen packs verify-pack byte-identically.** 671/671 vitest passing. See [CHANGELOG.md](CHANGELOG.md).
189
+
154
190
  **v0.4.0** — published to npm as `@mcptoolshop/research-os@0.4.0`, 2026-05-10. v0.4.0 makes source identity durable. Deterministic source-type rules handle the repeatable majority, override ledgers preserve operator corrections across re-gather, and `source-card audit` replaces scratch-script drift checks with a first-class CLI surface. Ships: centralized source-type classifier (Component B — `classifySourceType`, 11 canonical vendors, `source-type-rules.json`); source-card override ledger (Component A — `source-card-overrides.jsonl`, `validate` + `list` subcommands); and source-card audit CLI (Component D — `research-os source-card audit --pack <dir>`, 7 finding kinds, JSON + Markdown artifacts, `--apply --from` apply path). F-46 cosmetic fix: pack manifests now stamp the live binary version rather than the version frozen into `research.yaml` at pack-init. **No gate, freeze, or synthesis-law changes. All four existing frozen packs verify-pack byte-identically.** 620/620 vitest passing. See [CHANGELOG.md](CHANGELOG.md) and the [source-card audit handbook page](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/).
155
191
 
156
192
  **v0.3.3** — published to npm as `@mcptoolshop/research-os@0.3.3`, 2026-05-10. Ships gate-semantics clarity earned by Pack-3 (Godot export/runtime durability, Experiment 3 pack #3 of 3). Gate output now carries section-scoped publisher + primary counts alongside pack-wide counts (F-43); `no_source_cluster_monopoly` reworded from WARN to informational diagnostic (F-41). **Pass/fail behavior unchanged; existing frozen packs verify-pack byte-identically.** 570/570 vitest passing. See [CHANGELOG.md](CHANGELOG.md) and [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md).
package/README.pt-BR.md CHANGED
@@ -7,7 +7,7 @@
7
7
  </p>
8
8
 
9
9
  <p align="center">
10
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.4.0"><img src="https://img.shields.io/badge/version-0.4.0-blue" alt="version 0.4.0"></a>
10
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
11
11
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
12
12
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
13
13
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,8 +149,31 @@ Esta é a alternativa estrutural para *pesquisar → resumir → gerar relatóri
149
149
 
150
150
  `research-os` é uma ferramenta de linha de comando que opera localmente. Ela lê e grava arquivos dentro do diretório do pacote de pesquisa que você especificar e, quando usa o comando `gather`, faz solicitações HTTP para buscar URLs de origem que você fornecer. Ela não: executa um servidor, aceita conexões de entrada, armazena credenciais ou envia dados de telemetria. Nenhum segredo é gravado nos arquivos do pacote. Consulte [SECURITY.md](SECURITY.md) para a política de relatório de vulnerabilidades.
151
151
 
152
+ ## Calibração de revisores
153
+
154
+ A versão v0.5.0 torna a calibração de revisores mais robusta. Um perfil de revisor não é considerado confiável apenas porque foi executado uma vez; ele adquire um status através de relatórios estruturados de falhas simuladas e agregação de múltiplas execuções.
155
+
156
+ **Atualmente, nenhum perfil é considerado como "baseline confiável".** Os relatórios canônicos no repositório mostram `hermes-two-pass=failed`, `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. Isso é intencional: a confiança é conquistada através de evidências repetidas de falhas simuladas, e não é presumida.
157
+
158
+ Os relatórios de calibração estão localizados em `calibration/reviewer-profiles/<perfil>/seeded-v1.{json,md}`. Cada relatório registra PASS/FAIL em relação a sete critérios, quatro rótulos de status (`trusted_baseline`, `conditional_pass`, `failed`, `comparison_only`), e revela honestamente o que o teste não consegue verificar (`needs_contradiction_mapping` é inacessível a partir de `seeded-v1`). Consulte [CHANGELOG.md](CHANGELOG.md).
159
+
160
+ ```bash
161
+ # Single-run calibration (quick local check)
162
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
163
+
164
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
165
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
166
+
167
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
168
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
169
+ ```
170
+
171
+ Quando `--runs <n>` é usado, os relatórios de cada execução são gravados em `<perfil>/runs/run-NNN.json` e um relatório agregado (com critérios baseados na mediana e detecção de falhas recorrentes) é gravado em `<perfil>/seeded-v1.{json,md}`. O relatório agregado contém `receipt_kind: 'aggregate'` para diferenciá-lo dos relatórios de execução única. O modo de execução única (`--runs 1` ou omitido) preserva o comportamento de gravação direta existente.
172
+
152
173
  ## Status
153
174
 
175
+ **v0.5.0** — publicado no npm como `@mcptoolshop/research-os@0.5.0`, 10 de maio de 2026. A versão v0.5.0 torna a calibração de revisores mais robusta. Um perfil de revisor não é considerado confiável apenas porque foi executado uma vez; ele adquire um status através de relatórios estruturados de falhas simuladas e agregação de múltiplas execuções. Inclui: esquema de relatório de calibração estruturado (`seeded-v1.{json,md}`, validado com Zod, quatro rótulos de status); sistema de execução de múltiplas execuções (`--runs <n>`, isolamento por execução, critérios PASS/FAIL baseados na mediana, detecção de falhas recorrentes); critério de avaliação baseado na arquitetura; pesquisa de relatórios relativa ao pacote em `review-promote`. **Nenhum baseline confiável admitido:** `hermes-two-pass=failed` (agregado, 3 execuções), `mistral-nemo-two-pass=conditional_pass`, `hermes-single-pass=comparison_only`. O research-os agora pode recusar a confiança em um perfil de revisor quando falhas simuladas repetidas não suportam a confiança. **Nenhuma alteração nos gates, congelamentos ou leis de síntese. Todos os quatro pacotes congelados verificam a identidade dos bytes.** 671/671 testes vitest aprovados. Consulte [CHANGELOG.md](CHANGELOG.md).
176
+
154
177
  **v0.4.0** — Publicada no npm como `@mcptoolshop/research-os@0.4.0`, 10 de maio de 2026. A versão 0.4.0 garante a durabilidade da identidade da fonte. Regras determinísticas para o tipo de fonte lidam com a maioria repetível, os registros de substituição preservam as correções do operador durante a re-coleta, e o comando `source-card audit` substitui as verificações de derivação de scripts por uma interface de linha de comando (CLI) completa. Inclui: um classificador centralizado de tipo de fonte (Componente B — `classifySourceType`, 11 fornecedores padrão, `source-type-rules.json`); um registro de substituição de cartão de fonte (Componente A — `source-card-overrides.jsonl`, subcomandos `validate` e `list`); e uma CLI para auditoria de cartão de fonte (Componente D — `research-os source-card audit --pack <dir>`, 7 tipos de detecção, artefatos JSON + Markdown, opções `--apply --from` para aplicar o caminho). Correção estética F-46: os arquivos de manifesto agora indicam a versão binária em execução, em vez da versão fixada no arquivo `research.yaml` durante a inicialização da criação do pacote. **Não há alterações nas regras de validação, congelamento ou síntese. Todos os quatro pacotes existentes passam na verificação de integridade byte a byte.** 620/620 testes vitest aprovados. Consulte o arquivo [CHANGELOG.md](CHANGELOG.md) e a página do manual de auditoria de cartão de fonte: [https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/).
155
178
 
156
179
  **v0.3.3** — Publicado no npm como `@mcptoolshop/research-os@0.3.3`, 10 de maio de 2026. Inclui melhorias na clareza da semântica das "gates" obtidas com o Pack-3 (durabilidade da exportação/runtime do Godot, Experimento 3, pacote nº 3 de 3). A saída da "gate" agora inclui contadores específicos da seção, além dos contadores globais (F-43); a mensagem `no_source_cluster_monopoly` foi alterada de um aviso para um diagnóstico informativo (F-41). **O comportamento de aprovação/reprovação não foi alterado; os pacotes congelados existentes são verificados byte a byte.** 570/570 testes do vitest passaram. Consulte o arquivo [CHANGELOG.md](CHANGELOG.md) e o arquivo [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md).
package/README.zh.md CHANGED
@@ -7,7 +7,7 @@
7
7
  </p>
8
8
 
9
9
  <p align="center">
10
- <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.4.0"><img src="https://img.shields.io/badge/version-0.4.0-blue" alt="version 0.4.0"></a>
10
+ <a href="https://github.com/mcp-tool-shop-org/research-os/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/version-0.5.0-blue" alt="version 0.5.0"></a>
11
11
  <a href="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/research-os/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
12
12
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a>
13
13
  <img src="https://img.shields.io/badge/node-%E2%89%A520-brightgreen" alt="Node ≥20">
@@ -149,9 +149,32 @@ discover
149
149
 
150
150
  `research-os` 是一个本地优先的命令行工具。它在您指定的“研究包”目录中读取和写入文件,并在使用 `gather` 命令时,会向外部发送 HTTP 请求以获取您提供的来源 URL。它不会:运行服务器、接受传入连接、存储凭据或发送遥测数据。任何敏感信息都不会写入到包文件中。请参阅 [SECURITY.md](SECURITY.md),了解漏洞报告政策。
151
151
 
152
+ ## 评审员校准
153
+
154
+ v0.5.0版本使评审员校准更加可靠。评审员配置文件不会因为只运行一次而被信任,而是通过结构化的、带有预设错误的测试结果和多次运行的聚合来获得信任状态。
155
+
156
+ **目前没有任何配置文件被认为是`trusted_baseline`(可信基线)。** 仓库中的标准测试结果显示`hermes-two-pass=failed`(失败),`mistral-nemo-two-pass=conditional_pass`(条件通过),`hermes-single-pass=comparison_only`(仅供比较)。这是有意为之:信任是通过反复的、带有预设错误的结果来获得的,而不是默认信任。
157
+
158
+ 校准结果文件位于`calibration/reviewer-profiles/<profile>/seeded-v1.{json,md}`。每个结果文件记录了针对七个方面的PASS/FAIL(通过/失败)结果,四个状态标签(`trusted_baseline`、`conditional_pass`、`failed`、`comparison_only`),并诚实地披露了测试框架无法测试的内容(`needs_contradiction_mapping`无法从`seeded-v1`访问)。请参阅[CHANGELOG.md](CHANGELOG.md)。
159
+
160
+ ```bash
161
+ # Single-run calibration (quick local check)
162
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
163
+
164
+ # Multi-run aggregate calibration (canonical evidence — 3 runs, median-based PASS/FAIL)
165
+ node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
166
+
167
+ # Promote a section's review — auto-populates calibration_summary from pack-relative receipt
168
+ research-os review-promote 01-section --pack <pack> --profile hermes-two-pass
169
+ ```
170
+
171
+ 当使用`--runs <n>`参数时,每个运行的结果文件会被写入到`<profile>/runs/run-NNN.json`,并且会生成一个聚合结果文件(包含基于中位数的PASS/FAIL结果,以及重复失败检测),写入到`<profile>/seeded-v1.{json,md}`。聚合结果文件包含`receipt_kind: 'aggregate'`,用于区分单次运行的结果文件。单次运行模式(`--runs 1`或省略)会保留现有的直接写入行为。
172
+
152
173
  ## 状态
153
174
 
154
- **v0.4.0** 版本已发布到 npm,包名为 `@mcptoolshop/research-os@0.4.0`,发布日期为 2026年5月10日。 v0.4.0 版本增强了源标识的持久性。 确定性的源类型规则处理可重复的大部分情况,覆盖账本保留了操作员的修正,即使在重新收集数据时也能生效,并且 `source-card audit` 命令取代了对临时脚本漂移的检查,提供了一个更完善的命令行界面。 包含内容:集中式的源类型分类器(组件 B,`classifySourceType`,11个标准供应商,`source-type-rules.json`);源卡覆盖账本(组件 A,`source-card-overrides.jsonl`,`validate` 和 `list` 子命令);以及源卡审计命令行工具(组件 D,`research-os source-card audit --pack <dir>`,7种检测类型,JSON 和 Markdown 格式的报告,`--apply --from` 参数用于指定应用路径)。 F-46:修复了外观问题,现在打包清单会记录实际的二进制版本,而不是 `research.yaml` 文件中固定的版本。 **没有对安全机制、冻结机制或合成规则进行任何更改。所有四个现有的冻结包都经过了字节级别的完全一致性验证。** 620/620 vitest 测试通过。 详情请参考 [CHANGELOG.md](CHANGELOG.md) 文件以及 [源卡审计手册](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/) 页面。
175
+ **v0.5.0** 发布到npm,版本号为`@mcptoolshop/research-os@0.5.0`,发布日期:2026-05-10。v0.5.0版本使评审员校准更加可靠。评审员配置文件不会因为只运行一次而被信任,而是通过结构化的、带有预设错误的测试结果和多次运行的聚合来获得信任状态。包含:结构化的校准结果模式(`seeded-v1.{json,md}`,经过Zod验证,包含四个状态标签);多运行测试框架(`--runs <n>`,每个运行隔离,基于中位数的PASS/FAIL结果,重复失败降级);能够感知架构的决策词汇表;在`review-promote`中进行包相关的结果文件查找。**没有可信的基线:** `hermes-two-pass=failed`(聚合,3次运行),`mistral-nemo-two-pass=conditional_pass`,`hermes-single-pass=comparison_only`。当反复的、带有预设错误的测试结果不支持信任时,research-os现在可以拒绝信任评审员配置文件。**没有对网关、冻结或合成规则的更改。所有四个现有的冻结包都以字节级别的相同方式进行验证。** 671/671个vitest测试通过。请参阅[CHANGELOG.md](CHANGELOG.md)。
176
+
177
+ **v0.4.0** — 发布到npm,版本号为`@mcptoolshop/research-os@0.4.0`,发布日期:2026-05-10。v0.4.0版本使源代码身份更加可靠。基于确定性的源代码类型规则处理可重复的多数情况,覆盖账本保留了操作员的更正,并且`source-card audit`(源代码卡审计)取代了对临时脚本漂移的检查,提供了一个一流的命令行界面。包含:集中式的源代码类型分类器(组件B — `classifySourceType`,11个标准供应商,`source-type-rules.json`);源代码卡覆盖账本(组件A — `source-card-overrides.jsonl`,`validate` + `list`子命令);以及源代码卡审计命令行界面(组件D — `research-os source-card audit --pack <dir>`,7种发现类型,JSON + Markdown格式,`--apply --from`用于应用路径)。F-46:一个小的修复,现在包清单会记录实际的二进制版本,而不是冻结在`research.yaml`中的版本,该版本在包初始化时被冻结。**没有对网关、冻结或合成规则的更改。所有四个现有的冻结包都以字节级别的相同方式进行验证。** 620/620个vitest测试通过。请参阅[CHANGELOG.md](CHANGELOG.md)以及[源代码卡审计手册页面](https://mcp-tool-shop-org.github.io/research-os/handbook/source-card-audit/)。
155
178
 
156
179
  **v0.3.3** — 已发布到 npm,版本号为 `@mcptoolshop/research-os@0.3.3`,发布日期:2026年5月10日。此版本改进了“门”机制的语义清晰度,这是Pack-3(Godot导出/运行时稳定性,实验3的第3个包)所取得的成果。现在,“门”的输出结果除了包含整个包的计数外,还包含按“门”划分的发布者和主要计数(F-43);`no_source_cluster_monopoly` 的警告信息已更改为信息性诊断信息(F-41)。**通过/失败的行为未改变;现有的冻结包在字节级别上进行验证。** 570/570 个 vitest 测试通过。请参阅 [CHANGELOG.md](CHANGELOG.md) 和 [`docs/section-scoped-waivers.md`](docs/section-scoped-waivers.md)。
157
180