npm - @pharmatools/opengate - Versions diffs - 0.1.0 - Mend

@pharmatools/opengate 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (44)

package/ADAPTERS.md +145 -0
package/LICENSE +21 -0
package/README.md +200 -0
package/datasets/LABELING-GUIDE.md +65 -0
package/datasets/SCHEMA.md +17 -0
package/datasets/cases/_template.json +27 -0
package/datasets/cases/case-cardio-bp.json +23 -0
package/datasets/cases/case-cardio-hf.json +21 -0
package/datasets/cases/case-cardio-lipids.json +41 -0
package/datasets/cases/case-derm-psoriasis.json +22 -0
package/datasets/cases/case-diabetes-glp1.json +33 -0
package/datasets/cases/case-endo-thyroid.json +18 -0
package/datasets/cases/case-gastro-ibd.json +21 -0
package/datasets/cases/case-hema-anemia.json +19 -0
package/datasets/cases/case-hep-nash.json +19 -0
package/datasets/cases/case-hiv-prep.json +18 -0
package/datasets/cases/case-id-antibiotic.json +18 -0
package/datasets/cases/case-nephro-ckd.json +21 -0
package/datasets/cases/case-neuro-migraine.json +39 -0
package/datasets/cases/case-neuro-ms.json +21 -0
package/datasets/cases/case-onco-pdl1.json +21 -0
package/datasets/cases/case-oncology-pembro.json +31 -0
package/datasets/cases/case-osteo-fracture.json +19 -0
package/datasets/cases/case-pain-oa.json +21 -0
package/datasets/cases/case-psych-depression.json +19 -0
package/datasets/cases/case-pulm-ipf.json +18 -0
package/datasets/cases/case-resp-copd.json +35 -0
package/datasets/cases/case-rheum-jak.json +19 -0
package/datasets/cases/case-vaccine.json +19 -0
package/datasets/cases/case-womens-pcos.json +21 -0
package/datasets/cases/seed-teprotumumab.json +37 -0
package/datasets/fixtures/citation-styles.json +22 -0
package/opengate.http.example.json +12 -0
package/package.json +55 -0
package/src/adapters/http.mjs +132 -0
package/src/adapters/refcheckr.mjs +109 -0
package/src/index.mjs +29 -0
package/src/lib/adapter.mjs +78 -0
package/src/lib/citations.mjs +109 -0
package/src/lib/metrics.mjs +104 -0
package/src/runner.mjs +223 -0
package/src/scorers/citation-detection.mjs +69 -0
package/src/scorers/claim-extraction.mjs +75 -0
package/src/scorers/verdict-accuracy.mjs +163 -0

package/ADAPTERS.md ADDED

@@ -0,0 +1,145 @@
+# Writing an adapter
+An adapter is the boundary between OpenGATE's scorers and the system under test. Scorers never make network calls directly — they call the adapter, so evaluating a new system means writing one file (or, for REST-backed systems, none: see the generic HTTP adapter below). The bundled RefCheckr adapter (`src/adapters/refcheckr.mjs`) is the reference implementation.
+**The methodology travels; only the gold set changes** — and the adapter is how it travels.
+## Selecting an adapter
+```bash
+# default: the bundled RefCheckr adapter
+npm run eval:online
+# your own, via flag or environment variable
+node src/runner.mjs --online --adapter ./adapters/my-system.mjs
+OPENGATE_ADAPTER=./adapters/my-system.mjs npm run eval:online
+```
+The `--adapter` flag takes precedence over `OPENGATE_ADAPTER`. Relative paths resolve from the current working directory. The adapter is loaded and validated before any scorer runs — a malformed adapter fails fast with a message listing every missing or mistyped export, even on offline runs.
+## No-code option: the generic HTTP adapter
+If your system already exposes (or can expose) endpoints speaking the contract shapes below, you don't need to write code. Describe the transport in a JSON config:
+```bash
+cp opengate.http.example.json opengate.http.json   # edit paths/headers
+node src/runner.mjs --online --adapter ./src/adapters/http.mjs
+```
+```json
+{
+  "name": "my-system",
+  "baseUrl": "${MY_SYSTEM_URL}",
+  "headers": { "Authorization": "Bearer ${MY_SYSTEM_TOKEN}" },
+  "endpoints": {
+    "splitClaims": "/api/claims/split",
+    "analyzeBatch": "/api/verify/batch"
+  },
+  "modelEnv": "MY_SYSTEM_MODEL"
+}
+```
+`${VAR}` placeholders are interpolated from the environment at load time; the config path defaults to `./opengate.http.json` and can be set with `OPENGATE_HTTP_CONFIG`. Latency and token capture are built in. The HTTP adapter maps *transport, not payloads* — if your API uses different request/response shapes, write a thin code adapter instead.
+## The contract
+### Required exports
+```js
+/**
+ * Extract candidate claims from a document.
+ * @param {string} text — the source document (e.g. a manuscript section)
+ * @returns {Promise<{ claims: Array<string | { text, originalText?, citations? }> }>}
+ */
+export function splitClaims(text) {}
+/**
+ * Verify claims against reference documents.
+ * @param {object} payload
+ *   { claims: string[],
+ *     documents: [{ name, type: 'text', content }],
+ *     citationMapping: { [citation]: { refId } },
+ *     claimCitations: [{ citations: string[] }],
+ *     document_name: string }
+ * @returns {Promise<{ claims: [{ individual_analyses: [{ document_name, verdict, summary, passages }] }],
+ *                     usage?: { prompt_tokens, completion_tokens, reasoning_tokens, total_tokens } }>}
+ *   `verdict` must be one of the six values in VERDICT_SCALE (src/lib/metrics.mjs):
+ *   strong_support · partial_support · implied_by_data · overclaim · not_supported · contradicted
+ */
+export function analyzeBatch(payload) {}
+/** True when the adapter has the config it needs (URLs, tokens, …). */
+export function onlineAvailable() {}
+/** Human-readable hint shown when onlineAvailable() is false. */
+export function onlineConfigHint() {}
+```
+### Optional exports
+If absent, the loader supplies no-op defaults, so scorers can call these unconditionally. Implement them to get latency, cost, and model-comparison columns in your scorecards.
+```js
+/** Display name for the scorecard header; defaults to the filename. */
+export const meta = { name: 'my-system' };
+/** Label for which model the deployment ran (e.g. from an env var); default null. */
+export function runModel() {}
+/** Clear latency capture; called at the start of a scorer run. */
+export function resetTiming() {}
+/** Wall-clock ms for each call made since resetTiming(); default []. */
+export function callLatencies() {}
+/** Clear token capture; called at the start of a scorer run. */
+export function resetTokens() {}
+/** Aggregated token usage since resetTokens();
+ *  default { calls: 0, prompt_tokens: 0, completion_tokens: 0, reasoning_tokens: 0, total_tokens: 0 }. */
+export function tokenTotals() {}
+```
+## Minimal skeleton
+```js
+// adapters/my-system.mjs
+export const meta = { name: 'my-system' };
+const BASE = process.env.MY_SYSTEM_URL;
+export const onlineAvailable = () => Boolean(BASE);
+export const onlineConfigHint = () => 'Set MY_SYSTEM_URL to run online scorers.';
+export async function splitClaims(text) {
+  const res = await fetch(`${BASE}/extract`, {
+    method: 'POST',
+    headers: { 'Content-Type': 'application/json' },
+    body: JSON.stringify({ text }),
+  });
+  // Surface HTTP errors here rather than failing later on an unparseable body.
+  if (!res.ok) throw new Error(`splitClaims failed: HTTP ${res.status}`);
+  const data = await res.json();
+  return { claims: data.claims }; // map your shape to the contract
+}
+export async function analyzeBatch(payload) {
+  const res = await fetch(`${BASE}/verify`, {
+    method: 'POST',
+    headers: { 'Content-Type': 'application/json' },
+    body: JSON.stringify(payload),
+  });
+  if (!res.ok) throw new Error(`analyzeBatch failed: HTTP ${res.status}`);
+  return res.json(); // must include claims[].individual_analyses[].verdict
+}
+```
+Run it:
+```bash
+OPENGATE_ADAPTER=./adapters/my-system.mjs npm run eval:online
+```
+## Notes
+- **Verdict mapping.** If your system uses a different verdict taxonomy, map it to the six-point scale inside the adapter. Adjacency accuracy assumes the scale's ordering (see `datasets/LABELING-GUIDE.md`).
+- **Gold set.** The bundled cases are biomedical. For your domain, author your own cases in `datasets/cases/` — the format spec is `datasets/SCHEMA.md`.
+- **Timing/token capture.** See the reference adapter for a pattern: wrap your HTTP helper to record per-call latency and any `usage` block the server returns.
+- **Stability.** The adapter surface may still shift pre-1.0; `validateAdapter()` in `src/lib/adapter.mjs` is the source of truth.
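A minimal sketch of the verdict-mapping note in ADAPTERS.md, assuming a hypothetical upstream taxonomy (`SUPPORTED`, `EXAGGERATED`, and so on). The upstream labels and the helper names `toOpenGateVerdict` and `mapResponse` are illustrative and do not come from the package; only the six target verdicts and the `claims[].individual_analyses[].verdict` response shape are taken from the contract.

```javascript
// The six verdicts named in the adapter contract.
const VERDICTS = [
  'strong_support', 'partial_support', 'implied_by_data',
  'overclaim', 'not_supported', 'contradicted',
];

// Hypothetical labels emitted by the system under test (assumption).
const VERDICT_MAP = {
  SUPPORTED: 'strong_support',
  PARTIALLY_SUPPORTED: 'partial_support',
  DATA_ONLY: 'implied_by_data',
  EXAGGERATED: 'overclaim',
  NO_EVIDENCE: 'not_supported',
  REFUTED: 'contradicted',
};

// Translate one upstream label; fail loudly on anything unmapped so a
// taxonomy change upstream cannot silently skew the scores.
function toOpenGateVerdict(label) {
  const mapped = VERDICT_MAP[label];
  if (!VERDICTS.includes(mapped)) {
    throw new Error(`Unmapped verdict label: ${label}`);
  }
  return mapped;
}

// Rewrite every verdict in a raw response, leaving other fields untouched.
function mapResponse(raw) {
  return {
    ...raw,
    claims: raw.claims.map((claim) => ({
      ...claim,
      individual_analyses: claim.individual_analyses.map((analysis) => ({
        ...analysis,
        verdict: toOpenGateVerdict(analysis.verdict),
      })),
    })),
  };
}
```

In an adapter, `analyzeBatch` could return `mapResponse(await res.json())` so that scorers only ever see the six-point scale.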

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Nick Lamb (PharmaTools.AI)
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md ADDED

@@ -0,0 +1,200 @@
+# OpenGATE
+[![CI](https://github.com/nickjlamb/opengate/actions/workflows/ci.yml/badge.svg)](https://github.com/nickjlamb/opengate/actions/workflows/ci.yml)
+**Evidence over plausibility.**
+OpenGATE is an open-source framework for evaluating evidence-grounded AI systems — systems that must justify every answer from underlying source documents. It measures one thing above all: **can the system prove its answer from the source material?**
+As AI moves into high-stakes domains — healthcare, scientific publishing, regulatory, legal, finance — evaluation is becoming as fundamental as automated testing is in traditional software engineering. OpenGATE turns grounding failures into numbers you can track, and gates every prompt, model, or workflow change against a baseline so reliability can't quietly regress.
+Originally developed to power [RefCheckr](https://www.pharmatools.ai/refcheckr). Designed to evaluate any AI system built on retrieved documents or reference material.
+## Why not DeepEval?
+Use both. General-purpose frameworks such as DeepEval and OpenAI Evals evaluate AI systems in general. OpenGATE specialises in systems that must justify every answer from evidence:
+- **Provenance is first-class** — does the cited passage actually exist, verbatim, in the source?
+- **Citation fidelity is first-class** — are the right citations detected and mapped to the right references?
+- **Regression detection is first-class** — every run is diffed against a baseline; drops fail the build.
+If your system's core promise is *grounded* answers, these aren't plugins — they're the whole evaluation. That's the niche OpenGATE fills.
+## First evaluation in 60 seconds
+No API key needed — the offline suite tests deterministic logic against the bundled gold set:
+```bash
+npx @pharmatools/opengate
+```
+Or from a clone:
+```bash
+git clone https://github.com/nickjlamb/opengate.git
+cd opengate
+npm run eval
+```
+For programmatic use, the metric and citation primitives are importable directly:
+```js
+import { detectCitations, verdictAccuracy, precisionRecallF1 } from '@pharmatools/opengate';
+```
+Sample output of the offline run:
+```
+OpenGATE — 25 case(s), online=false, sha=eff971b
+  ✓ citation-detection   PASS
+      perClaim_exactSetRate      100.0%
+      perClaim_jaccardMean       100.0%
+      supportedStyle_accuracy    100.0%
+  ⊘ claim-extraction     SKIPPED — online scorer (pass --online)
+  ⊘ verdict-accuracy     SKIPPED — online scorer (pass --online)
+```
+To run the online scorers against a live system (bundled RefCheckr adapter):
+```bash
+export REFCHECKR_BASE_URL=http://localhost:3848
+export REFCHECKR_TOKEN=<a valid auth token>
+export OPENGATE_EVAL_REPEATS=3        # optional: measure verdict consistency
+export OPENGATE_EVAL_MODEL=sonar-pro  # optional: label the model in the scorecard
+npm run eval:online
+npm run eval:baseline   # save current run as the regression reference
+npm run eval:ci         # exit non-zero on any failure or metric regression
+```
+## Architecture
+```
+                        ┌─────────────────────────────────┐
+                        │            OpenGATE             │
+                        │                                 │
+                        │  benchmark datasets (gold sets) │
+                        │  scorers (one per metric family)│
+                        │  scorecards (versioned, on disk)│
+                        │  regression gate (CI)           │
+                        └───────────────┬─────────────────┘
+                                        │ adapters
+                 ┌──────────────┬───────┴──────┬──────────────┐
+                 ▼              ▼              ▼              ▼
+             RefCheckr     Patiently AI     Redacta      your system
+          (first impl.)     (planned)      (planned)    (write an adapter)
+```
+Where it sits in the development loop:
+```
+change a prompt, model, or pipeline
+              ↓
+        run OpenGATE
+              ↓
+     metrics vs baseline?
+        ↙          ↘
+  ▲ improved      ▼ regressed
+    deploy         investigate (build fails in CI)
+```
+## Core concepts
+**Gold cases** — hand-labelled benchmark cases (`datasets/cases/`): source text, the claims that should be extracted, the sentences that should *not* be, and reference snippets with known-correct verdicts. Copy `_template.json` to add one; format spec in `datasets/SCHEMA.md`, labelling rules in `datasets/LABELING-GUIDE.md`.
+**Scorers** — one module per metric family (`src/scorers/`):
+| Scorer | Mode | Metrics |
+|---|---|---|
+| `citation-detection` | offline | per-claim citation set exact-match & Jaccard; supported-style accuracy; tracked known-gap styles |
+| `claim-extraction` | online | precision / recall / F1 vs gold; non-claim leakage; citation agreement; **fidelity** (extracted claim is verbatim from source) |
+| `verdict-accuracy` | online | exact & adjacency accuracy on a six-point support scale; confusion matrix; **passage hallucination rate**; consistency across repeats; per-claim latency (p50/p95) and token usage for real cost/claim |
+Offline scorers run with no API key — fast enough for every commit. Online scorers exercise a live system through an adapter.
+**Scorecards** — every run writes `results/<timestamp>.json` stamped with the git SHA, so any result is reproducible and auditable. Per-model runs carry a `run_model` label, turning the results directory into a measured model comparison (accuracy × hallucination × latency × cost).
+**Regression gate** — `--baseline` saves a reference; subsequent runs print per-metric deltas (▲/▼ in percentage points) and `--ci` fails the build on any drop. No change ships without proving it didn't make the system less reliable.
+## Adapters: evaluating your own system
+Scorers never talk to a system directly — they go through an adapter, injected by the runner. The bundled `src/adapters/refcheckr.mjs` is the reference implementation; select your own with:
+```bash
+OPENGATE_ADAPTER=./adapters/my-system.mjs npm run eval:online
+```
+An adapter is one file with four required exports — `splitClaims(text)`, `analyzeBatch(payload)`, `onlineAvailable()`, `onlineConfigHint()` — plus optional timing/token/model-label hooks that unlock latency and cost columns in the scorecard. Adapters are validated at load: a malformed one fails fast with a message listing every missing export.
+For REST-backed systems there's a no-code path: the bundled **generic HTTP adapter** (`src/adapters/http.mjs`) reads endpoint paths and headers from `opengate.http.json` (see `opengate.http.example.json`), with `${ENV}` interpolation and built-in latency/token capture.
+Full contract, minimal skeleton, and verdict-mapping notes: **[ADAPTERS.md](ADAPTERS.md)**.
+**The methodology travels; only the gold set changes.**
+## CI: the GitHub Action
+Use OpenGATE as a drop-in regression gate in any repository. Keep your gold set and committed `baseline.json` in your own tree; any metric that drops fails the build:
+```yaml
+- uses: nickjlamb/opengate@main
+  with:
+    datasets: ./evals/datasets      # your cases/ + fixtures/
+    results: ./evals/results        # where baseline.json lives
+    adapter: ./evals/my-adapter.mjs # or the bundled HTTP adapter
+    online: 'true'
+  env:
+    MY_SYSTEM_URL: ${{ vars.MY_SYSTEM_URL }}
+    MY_SYSTEM_TOKEN: ${{ secrets.MY_SYSTEM_TOKEN }}
+```
+All inputs are optional — with none, it runs the offline suite against the bundled gold set. The same overrides work locally: `--datasets <dir>` and `--results <dir>` (or `OPENGATE_DATASETS` / `OPENGATE_RESULTS`).
+## Proven in production
+Run against RefCheckr's gold set, OpenGATE:
+- surfaced a **silent parse-failure mode** affecting ~50% of multi-claim verdicts, eliminated with enforced structured output (→ 0);
+- **halved passage hallucination** (5.8% → 2.4%) by driving a measured production model change — a decision made on numbers, not reputation;
+- holds claim extraction at **~0.95 F1** with near-full recall.
+Full methodology and the model comparison: [how RefCheckr is evaluated](https://www.pharmatools.ai/refcheckr-eval).
+## Layout
+```
+opengate/
+  datasets/
+    cases/        gold-labelled source sections (copy _template.json to add)
+    fixtures/     citation-style coverage fixture
+    SCHEMA.md     case format spec
+    LABELING-GUIDE.md
+  src/
+    lib/          metrics.mjs (PRF1, Jaccard, verdict accuracy, confusion matrix)
+                  citations.mjs (reference impl of deterministic citation logic)
+    scorers/      one file per metric family
+    adapters/     system-under-test boundary (refcheckr.mjs is the reference)
+    runner.mjs    CLI: discover cases → run scorers → report → snapshot → regression-check
+  results/        timestamped run snapshots + baseline.json
+```
+## Roadmap
+- **Second adapter** — Patiently AI (faithfulness evaluation for patient-language simplification)
+- **Author-year in RefCheckr production** — `detectAuthorYear()` now lands "Smith 2020"-style keys in the reference implementation; adopting them in RefCheckr's numeric-keyed citation mapping is tracked separately
+- **Number-adjacent superscript** — `week 24.1` is genuinely ambiguous with decimals; remains a tracked known gap
+- **Growing gold set** — more domains, all six verdict types, real-world reference material
+- **Stable adapter surface** — the contract may still shift pre-1.0; semver will signal breaking changes
+## Contributing
+Contributions are welcome, particularly:
+- **Gold cases** — new domains and citation styles (see `datasets/LABELING-GUIDE.md`)
+- **Adapters** — connect OpenGATE to your evidence-grounded system
+- **Scorers** — new metric families that fit the gold-case format
+Open an issue to discuss before large changes. Interfaces — particularly the adapter surface — may still shift pre-1.0.
+## License
+[MIT](LICENSE) — because evaluation frameworks shouldn't be black boxes. If an evaluation influences deployment decisions, engineers should be able to inspect every scorer, metric, and benchmark.
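The README's per-claim Jaccard metric and its regression gate can be sketched as below. This is an illustration under stated assumptions, not the package's implementation: the function names `jaccard` and `findRegressions` are hypothetical, metrics are assumed to be fractions between 0 and 1 keyed by name, two empty citation sets are assumed to count as a perfect match, and every metric is treated as higher-is-better (a real gate would have to invert rates such as passage hallucination).

```javascript
// Jaccard similarity between the predicted and gold citation sets of a claim.
function jaccard(predicted, gold) {
  const a = new Set(predicted);
  const b = new Set(gold);
  if (a.size === 0 && b.size === 0) return 1; // assumption: empty vs empty is a match
  let intersection = 0;
  for (const item of a) {
    if (b.has(item)) intersection += 1;
  }
  return intersection / (a.size + b.size - intersection);
}

// Compare a run against a baseline and return the metrics that dropped,
// with the change expressed in percentage points.
function findRegressions(baseline, current) {
  return Object.keys(baseline)
    .filter((metric) => metric in current)
    .map((metric) => ({
      metric,
      deltaPp: (current[metric] - baseline[metric]) * 100,
    }))
    .filter((entry) => entry.deltaPp < 0);
}
```

A CI wrapper could exit non-zero whenever `findRegressions(...)` returns a non-empty array, which mirrors the behaviour the README describes for `--ci`.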

package/datasets/LABELING-GUIDE.md ADDED

@@ -0,0 +1,65 @@
+# Verdict labeling guide
+The rubric every gold case is labeled against. The goal is **one consistent, defensible standard** so the benchmark measures the system, not the annotator's mood. Each `goldVerdict` is a judgment of one claim against **one specific reference snippet** — never against real-world truth or the wider literature. The only question is: *does this reference support this claim, and how?*
+## The six verdicts
+| Verdict | Use when |
+|---|---|
+| **strong_support** | The reference **explicitly states** the claim in narrative text, and the claim does not exceed it. There is at least one verbatim sentence a reviewer could quote. |
+| **partial_support** | **Part** of the claim is explicitly supported; another part is not addressed or only weakly supported. Typical for compound claims (A and B, where A holds and B doesn't). |
+| **implied_by_data** | The claim follows from **data** (a table, figure, or numbers) but **no narrative sentence states it**. The numbers clearly imply the claim. |
+| **overclaim** | The reference supports a **weaker or narrower** version of the claim, but the claim **overstates** it — asserting significance the reference doesn't establish, a larger magnitude, a broader population/endpoint, or greater certainty — **and the reference does not explicitly deny the overstated part**. Direction is right; strength/scope is too much. |
+| **not_supported** | The reference **does not address** the claim — no evidence for or against. Includes "population excluded / endpoint not evaluated / not reported." |
+| **contradicted** | The reference **explicitly states the opposite** of, or directly negates, the claim — including explicitly saying an effect was **not** significant when the claim asserts significance, or that an outcome did **not** occur when the claim asserts it did. |
+## Decision rules (the tie-breakers)
+These are the boundaries the v0.1 run showed were fuzzy. Apply them in order.
+1. **overclaim vs contradicted** — *Did the reference explicitly negate the over-asserted element?*
+   - The reference says the opposite of the specific assertion (e.g. claim: "significantly reduced X"; reference: "X was **not** significantly reduced") → **contradicted**.
+   - The reference is merely **silent** on the over-asserted element (e.g. claim asserts significance; reference reports the effect size but never mentions significance) → **overclaim**.
+2. **not_supported vs contradicted** — *Is the reference silent, or opposite?*
+   - Silent on the claim → **not_supported**.
+   - States the opposite → **contradicted**.
+3. **not_supported vs overclaim** — *Is there support for even a weaker version?*
+   - No support for any version of the claim → **not_supported**.
+   - Supports a weaker/narrower version but the claim overstates it → **overclaim**.
+4. **strong_support vs partial_support** — Entire claim explicitly supported → **strong**. Only part → **partial**.
+5. **implied_by_data vs strong_support** — A narrative sentence states it → **strong**. Only the data/table shows it → **implied_by_data**.
+## Worked examples
+- Claim "reduced LDL by 58% versus placebo" + ref "reduced LDL by a mean of 58% (p<0.001)" → **strong_support** (explicit, not exceeded).
+- Claim "**significantly** reduced cardiovascular death" + ref "cardiovascular death alone was **not significantly** reduced" → **contradicted** (explicit negation of the significance claim — rule 1).
+- Claim "X **significantly** reduced hospitalisations" + ref "hospitalisations were 18% lower with X" (no p-value, significance never mentioned) → **overclaim** (asserts significance the ref doesn't establish and doesn't deny — rule 1).
+- Claim "improved renal function" + ref "did not assess renal endpoints" → **not_supported** (silent — rule 2).
+- Claim "effective in **all** patients" + ref "effective in patients with **moderate-to-severe** disease" → **overclaim** (scope broadened beyond the reference — rule 3).
+- Claim "reduced HbA1c more than placebo" + ref = data table only (no narrative sentence) → **implied_by_data** (rule 5).
+- Claim "reduced body weight **and** blood pressure" + ref supports weight (p<0.001) but BP "not significant" → **partial_support** (one element holds — rule 4).
+## Ordinal scale (for adjacency scoring)
+Verdicts are ordered by **how well the claim, as written, is borne out by the reference** — used only to compute the softer "off-by-one" adjacency metric. Exact-match remains the primary metric.
+```
+strong_support > partial_support > implied_by_data > overclaim > not_supported > contradicted
+```
+Rationale: `implied_by_data` means the claim *is* supported (just via data), so it ranks above `overclaim` (partly true but overstated), which ranks above `not_supported` (no support), which ranks above `contradicted` (actively refuted). This is a judgment call — it is documented here so the adjacency metric is reproducible, not silent.
+## Per-verdict gold-case field
+Each `goldVerdict` may carry a short `rationale` so a reviewer can verify the label fast:
+```json
+{ "claimText": "...", "citation": 1, "verdict": "overclaim",
+  "rationale": "Ref reports an 18% reduction but never states significance; claim asserts 'significantly' — overstated, not denied (rule 1)." }
+```
+The scorer ignores `rationale`; it exists purely for human verification and provenance.
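The ordinal scale in the labeling guide lends itself to a short sketch of exact and adjacency scoring. The helper name `scoreVerdicts` and the `{ gold, predicted }` pair shape are hypothetical, and the sketch assumes that "adjacent" means at most one step apart on the scale, so exact matches also count as adjacent. The package's own `verdictAccuracy` may differ in both signature and edge-case handling.

```javascript
// Ordinal scale from the labeling guide, strongest support first.
const SCALE = [
  'strong_support', 'partial_support', 'implied_by_data',
  'overclaim', 'not_supported', 'contradicted',
];

// Score a list of { gold, predicted } verdict pairs.
// exact:    share of pairs with identical verdicts
// adjacent: share of pairs at most one step apart on SCALE (assumption)
function scoreVerdicts(pairs) {
  let exact = 0;
  let adjacent = 0;
  for (const { gold, predicted } of pairs) {
    const g = SCALE.indexOf(gold);
    const p = SCALE.indexOf(predicted);
    if (g === -1 || p === -1) {
      throw new Error(`Unknown verdict in pair: ${gold} / ${predicted}`);
    }
    if (g === p) exact += 1;
    if (Math.abs(g - p) <= 1) adjacent += 1;
  }
  const n = pairs.length;
  return { exact: n ? exact / n : 0, adjacent: n ? adjacent / n : 0 };
}
```

Under this reading, predicting `not_supported` for a gold `overclaim` is an exact miss but an adjacency hit, while predicting `contradicted` for a gold `strong_support` misses on both.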

package/datasets/SCHEMA.md ADDED

@@ -0,0 +1,17 @@
+# Gold case schema
+One JSON file per case in `datasets/cases/` (files starting with `_` are ignored, e.g. `_template.json`).
+| Field | Required | Used by | Meaning |
+|---|---|---|---|
+| `id` | yes | all | Unique slug. |
+| `title` / `notes` | no | — | Human context / provenance. |
+| `manuscript` | yes | claim-extraction | The pasted section, with citation markers exactly as authored. |
+| `goldClaims[]` | yes | citation-detection, claim-extraction | The verifiable claims a reviewer should extract. Each has `originalText` (with markers), `text` (clean), and `citations` (number array). |
+| `goldNonClaims[]` | no | claim-extraction | Sentences that should **not** be extracted (background, aims, transitions). Drives precision/leakage. |
+| `references{}` | online only | verdict-accuracy | Map of citation-number → `{ name, text }`. |
+| `goldVerdicts[]` | online only | verdict-accuracy | `{ claimText, citation, verdict }` where `verdict` ∈ the six-point scale. Mark `_requires: "online"`. |
+Verdict scale (ordered, strongest support → strongest refutation): `strong_support`, `partial_support`, `implied_by_data`, `overclaim`, `not_supported`, `contradicted`.
+Offline scorers need only `manuscript` + `goldClaims`. Reference texts and gold verdicts are required solely for the online verdict scorer.
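The field table in SCHEMA.md can be turned into a small pre-flight check for new gold cases. The sketch below is hypothetical: `validateGoldCase` is not a function the package is known to export, and the rule that every `goldVerdicts[].citation` must have a matching key in `references` is an assumption drawn from how the bundled cases are written rather than from the schema text.

```javascript
const VERDICTS = new Set([
  'strong_support', 'partial_support', 'implied_by_data',
  'overclaim', 'not_supported', 'contradicted',
]);

// Return a list of human-readable problems; an empty list means the case
// satisfies the required fields in SCHEMA.md.
function validateGoldCase(goldCase) {
  const errors = [];
  if (typeof goldCase.id !== 'string' || goldCase.id === '') {
    errors.push('id: required non-empty string');
  }
  if (typeof goldCase.manuscript !== 'string') {
    errors.push('manuscript: required string');
  }
  if (!Array.isArray(goldCase.goldClaims)) {
    errors.push('goldClaims: required array');
  } else {
    goldCase.goldClaims.forEach((claim, i) => {
      if (typeof claim.originalText !== 'string') errors.push(`goldClaims[${i}].originalText: required string`);
      if (typeof claim.text !== 'string') errors.push(`goldClaims[${i}].text: required string`);
      const okCitations = Array.isArray(claim.citations)
        && claim.citations.every((c) => Number.isInteger(c));
      if (!okCitations) errors.push(`goldClaims[${i}].citations: required number array`);
    });
  }
  // Online-only fields: checked only when gold verdicts are present.
  const references = goldCase.references ?? {};
  (goldCase.goldVerdicts ?? []).forEach((entry, i) => {
    if (!VERDICTS.has(entry.verdict)) {
      errors.push(`goldVerdicts[${i}].verdict: not on the six-point scale`);
    }
    // Assumption: each verdict cites a reference defined in this case.
    if (!Object.prototype.hasOwnProperty.call(references, String(entry.citation))) {
      errors.push(`goldVerdicts[${i}].citation: no matching reference`);
    }
  });
  return errors;
}
```

Running a check like this over `datasets/cases/` before committing a new case would catch typos in verdict names early, instead of during an online run.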

package/datasets/cases/_template.json ADDED

@@ -0,0 +1,27 @@
+{
+  "id": "REPLACE-with-unique-id",
+  "title": "Short human-readable title",
+  "notes": "Where this case came from, what it stresses, any caveats.",
+  "manuscript": "Paste the manuscript section here, with citation markers exactly as the author wrote them.",
+  "goldClaims": [
+    {
+      "originalText": "claim sentence WITH its citation marker, e.g. ...placebo.1,2",
+      "text": "claim sentence WITHOUT citation markers (clean form used for verification)",
+      "citations": [1, 2]
+    }
+  ],
+  "goldNonClaims": [
+    "Sentences a reviewer should NOT extract (background, aims, transitions). Used for precision."
+  ],
+  "references": {
+    "1": { "name": "First-author Year", "text": "Optional: reference text/snippet. Needed only for online verdict accuracy." }
+  },
+  "goldVerdicts": [
+    { "claimText": "clean claim text", "citation": 1, "verdict": "strong_support|partial_support|implied_by_data|overclaim|not_supported|contradicted", "rationale": "one line citing the LABELING-GUIDE.md rule that fixes this verdict", "_requires": "online" }
+  ]
+}

package/datasets/cases/case-cardio-bp.json ADDED

@@ -0,0 +1,23 @@
+{
+  "id": "cardio-bp",
+  "title": "Blood pressure — clean overclaim (significance not stated)",
+  "notes": "Synthetic; controlled ground truth. Overclaim exemplar: claim asserts significance the reference reports an effect size for but never establishes. VERIFY before publishing.",
+  "manuscript": "Drug Z significantly lowered systolic blood pressure versus placebo.1 Stroke risk was reduced with Drug Z.1 Serious adverse events were comparable between groups.2 This post hoc analysis explored blood-pressure and safety outcomes.",
+  "goldClaims": [
+    { "originalText": "Drug Z significantly lowered systolic blood pressure versus placebo.1", "text": "Drug Z significantly lowered systolic blood pressure versus placebo", "citations": [1] },
+    { "originalText": "Stroke risk was reduced with Drug Z.1", "text": "Stroke risk was reduced with Drug Z", "citations": [1] },
+    { "originalText": "Serious adverse events were comparable between groups.2", "text": "Serious adverse events were comparable between groups", "citations": [2] }
+  ],
+  "goldNonClaims": [
+    "This post hoc analysis explored blood-pressure and safety outcomes."
+  ],
+  "references": {
+    "1": { "name": "ASCEND-BP primary (Khan 2024)", "text": "In this 12-week randomised trial, 620 patients with hypertension received Drug Z or placebo; the primary endpoint was change in office systolic blood pressure. Drug Z lowered systolic blood pressure by a mean of 12 mmHg compared with placebo. Point estimates and confidence intervals for the primary endpoint are reported in the supplementary appendix. This was a short-term, surrogate-endpoint study and was not designed to assess cardiovascular outcomes such as stroke or myocardial infarction." },
+    "2": { "name": "ASCEND-BP safety", "text": "Safety was assessed in all randomised patients. Serious adverse events occurred in 4.1% of patients receiving Drug Z and 4.3% receiving placebo (p=0.78). The most common adverse events were headache and dizziness, with similar frequencies in both groups. Overall tolerability was comparable between Drug Z and placebo." }
+  },
+  "goldVerdicts": [
+    { "claimText": "Drug Z significantly lowered systolic blood pressure versus placebo", "citation": 1, "verdict": "overclaim", "rationale": "Ref reports a 12 mmHg reduction but states no significance; claim asserts 'significantly' — overstated, not denied (rule 1)." },
+    { "claimText": "Stroke risk was reduced with Drug Z", "citation": 1, "verdict": "not_supported", "rationale": "Ref says the trial was not designed to assess stroke — no support for any version of the stroke claim (rule 3)." },
+    { "claimText": "Serious adverse events were comparable between groups", "citation": 2, "verdict": "strong_support", "rationale": "Ref states SAE 4.1% vs 4.3% (p=0.78) and calls tolerability comparable — explicit." }
+  ]
+}

package/datasets/cases/case-cardio-hf.json ADDED

@@ -0,0 +1,21 @@
+{
+  "id": "cardio-hf",
+  "title": "Heart failure — implied, partial, not-supported",
+  "notes": "Synthetic; controlled ground truth. VERIFY before publishing.",
+  "manuscript": "The drug lowered the rate of heart-failure hospitalisation versus placebo.1 The drug improved symptoms and reduced cardiovascular mortality.1 The drug improved renal outcomes.2",
+  "goldClaims": [
+    { "originalText": "The drug lowered the rate of heart-failure hospitalisation versus placebo.1", "text": "The drug lowered the rate of heart-failure hospitalisation versus placebo", "citations": [1] },
+    { "originalText": "The drug improved symptoms and reduced cardiovascular mortality.1", "text": "The drug improved symptoms and reduced cardiovascular mortality", "citations": [1] },
+    { "originalText": "The drug improved renal outcomes.2", "text": "The drug improved renal outcomes", "citations": [2] }
+  ],
+  "goldNonClaims": [],
+  "references": {
+    "1": { "name": "HF-CARE table (Bauer 2024)", "text": "In this event-driven trial, 3,800 patients with heart failure were randomised to the drug or placebo and followed for a median of 26 months. Efficacy outcomes are shown in Table 2. The rate of heart-failure hospitalisation was 9.1 per 100 patient-years with the drug and 14.7 with placebo. The KCCQ symptom score improved significantly with the drug versus placebo (p<0.001). Cardiovascular mortality did not differ significantly between the groups (hazard ratio 0.96; 95% CI 0.84-1.10)." },
+    "2": { "name": "HF-CARE renal", "text": "This pre-specified analysis focused on cardiovascular and symptom endpoints. Renal outcomes, including changes in eGFR and progression to dialysis, were not among the endpoints assessed in this study." }
+  },
+  "goldVerdicts": [
+    { "claimText": "The drug lowered the rate of heart-failure hospitalisation versus placebo", "citation": 1, "verdict": "implied_by_data", "rationale": "Support is a data table (9.1 vs 14.7 per 100 patient-years) with no narrative sentence stating it (rule 5)." },
+    { "claimText": "The drug improved symptoms and reduced cardiovascular mortality", "citation": 1, "verdict": "partial_support", "rationale": "Symptom (KCCQ) improvement is explicit and significant; CV mortality did not differ — compound claim, one half holds (rule 4)." },
+    { "claimText": "The drug improved renal outcomes", "citation": 2, "verdict": "not_supported", "rationale": "Ref says renal outcomes were not assessed — silent (rule 2)." }
+  ]
+}
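
The case file above cross-references itself in three places: each `goldClaims[].originalText` appears in `manuscript`, each citation number is a key in `references`, and each `goldVerdicts[].claimText` repeats a `goldClaims[].text`. A minimal sketch of a consistency check for those links follows. The type names and the `checkCase` helper are hypothetical, inferred from the JSON shown in this diff rather than taken from the package's source, and the sample is an abbreviated copy of the renal claim from case-cardio-hf.json.

```typescript
// Hypothetical types inferred from the case files in this diff;
// they are not part of the package's published API.
interface GoldClaim { originalText: string; text: string; citations: number[]; }
interface GoldVerdict { claimText: string; citation: number; verdict: string; rationale: string; }
interface CaseFile {
  id: string;
  manuscript: string;
  goldClaims: GoldClaim[];
  references: Record<string, { name: string; text: string }>;
  goldVerdicts: GoldVerdict[];
}

// Returns human-readable problems; an empty array means the
// cross-references inside the case line up.
function checkCase(c: CaseFile): string[] {
  const problems: string[] = [];
  for (const claim of c.goldClaims) {
    if (c.manuscript.indexOf(claim.originalText) === -1) {
      problems.push(`claim not found in manuscript: ${claim.text}`);
    }
    for (const n of claim.citations) {
      if (!(String(n) in c.references)) {
        problems.push(`claim cites missing reference ${n}: ${claim.text}`);
      }
    }
  }
  for (const v of c.goldVerdicts) {
    const matches = c.goldClaims.filter((g) => g.text === v.claimText);
    if (matches.length === 0) {
      problems.push(`verdict has no matching claim: ${v.claimText}`);
    } else if (matches[0].citations.indexOf(v.citation) === -1) {
      problems.push(`verdict cites ${v.citation}, which the claim does not: ${v.claimText}`);
    }
  }
  return problems;
}

// Abbreviated copy of the renal claim from case-cardio-hf.json.
const sample: CaseFile = {
  id: "cardio-hf",
  manuscript: "The drug improved renal outcomes.2",
  goldClaims: [
    { originalText: "The drug improved renal outcomes.2", text: "The drug improved renal outcomes", citations: [2] }
  ],
  references: {
    "2": { name: "HF-CARE renal", text: "Renal outcomes, including changes in eGFR and progression to dialysis, were not among the endpoints assessed in this study." }
  },
  goldVerdicts: [
    { claimText: "The drug improved renal outcomes", citation: 2, verdict: "not_supported", rationale: "Ref says renal outcomes were not assessed." }
  ]
};

// Same case with the verdict pointed at the wrong reference number.
const broken: CaseFile = {
  ...sample,
  goldVerdicts: [{ ...sample.goldVerdicts[0], citation: 1 }]
};

console.log(checkCase(sample));
console.log(checkCase(broken));
```

The check only compares strings and numbers inside one file; it says nothing about whether a gold verdict is clinically right.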

package/datasets/cases/case-cardio-lipids.json ADDED

@@ -0,0 +1,41 @@
+{
+  "id": "cardio-lipids",
+  "title": "Lipid lowering — strong support, contradicted, not-supported",
+  "notes": "Synthetic-but-realistic; controlled ground truth. VERIFY before publishing.",
+  "manuscript": "At week 24, treatment with alirocumab reduced LDL cholesterol by 58% versus placebo.1 Alirocumab significantly reduced the rate of cardiovascular death.1 Alirocumab also improved renal function in patients with chronic kidney disease.2 These results were observed across subgroups.",
+  "goldClaims": [
+    {
+      "originalText": "At week 24, treatment with alirocumab reduced LDL cholesterol by 58% versus placebo.1",
+      "text": "At week 24, treatment with alirocumab reduced LDL cholesterol by 58% versus placebo",
+      "citations": [1]
+    },
+    {
+      "originalText": "Alirocumab significantly reduced the rate of cardiovascular death.1",
+      "text": "Alirocumab significantly reduced the rate of cardiovascular death",
+      "citations": [1]
+    },
+    {
+      "originalText": "Alirocumab also improved renal function in patients with chronic kidney disease.2",
+      "text": "Alirocumab also improved renal function in patients with chronic kidney disease",
+      "citations": [2]
+    }
+  ],
+  "goldNonClaims": [
+    "These results were observed across subgroups."
+  ],
+  "references": {
+    "1": {
+      "name": "ODYSSEY-LDL (Roe 2023)",
+      "text": "At week 24, alirocumab reduced LDL cholesterol by a mean of 58% relative to placebo (p<0.001). For the composite of cardiovascular death, myocardial infarction, or stroke, a numerical reduction was observed (hazard ratio 0.85; 95% CI 0.72-1.01), but the difference was not statistically significant; cardiovascular death alone was not significantly reduced."
+    },
+    "2": {
+      "name": "ODYSSEY safety appendix",
+      "text": "Adverse events were balanced between groups. Injection-site reactions were more common with alirocumab. The safety analysis did not assess renal endpoints or kidney function."
+    }
+  },
+  "goldVerdicts": [
+    { "claimText": "At week 24, treatment with alirocumab reduced LDL cholesterol by 58% versus placebo", "citation": 1, "verdict": "strong_support", "rationale": "Ref states the 58% reduction with p<0.001 explicitly; claim does not exceed it." },
+    { "claimText": "Alirocumab significantly reduced the rate of cardiovascular death", "citation": 1, "verdict": "contradicted", "rationale": "Ref explicitly says cardiovascular death alone was NOT significantly reduced — direct negation of the significance assertion (rule 1)." },
+    { "claimText": "Alirocumab also improved renal function in patients with chronic kidney disease", "citation": 2, "verdict": "not_supported", "rationale": "Ref states renal endpoints were not assessed — silent on the claim (rule 2)." }
+  ]
+}

package/datasets/cases/case-derm-psoriasis.json ADDED

@@ -0,0 +1,22 @@
+{
+  "id": "derm-psoriasis",
+  "title": "Psoriasis — overclaim, implied, partial",
+  "notes": "Synthetic; controlled ground truth. VERIFY before publishing.",
+  "manuscript": "The biologic is effective for psoriasis of all severities.1 The biologic improved joint symptoms.2 The biologic improved nail and scalp psoriasis.3",
+  "goldClaims": [
+    { "originalText": "The biologic is effective for psoriasis of all severities.1", "text": "The biologic is effective for psoriasis of all severities", "citations": [1] },
+    { "originalText": "The biologic improved joint symptoms.2", "text": "The biologic improved joint symptoms", "citations": [2] },
+    { "originalText": "The biologic improved nail and scalp psoriasis.3", "text": "The biologic improved nail and scalp psoriasis", "citations": [3] }
+  ],
+  "goldNonClaims": [],
+  "references": {
+    "1": { "name": "CLEAR-PSO (Haddad 2024)", "text": "Among patients with moderate-to-severe plaque psoriasis, the biologic achieved skin clearance more often than placebo. Patients with mild psoriasis were not studied." },
+    "2": { "name": "CLEAR-PSO joint substudy", "text": "Table 4. Joint pain score change from baseline: biologic -3.1, placebo -0.8." },
+    "3": { "name": "CLEAR-PSO regional", "text": "Nail psoriasis severity improved significantly with the biologic versus placebo (p<0.01). Scalp involvement was not evaluated." }
+  },
+  "goldVerdicts": [
+    { "claimText": "The biologic is effective for psoriasis of all severities", "citation": 1, "verdict": "overclaim", "rationale": "Ref supports moderate-to-severe only and says mild was not studied; claim broadens scope to all severities (rule 3)." },
+    { "claimText": "The biologic improved joint symptoms", "citation": 2, "verdict": "implied_by_data", "rationale": "Support is a data table (joint pain -3.1 vs -0.8) with no narrative sentence (rule 5)." },
+    { "claimText": "The biologic improved nail and scalp psoriasis", "citation": 3, "verdict": "partial_support", "rationale": "Nail improvement is explicit; scalp involvement was not evaluated — compound claim, one half holds (rule 4)." }
+  ]
+}
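
Each case title lists the verdict labels the case is meant to exercise, so a quick tally of `goldVerdicts` is a useful sanity check. The sketch below counts labels and flags any that fall outside the six labels appearing in the cases shown in this diff. `SEEN_VERDICTS`, `tallyVerdicts`, and `unknownVerdicts` are hypothetical names, and the label list is an assumption drawn from these cases only; the package's own schema may define more.

```typescript
// Labels observed in the cases shown in this diff. This is an assumption,
// not the package's authoritative list.
const SEEN_VERDICTS: string[] = [
  "strong_support",
  "partial_support",
  "implied_by_data",
  "overclaim",
  "contradicted",
  "not_supported"
];

// Counts how many gold verdicts carry each label.
function tallyVerdicts(verdicts: { verdict: string }[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const v of verdicts) {
    counts[v.verdict] = (counts[v.verdict] || 0) + 1;
  }
  return counts;
}

// Returns labels that are not in SEEN_VERDICTS (for example, typos).
function unknownVerdicts(verdicts: { verdict: string }[]): string[] {
  return verdicts
    .map((v) => v.verdict)
    .filter((label) => SEEN_VERDICTS.indexOf(label) === -1);
}

// The three verdict labels from case-derm-psoriasis.json, in file order.
const psoriasisVerdicts = [
  { verdict: "overclaim" },
  { verdict: "implied_by_data" },
  { verdict: "partial_support" }
];

console.log(tallyVerdicts(psoriasisVerdicts));
console.log(unknownVerdicts(psoriasisVerdicts));
```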

package/datasets/cases/case-diabetes-glp1.json ADDED

@@ -0,0 +1,33 @@
+{
+  "id": "diabetes-glp1",
+  "title": "GLP-1 — implied-by-data and partial support",
+  "notes": "Reference 1 is data-only (table-style) with no narrative sentence stating the claim, to probe implied_by_data. Synthetic; VERIFY before publishing.",
+  "manuscript": "At week 30, semaglutide reduced HbA1c more than placebo.1 Semaglutide reduced body weight and systolic blood pressure.2",
+  "goldClaims": [
+    {
+      "originalText": "At week 30, semaglutide reduced HbA1c more than placebo.1",
+      "text": "At week 30, semaglutide reduced HbA1c more than placebo",
+      "citations": [1]
+    },
+    {
+      "originalText": "Semaglutide reduced body weight and systolic blood pressure.2",
+      "text": "Semaglutide reduced body weight and systolic blood pressure",
+      "citations": [2]
+    }
+  ],
+  "goldNonClaims": [],
+  "references": {
+    "1": {
+      "name": "SUSTAIN data table (Lin 2022)",
+      "text": "Table 2. Change from baseline at week 30. HbA1c (%): semaglutide -1.8, placebo -0.3. Fasting plasma glucose (mmol/L): semaglutide -2.6, placebo -0.4. Values are least-squares means."
+    },
+    "2": {
+      "name": "SUSTAIN secondary endpoints",
+      "text": "Semaglutide significantly reduced body weight compared with placebo (-4.5 kg vs -1.0 kg; p<0.001). Systolic blood pressure was numerically lower with semaglutide, but the change was not reported as statistically significant and confidence intervals were not provided."
+    }
+  },
+  "goldVerdicts": [
+    { "claimText": "At week 30, semaglutide reduced HbA1c more than placebo", "citation": 1, "verdict": "implied_by_data", "rationale": "Support is in a data table (-1.8 vs -0.3) with no narrative sentence stating the comparison (rule 5)." },
+    { "claimText": "Semaglutide reduced body weight and systolic blood pressure", "citation": 2, "verdict": "partial_support", "rationale": "Weight reduction is explicit and significant; SBP change is non-significant — compound claim, one half holds (rule 4)." }
+  ]
+}
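
In every claim shown in these cases, `text` equals `originalText` with the sentence-final period and the trailing citation number removed, and `citations` holds that number. A minimal sketch of that split follows, assuming the convention holds across the dataset. `splitClaim` is a hypothetical helper, not a function from the package, and the comma-separated form for multiple citations is a guess, since the cases above only show single citations.

```typescript
// Hypothetical helper: splits a manuscript sentence ending in ".<n>"
// into claim text and citation numbers. Sentences without a trailing
// citation marker are returned unchanged with an empty citation list.
function splitClaim(originalText: string): { text: string; citations: number[] } {
  const trimmed = originalText.trim();
  // Group 1: sentence body; group 2: digits after the final period,
  // optionally comma-separated (assumed form for multiple citations).
  const match = /^(.*?)\.(\d+(?:,\d+)*)$/.exec(trimmed);
  if (!match) {
    return { text: trimmed, citations: [] };
  }
  const citations = match[2].split(",").map((n) => parseInt(n, 10));
  return { text: match[1], citations: citations };
}

// Sentences taken from case-diabetes-glp1.json and case-cardio-lipids.json.
const cited = splitClaim("At week 30, semaglutide reduced HbA1c more than placebo.1");
const uncited = splitClaim("These results were observed across subgroups.");

console.log(cited);
console.log(uncited);
```

A sentence that ends in a decimal value with no citation marker would be misread as cited, so a real extractor needs more context than this pattern uses.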