npm - @pharmatools/opengate - Versions diffs - 0.3.0 → 0.5.0 - Mend

@pharmatools/opengate 0.3.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16) hide show

package/ADAPTERS.md +3 -1
package/README.md +30 -8
package/datasets/SCHEMA.md +27 -0
package/datasets/cases/_retrieval-template.json +14 -0
package/datasets/cases/retrieval-macra-vascular.json +15 -0
package/datasets/cases/simplify-discharge-brief.json +45 -0
package/datasets/cases/simplify-lab-results.json +41 -0
package/datasets/cases/simplify-medication-change.json +34 -0
package/package.json +1 -1
package/src/adapters/patiently.mjs +64 -0
package/src/adapters/pubcrawl.mjs +107 -0
package/src/lib/adapter.mjs +2 -0
package/src/lib/baseline.mjs +36 -0
package/src/runner.mjs +46 -24
package/src/scorers/retrieval.mjs +143 -0
package/src/scorers/simplification.mjs +163 -0

package/ADAPTERS.md CHANGED Viewed

@@ -47,8 +47,10 @@ Every adapter provides two base exports, plus at least one complete **capability
 - **`qa`** — `splitClaims(text)` + `analyzeBatch(payload)`: systems that extract claims and verify them against references (scorers: claim-extraction, verdict-accuracy).
 - **`redaction`** — `redact(text)`: systems that remove identifiers from text (scorer: redaction). Returns `{ text, entities: [{ value, type }] }`, where `entities` are the identifiers the system removed.
+- **`simplify`** — `simplify({ text, audience?, tone?, length?, language? })`: systems that rewrite source text for a different audience (scorer: simplification). Returns `{ text }`. Scored on anchor recall (critical facts survive), fabricated numbers (nothing invented), and length contracts — paraphrase is expected, so prose is never checked verbatim.
+- **`retrieval`** — `fetchRecord({ id, type? })`: systems that retrieve records from an authority (scorer: retrieval). Returns `{ record }`. Scored on retrieval fidelity — field presence, hand-verified anchor fields, and structural invariants — catching parser regressions that would silently poison downstream grounding. For deterministic systems (no model), not just AI ones.
-Scorers check `adapter.capabilities.<name>` and skip with a reason when a capability is absent — the bundled Redacta adapter (`src/adapters/redacta.mjs`) runs only the redaction scorer, the RefCheckr adapter only the QA scorers.
+Scorers check `adapter.capabilities.<name>` and skip with a reason when a capability is absent — the bundled Redacta adapter runs only redaction, RefCheckr only QA, Patiently only simplification, and PubCrawl (`src/adapters/pubcrawl.mjs`) only retrieval.
 ### Base exports (always required)

package/README.md CHANGED Viewed

@@ -80,8 +80,8 @@ npm run eval:ci         # exit non-zero on any failure or metric regression
                                         │ adapters
                  ┌──────────────┬───────┴──────┬──────────────┐
                  ▼              ▼              ▼              ▼
-             RefCheckr       Redacta      Patiently AI   your system
-          (first impl., QA) (redaction)    (planned)    (write an adapter)
+          RefCheckr    Redacta    Patiently AI   PubCrawl    your system
+         (QA, first) (redaction)  (simplify)   (retrieval)  (write one)
 ```
 Where it sits in the development loop:
@@ -109,12 +109,14 @@ change a prompt, model, or pipeline
 | `claim-extraction` | online | precision / recall / F1 vs gold; non-claim leakage; citation agreement; **fidelity** (extracted claim is verbatim from source) |
 | `verdict-accuracy` | online | exact & adjacency accuracy on a six-point support scale; confusion matrix; **passage hallucination rate**; consistency across repeats; per-claim latency (p50/p95) and token usage for real cost/claim |
 | `redaction` | online | recall on gold identifiers with **leaks as named failures** (verbatim, and word-level for names); over-redaction count; known-gap tracking for documented engine gaps |
+| `simplification` | online | faithfulness of rewritten text: **anchor recall** (critical facts like doses must survive), **fabricated numbers** (nothing invented), length-contract gates, readability grade (info) |
+| `retrieval` | online | fidelity of retrieved records vs the authority: field presence, hand-verified **anchor fields** (author surnames, year, distinctive abstract phrases), and structural invariants that catch parser regressions (collapsed author arrays, `[object Object]` leakage) |
 Offline scorers run with no API key — fast enough for every commit. Online scorers exercise a live system through an adapter.
 **Scorecards** — every run writes `results/<timestamp>.json` stamped with the git SHA, so any result is reproducible and auditable. Per-model runs carry a `run_model` label, turning the results directory into a measured model comparison (accuracy × hallucination × latency × cost).
-**Regression gate** — `--baseline` saves a reference; subsequent runs print per-metric deltas (▲/▼ in percentage points) and `--ci` fails the build on any drop. No change ships without proving it didn't make the system less reliable.
+**Regression gate** — `--baseline` saves a reference; subsequent runs print per-metric deltas (▲/▼ in percentage points) and `--ci` fails the build on any drop. Baselines are **per-adapter** (`baseline.<adapter>.json`), so a PubCrawl retrieval scorecard can't clobber a RefCheckr QA one — each adapter keeps its own reference. No change ships without proving it didn't make the system less reliable.
 ## Adapters: evaluating your own system
@@ -134,13 +136,13 @@ Full contract, minimal skeleton, and verdict-mapping notes: **[ADAPTERS.md](ADAP
 ## CI: the GitHub Action
-Use OpenGATE as a drop-in regression gate in any repository. Keep your gold set and committed `baseline.json` in your own tree; any metric that drops fails the build:
+Use OpenGATE as a drop-in regression gate in any repository. Keep your gold set and committed baseline (`baseline.<adapter>.json`) in your own tree; any metric that drops fails the build:
 ```yaml
 - uses: nickjlamb/opengate@v0
   with:
     datasets: ./evals/datasets      # your cases/ + fixtures/
-    results: ./evals/results        # where baseline.json lives
+    results: ./evals/results        # where baseline.<adapter>.json lives
     adapter: ./evals/my-adapter.mjs # or the bundled HTTP adapter
     online: 'true'
   env:
@@ -171,6 +173,26 @@ node src/runner.mjs --online --adapter ./src/adapters/redacta.mjs
 On its first run against the new gold set, the eval found two real engine bugs — relation phrases like "Next of kin:" swallowed nested name matches, and apostrophe surnames (O'Brien) were dropped from name capture. Both were fixed in `@pharmatools/redacta` 1.2.1 and confirmed by the eval (`knownGap_closed: 2`), then promoted to gold. Street-line address detection followed in 1.3.0, closing the last tracked gap. Current scorecard: **100% recall on 25 gold identifiers, 0 leaks, no open gaps**.
+## Third implementation: Patiently AI
+[Patiently AI](https://www.pharmatools.ai/patiently-ai) exercises the **simplify capability** — faithfulness scoring for text that is paraphrase by design. On its first production run, the eval found the simplifier **dropping safety-critical specifics**: an antibiotic dose vanished from a Brief discharge summary and a haemoglobin value from a lab letter (anchor recall 86%). Root cause: the composed prompt had no faithfulness rule. A preservation rule added to Patiently's tone prompts (additively — the backend is shared with another product) took the next run to **100% anchor recall, 0 dropped facts, 0 fabricated numbers, 0 contract violations** — with the readability grade slightly *better* than before the fix.
+```bash
+node src/runner.mjs --online --adapter ./src/adapters/patiently.mjs
+```
+## Fourth implementation: PubCrawl
+[PubCrawl](https://www.pharmatools.ai/pubcrawl) is the odd one out — an MCP server wrapping PubMed and ClinicalTrials.gov, with **no model**. It exercises the **retrieval capability**: the deterministic layer everything else grounds on. A silent XML-parser regression (a collapsed author array, a merged abstract) would poison every citation built on the record, so the scorer checks retrieval fidelity against hand-verified anchors and structural invariants. The adapter drives PubCrawl through its real MCP interface, so the full production parse path is under test — and `scripts/capture-retrieval-case.mjs` bootstraps gold cases from live records for you to verify against the source.
+```bash
+npm install --no-save @modelcontextprotocol/sdk
+node scripts/capture-retrieval-case.mjs 31904519 > datasets/cases/retrieval-example.json  # then verify anchors
+node src/runner.mjs --online --adapter ./src/adapters/pubcrawl.mjs
+```
+That OpenGATE scores a non-AI system at all is the point: **evidence-grounded AI is only as trustworthy as the retrieval beneath it**, so the retrieval belongs in the same regression gate.
 ## Layout
 ```
@@ -186,14 +208,14 @@ opengate/
     scorers/      one file per metric family
     adapters/     system-under-test boundary (refcheckr.mjs is the reference)
     runner.mjs    CLI: discover cases → run scorers → report → snapshot → regression-check
-  results/        timestamped run snapshots + baseline.json
+  results/        timestamped run snapshots + baseline.<adapter>.json
 ```
 ## Roadmap
-- **Third adapter** — Patiently AI (faithfulness evaluation for patient-language simplification)
-- **Author-year in RefCheckr production** — `detectAuthorYear()` now lands "Smith 2020"-style keys in the reference implementation; adopting them in RefCheckr's numeric-keyed citation mapping is tracked separately
+- **Retrieval breadth** — retrieval currently scores one PubMed record type; extend to full-text, citation formatting, and trial detail across PubCrawl's other tools
+- **Retrieval coverage** — the retrieval gold set is one case; add a single-author paper (the exact array-collapse risk the capability exists to catch), a trial (NCT) record, and a full-text/citation case across PubCrawl's other tools
 - **Number-adjacent superscript** — `week 24.1` is genuinely ambiguous with decimals; remains a tracked known gap
 - **Growing gold set** — more domains, all six verdict types, real-world reference material
 - **Stable adapter surface** — the contract may still shift pre-1.0; semver will signal breaking changes

package/datasets/SCHEMA.md CHANGED Viewed

@@ -28,3 +28,30 @@ Cases for the `redaction` scorer, exercising an adapter's `redact()` capability.
 | `text` | yes | The source clinical note. |
 | `goldEntities[]` | yes | `{ type, value }` — identifiers that MUST be removed. Any that survive (verbatim, or word-level for `*_NAME` types) are **leaks** and fail the run. |
 | `knownGapEntities[]` | no | `{ type, value, comment }` — identifiers the system does not yet catch. Reported as tracked targets (`knownGap_open` / `knownGap_closed`), not failures. Use `comment` to document the reproduction. When a gap closes, promote the entity to `goldEntities`. |
+## Simplification cases (`kind: "simplification"`)
+Cases for the `simplification` scorer, exercising an adapter's `simplify()` capability. Simplified text is paraphrase by design, so the scorer checks what is deterministic and clinically dangerous to get wrong, not verbatim prose.
+| Field | Required | Meaning |
+|---|---|---|
+| `id` / `kind` | yes | `kind` must be `"simplification"`. |
+| `text` | yes | The source clinical text (synthetic). |
+| `audience` / `tone` / `length` | no | Passed to the adapter (defaults: Adult / Informative / Standard). |
+| `anchors[]` | yes | `{ value, aliases? }` — critical facts that MUST survive simplification (drug names, doses, key values, timeframes). Matched case-insensitively, whitespace-tolerant. A missing anchor is a **dropped fact** and fails the run. |
+| `allowedNewNumbers[]` | no | Numbers legitimately introduced by rephrasing (e.g. "twice daily" → "2 times a day"). Any other output number absent from the source is a **fabricated number** and fails the run. |
+| `maxBullets` / `maxWordsPerBullet` | no | Length-contract gates (e.g. Patiently Brief: 3 / 20). Only checked when present. |
+## Retrieval cases (`kind: "retrieval"`)
+Cases for the `retrieval` scorer, exercising an adapter's `fetchRecord()` capability against a deterministic retrieval system (e.g. PubCrawl). The failure that matters is a parser regression: a field dropped, collapsed, or garbled. Anchors are **independent ground truth** — copied from the source record (the PubMed page, the paper), never from the system's own output. Bootstrap with `scripts/capture-retrieval-case.mjs`, then verify.
+| Field | Required | Meaning |
+|---|---|---|
+| `id` / `kind` | yes | `kind` must be `"retrieval"`. |
+| `recordId` | yes | The stable identifier to fetch (PMID, NCT id). |
+| `recordType` | no | `pubmed` (default) or `trial`. |
+| `requireFields[]` | no | Field names that must be present and non-empty in the record (e.g. `title`, `authors`, `year`). |
+| `anchors[]` | no | Per-field ground-truth checks, one of: `{ field, contains }` (verbatim substring survived), `{ field, equals }` (exact value), `{ field, minCount }` (array didn't collapse). At least one of `requireFields`/`anchors` is needed. |
+The scorer also always applies structural invariants: `authors` must be a non-empty array of strings (not a collapsed single-author object), `title` a non-empty string, and no field may serialise to `"[object Object]"`.

package/datasets/cases/_retrieval-template.json ADDED Viewed

@@ -0,0 +1,14 @@
+{
+  "id": "retrieval-EXAMPLE",
+  "kind": "retrieval",
+  "_note": "Template (ignored by the runner — filename starts with _). Copy to a real filename and populate from scripts/capture-retrieval-case.mjs, then VERIFY every anchor against the source paper. Anchors are independent ground truth — copied from the paper, never from PubCrawl's own output.",
+  "recordId": "PMID or NCT id here",
+  "recordType": "pubmed",
+  "requireFields": ["title", "authors", "year"],
+  "anchors": [
+    { "field": "authors", "contains": "FirstAuthorSurname" },
+    { "field": "authors", "minCount": 5 },
+    { "field": "year", "equals": "2020" },
+    { "field": "abstract_sections", "contains": "a distinctive verbatim phrase from the abstract" }
+  ]
+}

package/datasets/cases/retrieval-macra-vascular.json ADDED Viewed

@@ -0,0 +1,15 @@
+{
+  "id": "retrieval-macra-vascular",
+  "kind": "retrieval",
+  "notes": "PMID 31904519 — Ann Vasc Surg 2020. A four-author paper: anchors are chosen to fail exactly where PubCrawl's known-risky XML paths would regress — author-array collapse (minCount), surname parsing (contains, first token of PubCrawl's 'Surname ForeName' format), and abstract-section truncation (distinctive opening phrase). Verified against the PubMed record.",
+  "recordId": "31904519",
+  "recordType": "pubmed",
+  "requireFields": ["title", "authors", "year"],
+  "anchors": [
+    { "field": "authors", "contains": "Haurani" },
+    { "field": "authors", "contains": "Satiani" },
+    { "field": "authors", "minCount": 4 },
+    { "field": "year", "equals": "2020" },
+    { "field": "abstract_sections", "contains": "The Medicare Access and CHIP Reauthorization Act" }
+  ]
+}

package/datasets/cases/simplify-discharge-brief.json ADDED Viewed

@@ -0,0 +1,45 @@
+{
+  "id": "simplify-discharge-brief",
+  "kind": "simplification",
+  "description": "SYNTHETIC discharge note. Brief mode exercises Patiently's documented contract: 2-3 bullets, under 20 words each. Anchors are the clinically critical facts a patient must not lose.",
+  "audience": "Adult",
+  "tone": "Informative",
+  "length": "Brief",
+  "text": "Discharge summary: You were admitted with community-acquired pneumonia. You have been started on amoxicillin 500 mg three times daily for 5 days. Please see your GP in 2 weeks for review. Return to hospital if you develop worsening breathlessness or a fever above 38 degrees.",
+  "anchors": [
+    {
+      "value": "amoxicillin"
+    },
+    {
+      "value": "500 mg",
+      "aliases": [
+        "500mg",
+        "500 milligrams"
+      ]
+    },
+    {
+      "value": "5 days",
+      "aliases": [
+        "five days"
+      ]
+    },
+    {
+      "value": "2 weeks",
+      "aliases": [
+        "two weeks"
+      ]
+    },
+    {
+      "value": "38",
+      "aliases": [
+        "38C",
+        "38 degrees"
+      ]
+    }
+  ],
+  "allowedNewNumbers": [
+    "3"
+  ],
+  "maxBullets": 3,
+  "maxWordsPerBullet": 20
+}

package/datasets/cases/simplify-lab-results.json ADDED Viewed

@@ -0,0 +1,41 @@
+{
+  "id": "simplify-lab-results",
+  "kind": "simplification",
+  "description": "SYNTHETIC lab letter. Standard length, Reassuring tone. Values and the new medication must survive; no invented numbers.",
+  "audience": "Adult",
+  "tone": "Reassuring",
+  "length": "Standard",
+  "text": "Blood test results: your haemoglobin was 9.8 g/dL (below the normal range) and your ferritin was 8 micrograms/L (low), consistent with iron-deficiency anaemia. We recommend starting ferrous sulfate 200 mg twice daily. We will repeat the blood count in 8 weeks.",
+  "anchors": [
+    {
+      "value": "9.8"
+    },
+    {
+      "value": "ferritin"
+    },
+    {
+      "value": "ferrous sulfate",
+      "aliases": [
+        "ferrous sulphate",
+        "iron tablets",
+        "iron supplement"
+      ]
+    },
+    {
+      "value": "200 mg",
+      "aliases": [
+        "200mg",
+        "200 milligrams"
+      ]
+    },
+    {
+      "value": "8 weeks",
+      "aliases": [
+        "eight weeks"
+      ]
+    }
+  ],
+  "allowedNewNumbers": [
+    "2"
+  ]
+}

package/datasets/cases/simplify-medication-change.json ADDED Viewed

@@ -0,0 +1,34 @@
+{
+  "id": "simplify-medication-change",
+  "kind": "simplification",
+  "description": "SYNTHETIC dose-change letter. The new dose is the fact that must never be lost or altered.",
+  "audience": "Adult",
+  "tone": "Informative",
+  "length": "Standard",
+  "text": "Your levothyroxine dose has been increased from 50 micrograms to 100 micrograms once daily, following your recent TSH result of 8.2 mU/L. Please take the new dose each morning before breakfast. We will repeat your thyroid function tests in 6 to 8 weeks.",
+  "anchors": [
+    {
+      "value": "levothyroxine"
+    },
+    {
+      "value": "100 micrograms",
+      "aliases": [
+        "100 mcg",
+        "100mcg",
+        "100 µg"
+      ]
+    },
+    {
+      "value": "morning"
+    },
+    {
+      "value": "6 to 8 weeks",
+      "aliases": [
+        "6-8 weeks",
+        "six to eight weeks",
+        "6–8 weeks"
+      ]
+    }
+  ],
+  "allowedNewNumbers": []
+}

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@pharmatools/opengate",
-  "version": "0.3.0",
+  "version": "0.5.0",
   "type": "module",
   "description": "OpenGATE — Open Grounded AI Testing & Evaluation. An open-source framework for evaluating evidence-grounded AI systems.",
   "license": "MIT",

package/src/adapters/patiently.mjs ADDED Viewed

@@ -0,0 +1,64 @@
+// Patiently AI adapter — third bundled implementation, exercising the
+// framework's simplify capability. Patiently AI (getpatiently.ai) converts
+// clinical text into patient-friendly language via a Firebase Cloud Function.
+//
+// Config via env (the endpoint is public; no token needed):
+//   PATIENTLY_API_URL     override the translate endpoint
+//   PATIENTLY_EVAL_MODEL  optional label recorded in the scorecard
+//
+// Request contract: POST { text, language, audience, tone, length }
+// Response contract: { optimisedText }
+// Patiently-flavoured keys: audience Child|Teenager|Adult|Carer,
+// tone Friendly|Reassuring|Informative, length Brief|Standard|Detailed.
+export const meta = { name: 'patiently' };
+const DEFAULT_URL = 'https://us-central1-medicaltextoptimiser.cloudfunctions.net/translate';
+const URL = process.env.PATIENTLY_API_URL || DEFAULT_URL;
+export function onlineAvailable() {
+  return Boolean(URL);
+}
+export function onlineConfigHint() {
+  return 'Set PATIENTLY_API_URL (or unset it to use the production endpoint).';
+}
+export function runModel() {
+  return process.env.PATIENTLY_EVAL_MODEL || null;
+}
+// ── Timing capture ──────────────────────────────────────────────────────
+const _calls = [];
+export function resetTiming() { _calls.length = 0; }
+export function callLatencies() { return _calls.map(c => c.ms); }
+/**
+ * Simplify capability.
+ * @param {object} req — { text, audience?, tone?, length?, language? }
+ * @returns {Promise<{ text: string }>}
+ */
+export async function simplify(req) {
+  const t0 = performance.now();
+  try {
+    const res = await fetch(URL, {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify({
+        text: req.text,
+        language: req.language || 'en',
+        audience: req.audience || 'Adult',
+        tone: req.tone || 'Informative',
+        length: req.length || 'Standard',
+      }),
+    });
+    const data = await res.json().catch(() => ({}));
+    if (!res.ok) throw new Error(data.error || `translate → HTTP ${res.status}`);
+    if (typeof data.optimisedText !== 'string') {
+      throw new Error('translate response missing optimisedText');
+    }
+    return { text: data.optimisedText };
+  } finally {
+    _calls.push({ ms: performance.now() - t0 });
+  }
+}

package/src/adapters/pubcrawl.mjs ADDED Viewed

@@ -0,0 +1,107 @@
+// PubCrawl adapter — fourth bundled implementation, exercising the framework's
+// retrieval capability. PubCrawl (@pharmatools/pubcrawl) is an MCP server that
+// gives AI clients access to PubMed, ClinicalTrials.gov, and drug labelling.
+//
+// It is NOT an AI system — it's deterministic wrappers around public APIs. The
+// eval measures RETRIEVAL FIDELITY: does the record a client receives match the
+// authority (right title, all authors, intact abstract)? A silent XML-parser
+// regression here would poison every downstream grounding claim, so this is the
+// foundation the QA/simplify capabilities build on.
+//
+// The adapter talks to PubCrawl through its real MCP interface (no
+// re-implementation of the tool handlers), so the full production parse path is
+// under test. The MCP SDK is imported dynamically — install it alongside
+// PubCrawl; OpenGATE core stays zero-dependency.
+//
+// Config via env:
+//   PUBCRAWL_MCP_URL   HTTP MCP endpoint, e.g. http://localhost:8080/mcp
+//                      (start PubCrawl with `npm run start:http`)
+//   PUBCRAWL_CMD       stdio spawn command (default: "npx")
+//   PUBCRAWL_ARGS      stdio args, space-separated (default: "-y @pharmatools/pubcrawl")
+//   NCBI_API_KEY       forwarded to the server (higher NCBI rate limit)
+// If PUBCRAWL_MCP_URL is set it wins; otherwise the server is spawned over stdio.
+export const meta = { name: 'pubcrawl' };
+let sdk = null;
+let loadError = null;
+try {
+  const [{ Client }, stdio, http] = await Promise.all([
+    import('@modelcontextprotocol/sdk/client/index.js'),
+    import('@modelcontextprotocol/sdk/client/stdio.js'),
+    import('@modelcontextprotocol/sdk/client/streamableHttp.js').catch(() => ({})),
+  ]);
+  sdk = { Client, StdioClientTransport: stdio.StdioClientTransport, StreamableHTTPClientTransport: http.StreamableHTTPClientTransport };
+} catch (err) {
+  loadError = err;
+}
+const HTTP_URL = process.env.PUBCRAWL_MCP_URL;
+export function onlineAvailable() {
+  return Boolean(sdk);
+}
+export function onlineConfigHint() {
+  if (!sdk) {
+    return 'MCP SDK not installed — run: npm install --no-save @modelcontextprotocol/sdk' +
+      (loadError ? ` (${loadError.code || loadError.message})` : '');
+  }
+  return 'PubCrawl adapter ready (set PUBCRAWL_MCP_URL for HTTP, or it spawns the server over stdio).';
+}
+// ── Timing ──────────────────────────────────────────────────────────────
+const _calls = [];
+export function resetTiming() { _calls.length = 0; }
+export function callLatencies() { return _calls.map(c => c.ms); }
+function newTransport() {
+  if (HTTP_URL) {
+    if (!sdk.StreamableHTTPClientTransport) throw new Error('HTTP transport unavailable in this SDK build');
+    return new sdk.StreamableHTTPClientTransport(new URL(HTTP_URL));
+  }
+  const command = process.env.PUBCRAWL_CMD || 'npx';
+  const args = (process.env.PUBCRAWL_ARGS || '-y @pharmatools/pubcrawl').split(/\s+/).filter(Boolean);
+  const env = { ...process.env };
+  return new sdk.StdioClientTransport({ command, args, env });
+}
+// PubCrawl tool per record type. Extend as retrieval gold cases grow.
+const TOOL_FOR = {
+  pubmed: (id) => ({ name: 'get_abstract', arguments: { pmid: String(id) } }),
+  trial: (id) => ({ name: 'get_trial', arguments: { nctId: String(id) } }),
+};
+/**
+ * Retrieval capability. Fetches one record through PubCrawl's MCP interface and
+ * returns its parsed JSON as { record }.
+ * @param {object} req — { id, type? }  (type default: 'pubmed')
+ */
+export async function fetchRecord(req) {
+  const type = req.type || 'pubmed';
+  const build = TOOL_FOR[type];
+  if (!build) throw new Error(`unsupported record type "${type}"`);
+  const t0 = performance.now();
+  const client = new sdk.Client({ name: 'opengate', version: '0' }, { capabilities: {} });
+  const transport = newTransport();
+  try {
+    await client.connect(transport);
+    const res = await client.callTool(build(req.id));
+    if (res.isError) {
+      const msg = res.content?.map(c => c.text).join(' ') || 'tool error';
+      throw new Error(msg);
+    }
+    const text = (res.content || []).find(c => c.type === 'text')?.text ?? '';
+    let record;
+    try {
+      record = JSON.parse(text);
+    } catch {
+      throw new Error(`non-JSON tool response: ${text.slice(0, 120)}`);
+    }
+    return { record };
+  } finally {
+    _calls.push({ ms: performance.now() - t0 });
+    await client.close().catch(() => {});
+  }
+}

package/src/lib/adapter.mjs CHANGED Viewed

@@ -24,6 +24,8 @@ const REQUIRED_BASE = ['onlineAvailable', 'onlineConfigHint'];
 const CAPABILITIES = {
   qa: ['splitClaims', 'analyzeBatch'],       // claim extraction + verdicts against references
   redaction: ['redact'],                      // identifier removal from text
+  simplify: ['simplify'],                     // faithful simplification of source text
+  retrieval: ['fetchRecord'],                 // fidelity of retrieved records vs the authority
 };
 // Optional: validated if present; no-op defaults are supplied if absent, so

package/src/lib/baseline.mjs ADDED Viewed

@@ -0,0 +1,36 @@
+// Per-adapter baseline resolution.
+//
+// A regression baseline only means something relative to the adapter that
+// produced it — a PubCrawl retrieval scorecard and a RefCheckr QA scorecard
+// share no metrics, so a single baseline.json would let one clobber the other.
+// Baselines are therefore keyed by adapter: baseline.<adapter>.json.
+//
+// Legacy migration: an older single baseline.json (which records the adapter
+// that produced it) is still honoured — but ONLY for that same adapter, never
+// cross-adapter.
+/** Filesystem-safe per-adapter baseline filename. */
+export function baselineFileName(adapterName) {
+  const safe = String(adapterName || 'default').replace(/[^a-z0-9_-]+/gi, '-').toLowerCase();
+  return `baseline.${safe}.json`;
+}
+/**
+ * Choose which baseline file to read for a regression check.
+ * @param {string} adapterName
+ * @param {object} present
+ *   { perAdapter: boolean,          // baseline.<adapter>.json exists
+ *     legacy: boolean,              // baseline.json exists
+ *     legacyAdapter: string|null }  // the `adapter` field inside baseline.json
+ * @returns {{ file: string, source: 'per-adapter'|'legacy' } | null}
+ */
+export function resolveBaseline(adapterName, present) {
+  if (present.perAdapter) {
+    return { file: baselineFileName(adapterName), source: 'per-adapter' };
+  }
+  // A legacy baseline is trustworthy only for the adapter that wrote it.
+  if (present.legacy && present.legacyAdapter === adapterName) {
+    return { file: 'baseline.json', source: 'legacy' };
+  }
+  return null;
+}

package/src/runner.mjs CHANGED Viewed

@@ -4,11 +4,12 @@
 //   node src/runner.mjs            # offline scorers only
 //   node src/runner.mjs --online   # also run scorers that hit the API
 //   node src/runner.mjs --ci       # exit non-zero on failure or regression
-//   node src/runner.mjs --baseline # save this run as results/baseline.json
+//   node src/runner.mjs --baseline # save this run as results/baseline.<adapter>.json
 //
 // Discovers gold cases (datasets/cases/*.json) and fixtures (datasets/fixtures/*.json),
 // runs each scorer, prints a summary, writes a timestamped results file, and
-// compares headline metrics against results/baseline.json to catch regressions.
+// compares headline metrics against the per-adapter baseline
+// (results/baseline.<adapter>.json) to catch regressions.
 import { readdir, readFile, writeFile, mkdir } from 'node:fs/promises';
 import { existsSync } from 'node:fs';
@@ -16,6 +17,7 @@ import { fileURLToPath } from 'node:url';
 import { dirname, join, resolve } from 'node:path';
 import { execSync } from 'node:child_process';
 import { loadAdapter } from './lib/adapter.mjs';
+import { baselineFileName, resolveBaseline } from './lib/baseline.mjs';
 const __dirname = dirname(fileURLToPath(import.meta.url));
 const EVAL_ROOT = join(__dirname, '..');
@@ -99,6 +101,8 @@ const SCORERS = [
   './scorers/claim-extraction.mjs',
   './scorers/verdict-accuracy.mjs',
   './scorers/redaction.mjs',
+  './scorers/simplification.mjs',
+  './scorers/retrieval.mjs',
 ];
 async function loadJsonDir(dir, { skipPrefix } = {}) {
@@ -183,31 +187,49 @@ async function main() {
   };
   await writeFile(join(RESULTS_DIR, `${stamp}.json`), JSON.stringify({ ...snapshot, results }, null, 2));
   if (SAVE_BASELINE) {
-    await writeFile(join(RESULTS_DIR, 'baseline.json'), JSON.stringify(snapshot, null, 2));
-    console.log('\n  saved baseline.json');
+    // Baselines are per-adapter — a PubCrawl scorecard must not overwrite a
+    // RefCheckr one (they share no metrics).
+    const fname = baselineFileName(adapter.name);
+    await writeFile(join(RESULTS_DIR, fname), JSON.stringify(snapshot, null, 2));
+    console.log(`\n  saved ${fname}`);
   }
-  // ── Regression check vs baseline ──
+  // ── Regression check vs the baseline for THIS adapter ──
   let regressed = false;
-  const baselinePath = join(RESULTS_DIR, 'baseline.json');
-  if (existsSync(baselinePath) && !SAVE_BASELINE) {
-    const base = JSON.parse(await readFile(baselinePath, 'utf8'));
-    const baseById = Object.fromEntries(base.results.map(r => [r.id, r]));
-    console.log('\n  vs baseline:');
-    for (const r of results) {
-      const b = baseById[r.id];
-      if (!b || !r.metrics || !b.metrics) continue;
-      for (const [k, v] of Object.entries(r.metrics)) {
-        if (typeof v !== 'number' || typeof b.metrics[k] !== 'number') continue;
-        const delta = v - b.metrics[k];
-        if (Math.abs(delta) < 1e-9) continue;
-        const arrow = delta > 0 ? '▲' : '▼';
-        // Only rate metrics gate regressions (higher = better); counts shown for info.
-        if (isRateKey(k)) {
-          if (delta < -1e-9) regressed = true;
-          console.log(`      ${r.id}.${k}: ${arrow} ${delta > 0 ? '+' : ''}${(delta * 100).toFixed(1)}pp`);
-        } else {
-          console.log(`      ${r.id}.${k}: ${arrow} ${delta > 0 ? '+' : ''}${delta}`);
+  if (!SAVE_BASELINE) {
+    const legacyPath = join(RESULTS_DIR, 'baseline.json');
+    const perAdapterPath = join(RESULTS_DIR, baselineFileName(adapter.name));
+    let legacyAdapter = null;
+    if (existsSync(legacyPath)) {
+      try { legacyAdapter = JSON.parse(await readFile(legacyPath, 'utf8')).adapter ?? null; } catch { /* ignore */ }
+    }
+    const chosen = resolveBaseline(adapter.name, {
+      perAdapter: existsSync(perAdapterPath),
+      legacy: existsSync(legacyPath),
+      legacyAdapter,
+    });
+    if (chosen) {
+      const base = JSON.parse(await readFile(join(RESULTS_DIR, chosen.file), 'utf8'));
+      const baseById = Object.fromEntries(base.results.map(r => [r.id, r]));
+      console.log(`\n  vs baseline (${chosen.file}):`);
+      if (chosen.source === 'legacy') {
+        console.log(`      note: using legacy baseline.json — re-run --baseline to migrate to ${baselineFileName(adapter.name)}`);
+      }
+      for (const r of results) {
+        const b = baseById[r.id];
+        if (!b || !r.metrics || !b.metrics) continue;
+        for (const [k, v] of Object.entries(r.metrics)) {
+          if (typeof v !== 'number' || typeof b.metrics[k] !== 'number') continue;
+          const delta = v - b.metrics[k];
+          if (Math.abs(delta) < 1e-9) continue;
+          const arrow = delta > 0 ? '▲' : '▼';
+          // Only rate metrics gate regressions (higher = better); counts shown for info.
+          if (isRateKey(k)) {
+            if (delta < -1e-9) regressed = true;
+            console.log(`      ${r.id}.${k}: ${arrow} ${delta > 0 ? '+' : ''}${(delta * 100).toFixed(1)}pp`);
+          } else {
+            console.log(`      ${r.id}.${k}: ${arrow} ${delta > 0 ? '+' : ''}${delta}`);
+          }
         }
       }
     }

package/src/scorers/retrieval.mjs ADDED Viewed

@@ -0,0 +1,143 @@
+// ONLINE scorer — retrieval fidelity.
+//
+// Exercises the adapter's fetchRecord() capability against gold cases of kind
+// "retrieval": a stable record ID plus a few HAND-VERIFIED anchor fields.
+// Unlike the model-scoring scorers, the system under test is deterministic
+// (PubCrawl wraps NCBI / ClinicalTrials.gov). The failure that matters here is
+// a PARSE REGRESSION: a record comes back but a field is dropped, collapsed,
+// or garbled — e.g. an authors array that collapses to a single object when a
+// paper has one author, or an abstract whose sections merge or truncate.
+// Because everything downstream (RefCheckr, any RAG) grounds on these records,
+// a silent parse bug poisons every citation built on it.
+//
+// Ground truth is independent of the system: anchors are copied from the
+// paper/record itself, not from PubCrawl's output (see datasets/SCHEMA.md and
+// scripts/capture-retrieval-case.mjs to bootstrap them). The scorer checks:
+//
+//   • field presence (gate) — every field named in `requireFields` is present
+//     and non-empty (title, authors, year, …). Catches dropped/blank fields.
+//   • anchor fidelity (gate) — each anchor must be satisfied by the record:
+//       { field: "authors",  contains: "Douglas" }   surname survived parsing
+//       { field: "year",     equals: "2020" }
+//       { field: "abstract", contains: "proptosis" } distinctive phrase intact
+//       { field: "authors",  minCount: 5 }            array didn't collapse
+//   • structural invariants (gate) — authors is a non-empty array of strings
+//     (not "[object Object]" from a collapsed single-author record); title is a
+//     non-empty string; no field serialises to "[object Object]".
+export const meta = { id: 'retrieval', mode: 'online' };
+const norm = (s) => String(s).toLowerCase().replace(/\s+/g, ' ').trim();
+/** Read a possibly-nested field from a record; arrays are joined for text ops. */
+function fieldValue(record, field) {
+  const v = record?.[field];
+  if (Array.isArray(v)) return v;
+  return v;
+}
+function asText(v) {
+  return Array.isArray(v) ? v.map(x => (typeof x === 'string' ? x : JSON.stringify(x))).join(' ') : String(v ?? '');
+}
+function checkAnchor(record, a) {
+  const v = fieldValue(record, a.field);
+  if (a.minCount != null) {
+    return Array.isArray(v) && v.length >= a.minCount
+      ? null : `${a.field} has ${Array.isArray(v) ? v.length : 0} items, expected ≥ ${a.minCount}`;
+  }
+  if (a.equals != null) {
+    return norm(asText(v)) === norm(a.equals) ? null : `${a.field} = "${asText(v)}", expected "${a.equals}"`;
+  }
+  if (a.contains != null) {
+    return norm(asText(v)).includes(norm(a.contains)) ? null : `${a.field} missing "${a.contains}"`;
+  }
+  return `anchor on ${a.field} has no check (equals/contains/minCount)`;
+}
+function structuralProblems(record) {
+  const problems = [];
+  const flat = JSON.stringify(record);
+  if (flat.includes('[object Object]')) problems.push('record contains "[object Object]" (a value failed to serialise — likely a parser shape bug)');
+  if ('authors' in record) {
+    const a = record.authors;
+    if (!Array.isArray(a)) problems.push('authors is not an array (single-author collapse?)');
+    else if (a.some(x => typeof x !== 'string' || !x.trim())) problems.push('authors contains a non-string / empty entry');
+  }
+  if ('title' in record && (typeof record.title !== 'string' || !record.title.trim())) {
+    problems.push('title is missing or not a non-empty string');
+  }
+  return problems;
+}
+export async function run({ cases, adapter }) {
+  if (!adapter.capabilities.retrieval) {
+    return { meta, skipped: true, reason: `adapter "${adapter.name}" has no retrieval capability` };
+  }
+  if (!adapter.onlineAvailable()) {
+    return { meta, skipped: true, reason: adapter.onlineConfigHint() };
+  }
+  const goldCases = cases.filter(c => c.kind === 'retrieval' && c.id && (c.anchors || c.requireFields));
+  if (goldCases.length === 0) {
+    return { meta, skipped: true, reason: 'No cases of kind "retrieval" with anchors/requireFields.' };
+  }
+  adapter.resetTiming();
+  const perCase = [];
+  const failures = [];
+  for (const c of goldCases) {
+    let record;
+    try {
+      const res = await adapter.fetchRecord({ id: c.recordId, type: c.recordType });
+      record = res?.record ?? res;
+    } catch (err) {
+      failures.push(`case ${c.id}: ${err.message}`);
+      continue;
+    }
+    if (!record || typeof record !== 'object') {
+      failures.push(`NO RECORD in ${c.id}: fetchRecord returned nothing usable`);
+      continue;
+    }
+    const caseFailures = [];
+    for (const f of c.requireFields || []) {
+      const v = fieldValue(record, f);
+      const empty = v == null || (Array.isArray(v) ? v.length === 0 : String(v).trim() === '');
+      if (empty) caseFailures.push(`missing field "${f}"`);
+    }
+    for (const a of c.anchors || []) {
+      const problem = checkAnchor(record, a);
+      if (problem) caseFailures.push(`anchor ${JSON.stringify(a)} — ${problem}`);
+    }
+    for (const s of structuralProblems(record)) caseFailures.push(s);
+    for (const cf of caseFailures) failures.push(`FIDELITY ${c.id}: ${cf}`);
+    perCase.push({
+      case: c.id, recordId: c.recordId,
+      anchors: (c.anchors || []).length,
+      requireFields: (c.requireFields || []).length,
+      problems: caseFailures,
+    });
+  }
+  const totalChecks = perCase.reduce((a, p) => a + p.anchors + p.requireFields, 0);
+  const failedChecks = perCase.reduce((a, p) => a + p.problems.length, 0);
+  const latencies = adapter.callLatencies();
+  const metrics = {
+    n_cases: perCase.length,
+    n_checks: totalChecks,
+    fidelity: round(totalChecks ? 1 - Math.min(failedChecks, totalChecks) / totalChecks : (perCase.length && !failedChecks ? 1 : 0)),
+    failed_checks: failedChecks,
+    ...(latencies.length ? { latency_p50_ms: Math.round(percentileOf(latencies, 50)) } : {}),
+  };
+  return { meta, metrics, detail: { perCase }, failures, passed: failures.length === 0 };
+}
+function percentileOf(values, p) {
+  const xs = values.filter(Number.isFinite).sort((a, b) => a - b);
+  return xs.length ? xs[Math.min(xs.length - 1, Math.floor((p / 100) * xs.length))] : 0;
+}
+function round(x) { return Math.round(x * 1000) / 1000; }

package/src/scorers/simplification.mjs ADDED Viewed

@@ -0,0 +1,163 @@
+// ONLINE scorer — simplification faithfulness.
+//
+// Exercises the adapter's simplify() capability against gold cases of kind
+// "simplification": a source clinical text plus the facts that must survive
+// simplification. Simplified text is paraphrase BY DESIGN, so verbatim checks
+// on prose are meaningless — instead the scorer measures what is
+// deterministically checkable and clinically dangerous to get wrong:
+//
+//   • anchor recall (gate) — critical facts (drug names, doses, key values)
+//     must appear in the output, matched case-insensitively with per-anchor
+//     aliases ("5 mg" ≈ "5mg"). A dropped dose is the failure that matters.
+//   • fabricated numbers (gate) — every number in the output must exist in
+//     the source, an anchor/alias, or the case's allowedNewNumbers list.
+//     An invented value in patient-facing text is the worst failure there is.
+//   • length contract (gate, when the case specifies) — e.g. Patiently's
+//     Brief mode: ≤3 bullets, ≤20 words per bullet. Contract drift here
+//     previously hid silent prompt-fallback bugs.
+//   • readability (info) — Flesch-Kincaid grade of the output; reported,
+//     not gated, in v1.
+//
+// Case schema (datasets/SCHEMA.md):
+//   { "id", "kind": "simplification", "text", "audience", "tone", "length",
+//     "anchors": [{ "value", "aliases": [] }],
+//     "allowedNewNumbers": ["2"], "maxBullets": 3, "maxWordsPerBullet": 20 }
+import { mean } from '../lib/metrics.mjs';
+export const meta = { id: 'simplification', mode: 'online' };
+const norm = (s) => String(s).toLowerCase().replace(/\s+/g, ' ');
+/** Whitespace-tolerant, case-insensitive containment. */
+function contains(haystack, needle) {
+  const h = norm(haystack).replace(/\s/g, '');
+  const n = norm(needle).replace(/\s/g, '');
+  return n.length > 0 && h.includes(n);
+}
+const NUM_RE = /\d+(?:\.\d+)?/g;
+const numbersIn = (s) => new Set((String(s).match(NUM_RE) || []).map(n => n.replace(/^0+(?=\d)/, '')));
+/** Lines that look like bullets: -, •, *, or "1." style. */
+function bulletLines(text) {
+  return String(text).split(/\r?\n/).map(l => l.trim())
+    .filter(l => /^([-•*]|\d+[.)])\s+/.test(l));
+}
+// Flesch-Kincaid grade with a vowel-group syllable heuristic. Approximate,
+// which is why readability is reported rather than gated.
+function fleschKincaidGrade(text) {
+  const words = String(text).toLowerCase().match(/[a-z]+/g) || [];
+  const sentences = Math.max(1, (String(text).match(/[.!?]+/g) || []).length);
+  if (!words.length) return null;
+  let syllables = 0;
+  for (const w of words) {
+    const groups = (w.replace(/e$/, '').match(/[aeiouy]+/g) || []).length;
+    syllables += Math.max(1, groups);
+  }
+  return 0.39 * (words.length / sentences) + 11.8 * (syllables / words.length) - 15.59;
+}
+export async function run({ cases, adapter }) {
+  if (!adapter.capabilities.simplify) {
+    return { meta, skipped: true, reason: `adapter "${adapter.name}" has no simplify capability` };
+  }
+  if (!adapter.onlineAvailable()) {
+    return { meta, skipped: true, reason: adapter.onlineConfigHint() };
+  }
+  const goldCases = cases.filter(c => c.kind === 'simplification' && (c.anchors || []).length);
+  if (goldCases.length === 0) {
+    return { meta, skipped: true, reason: 'No cases of kind "simplification" with anchors.' };
+  }
+  adapter.resetTiming();
+  const perCase = [];
+  const failures = [];
+  for (const c of goldCases) {
+    let out;
+    try {
+      const res = await adapter.simplify({
+        text: c.text, audience: c.audience, tone: c.tone, length: c.length, language: c.language,
+      });
+      out = res.text ?? '';
+    } catch (err) {
+      failures.push(`case ${c.id}: ${err.message}`);
+      continue;
+    }
+    // Anchor recall: value or any alias must survive.
+    const missed = (c.anchors || []).filter(a =>
+      ![a.value, ...(a.aliases || [])].some(v => contains(out, v)));
+    for (const a of missed) {
+      failures.push(`DROPPED FACT in ${c.id}: anchor "${a.value}" absent from simplified output`);
+    }
+    // Fabricated numbers: output numbers must come from somewhere legitimate.
+    const legitimate = new Set([
+      ...numbersIn(c.text),
+      ...(c.anchors || []).flatMap(a => [...numbersIn(a.value), ...(a.aliases || []).flatMap(x => [...numbersIn(x)])]),
+      ...(c.allowedNewNumbers || []).map(String),
+    ]);
+    const fabricated = [...numbersIn(out)].filter(n => !legitimate.has(n));
+    for (const n of fabricated) {
+      failures.push(`FABRICATED NUMBER in ${c.id}: "${n}" appears in output but not in source`);
+    }
+    // Length contract (only when the case declares one).
+    const bullets = bulletLines(out);
+    const contractViolations = [];
+    if (c.maxBullets != null && bullets.length > c.maxBullets) {
+      contractViolations.push(`${bullets.length} bullets > max ${c.maxBullets}`);
+    }
+    if (c.maxWordsPerBullet != null) {
+      for (const b of bullets) {
+        const words = b.replace(/^([-•*]|\d+[.)])\s+/, '').split(/\s+/).filter(Boolean).length;
+        if (words > c.maxWordsPerBullet) contractViolations.push(`bullet has ${words} words > max ${c.maxWordsPerBullet}`);
+      }
+    }
+    if (c.maxBullets != null && bullets.length === 0) {
+      contractViolations.push('bullet output expected, none found');
+    }
+    for (const v of contractViolations) failures.push(`CONTRACT in ${c.id}: ${v}`);
+    perCase.push({
+      case: c.id,
+      anchors: (c.anchors || []).length,
+      anchorsMissed: missed.map(a => a.value),
+      fabricated,
+      contractViolations,
+      grade: round(fleschKincaidGrade(out) ?? -1),
+      outputChars: out.length,
+      bullets: bullets.length,
+      // The output itself, so a dropped-fact failure can be diagnosed from the
+      // snapshot (was it omitted, reworded past the aliases, or replaced?).
+      output: out.slice(0, 600),
+    });
+  }
+  const totalAnchors = perCase.reduce((a, p) => a + p.anchors, 0);
+  const totalMissed = perCase.reduce((a, p) => a + p.anchorsMissed.length, 0);
+  const latencies = adapter.callLatencies();
+  const metrics = {
+    n_cases: perCase.length,
+    n_anchors: totalAnchors,
+    anchor_recall: round(totalAnchors ? 1 - totalMissed / totalAnchors : 0),
+    dropped_facts: totalMissed,
+    fabricated_numbers: perCase.reduce((a, p) => a + p.fabricated.length, 0),
+    contract_violations: perCase.reduce((a, p) => a + p.contractViolations.length, 0),
+    mean_grade: round(mean(perCase.map(p => p.grade).filter(g => g >= 0))),
+    ...(latencies.length ? { latency_p50_ms: Math.round(percentileOf(latencies, 50)) } : {}),
+    ...(adapter.runModel() ? { run_model: adapter.runModel() } : {}),
+  };
+  return { meta, metrics, detail: { perCase }, failures, passed: failures.length === 0 };
+}
+function percentileOf(values, p) {
+  const xs = values.filter(Number.isFinite).sort((a, b) => a - b);
+  return xs.length ? xs[Math.min(xs.length - 1, Math.floor((p / 100) * xs.length))] : 0;
+}
+function round(x) { return Math.round(x * 100) / 100; }