@pharmatools/opengate 0.2.1 → 0.4.0

package/ADAPTERS.md CHANGED
@@ -47,8 +47,10 @@ Every adapter provides two base exports, plus at least one complete **capability
 
  - **`qa`** — `splitClaims(text)` + `analyzeBatch(payload)`: systems that extract claims and verify them against references (scorers: claim-extraction, verdict-accuracy).
  - **`redaction`** — `redact(text)`: systems that remove identifiers from text (scorer: redaction). Returns `{ text, entities: [{ value, type }] }`, where `entities` are the identifiers the system removed.
+ - **`simplify`** — `simplify({ text, audience?, tone?, length?, language? })`: systems that rewrite source text for a different audience (scorer: simplification). Returns `{ text }`. Scored on anchor recall (critical facts survive), fabricated numbers (nothing invented), and length contracts — paraphrase is expected, so prose is never checked verbatim.
+ - **`retrieval`** — `fetchRecord({ id, type? })`: systems that retrieve records from an authority (scorer: retrieval). Returns `{ record }`. Scored on retrieval fidelity — field presence, hand-verified anchor fields, and structural invariants — catching parser regressions that would silently poison downstream grounding. For deterministic systems (no model), not just AI ones.
 
- Scorers check `adapter.capabilities.<name>` and skip with a reason when a capability is absent — the bundled Redacta adapter (`src/adapters/redacta.mjs`) runs only the redaction scorer, the RefCheckr adapter only the QA scorers.
+ Scorers check `adapter.capabilities.<name>` and skip with a reason when a capability is absent — the bundled Redacta adapter runs only redaction, RefCheckr only QA, Patiently only simplification, and PubCrawl (`src/adapters/pubcrawl.mjs`) only retrieval.
 
  ### Base exports (always required)
 
package/README.md CHANGED
@@ -80,8 +80,8 @@ npm run eval:ci # exit non-zero on any failure or metric regression
  │ adapters
  ┌──────────────┬───────┴──────┬──────────────┐
  ▼ ▼ ▼ ▼
- RefCheckr Redacta Patiently AI your system
- (first impl., QA) (redaction) (planned) (write an adapter)
+ RefCheckr Redacta Patiently AI PubCrawl your system
+ (QA, first) (redaction) (simplify) (retrieval) (write one)
  ```
 
  Where it sits in the development loop:
@@ -109,6 +109,8 @@ change a prompt, model, or pipeline
  | `claim-extraction` | online | precision / recall / F1 vs gold; non-claim leakage; citation agreement; **fidelity** (extracted claim is verbatim from source) |
  | `verdict-accuracy` | online | exact & adjacency accuracy on a six-point support scale; confusion matrix; **passage hallucination rate**; consistency across repeats; per-claim latency (p50/p95) and token usage for real cost/claim |
  | `redaction` | online | recall on gold identifiers with **leaks as named failures** (verbatim, and word-level for names); over-redaction count; known-gap tracking for documented engine gaps |
+ | `simplification` | online | faithfulness of rewritten text: **anchor recall** (critical facts like doses must survive), **fabricated numbers** (nothing invented), length-contract gates, readability grade (info) |
+ | `retrieval` | online | fidelity of retrieved records vs the authority: field presence, hand-verified **anchor fields** (author surnames, year, distinctive abstract phrases), and structural invariants that catch parser regressions (collapsed author arrays, `[object Object]` leakage) |
 
  Offline scorers run with no API key — fast enough for every commit. Online scorers exercise a live system through an adapter.
 
@@ -171,6 +173,26 @@ node src/runner.mjs --online --adapter ./src/adapters/redacta.mjs
 
  On its first run against the new gold set, the eval found two real engine bugs — relation phrases like "Next of kin:" swallowed nested name matches, and apostrophe surnames (O'Brien) were dropped from name capture. Both were fixed in `@pharmatools/redacta` 1.2.1 and confirmed by the eval (`knownGap_closed: 2`), then promoted to gold. Street-line address detection followed in 1.3.0, closing the last tracked gap. Current scorecard: **100% recall on 25 gold identifiers, 0 leaks, no open gaps**.
 
+ ## Third implementation: Patiently AI
+
+ [Patiently AI](https://www.pharmatools.ai/patiently-ai) exercises the **simplify capability** — faithfulness scoring for text that is paraphrase by design. On its first production run, the eval found the simplifier **dropping safety-critical specifics**: an antibiotic dose vanished from a Brief discharge summary and a haemoglobin value from a lab letter (anchor recall 86%). Root cause: the composed prompt had no faithfulness rule. A preservation rule added to Patiently's tone prompts (additively — the backend is shared with another product) took the next run to **100% anchor recall, 0 dropped facts, 0 fabricated numbers, 0 contract violations** — with the readability grade slightly *better* than before the fix.
+
+ ```bash
+ node src/runner.mjs --online --adapter ./src/adapters/patiently.mjs
+ ```
+
+ ## Fourth implementation: PubCrawl
+
+ [PubCrawl](https://www.pharmatools.ai/pubcrawl) is the odd one out — an MCP server wrapping PubMed and ClinicalTrials.gov, with **no model**. It exercises the **retrieval capability**: the deterministic layer everything else grounds on. A silent XML-parser regression (a collapsed author array, a merged abstract) would poison every citation built on the record, so the scorer checks retrieval fidelity against hand-verified anchors and structural invariants. The adapter drives PubCrawl through its real MCP interface, so the full production parse path is under test — and `scripts/capture-retrieval-case.mjs` bootstraps gold cases from live records for you to verify against the source.
+
+ ```bash
+ npm install --no-save @modelcontextprotocol/sdk
+ node scripts/capture-retrieval-case.mjs 31904519 > datasets/cases/retrieval-example.json # then verify anchors
+ node src/runner.mjs --online --adapter ./src/adapters/pubcrawl.mjs
+ ```
+
+ That OpenGATE scores a non-AI system at all is the point: **evidence-grounded AI is only as trustworthy as the retrieval beneath it**, so the retrieval belongs in the same regression gate.
+
  ## Layout
 
  ```
@@ -192,7 +214,7 @@ opengate/
  ## Roadmap
 
 
- - **Third adapter** — Patiently AI (faithfulness evaluation for patient-language simplification)
+ - **Per-adapter baselines** — `results/baseline.json` is a single reference; runs against different adapters (RefCheckr QA vs Patiently simplification) should baseline separately (workable today via `--results <dir>`, first-class support planned)
  - **Author-year in RefCheckr production** — `detectAuthorYear()` now lands "Smith 2020"-style keys in the reference implementation; adopting them in RefCheckr's numeric-keyed citation mapping is tracked separately
  - **Number-adjacent superscript** — `week 24.1` is genuinely ambiguous with decimals; remains a tracked known gap
  - **Growing gold set** — more domains, all six verdict types, real-world reference material
@@ -7,7 +7,7 @@ One JSON file per case in `datasets/cases/` (files starting with `_` are ignored
  | `id` | yes | all | Unique slug. |
  | `title` / `notes` | no | — | Human context / provenance. |
  | `manuscript` | yes | claim-extraction | The pasted section, with citation markers exactly as authored. |
- | `goldClaims[]` | yes | citation-detection, claim-extraction | The verifiable claims a reviewer should extract. Each has `originalText` (with markers), `text` (clean), and `citations` (number array). |
+ | `goldClaims[]` | yes | citation-detection, claim-extraction | The verifiable claims a reviewer should extract. Each has `originalText` (with markers), `text` (clean), and `citations` — an array of numbers and/or author-year string keys (e.g. "Smith 2020", "Meyer 2020a"). Numeric [N] markers are stripped from `text`; author-year mentions are grammatical prose and are never stripped, so for pure author-year claims `text === originalText`. |
  | `goldNonClaims[]` | no | claim-extraction | Sentences that should **not** be extracted (background, aims, transitions). Drives precision/leakage. |
  | `references{}` | online only | verdict-accuracy | Map of citation-number → `{ name, text }`. |
  | `goldVerdicts[]` | online only | verdict-accuracy | `{ claimText, citation, verdict }` where `verdict` ∈ the six-point scale. Mark `_requires: "online"`. |
@@ -28,3 +28,30 @@ Cases for the `redaction` scorer, exercising an adapter's `redact()` capability.
  | `text` | yes | The source clinical note. |
  | `goldEntities[]` | yes | `{ type, value }` — identifiers that MUST be removed. Any that survive (verbatim, or word-level for `*_NAME` types) are **leaks** and fail the run. |
  | `knownGapEntities[]` | no | `{ type, value, comment }` — identifiers the system does not yet catch. Reported as tracked targets (`knownGap_open` / `knownGap_closed`), not failures. Use `comment` to document the reproduction. When a gap closes, promote the entity to `goldEntities`. |
+
+ ## Simplification cases (`kind: "simplification"`)
+
+ Cases for the `simplification` scorer, exercising an adapter's `simplify()` capability. Simplified text is paraphrase by design, so the scorer checks what is deterministic and clinically dangerous to get wrong, not verbatim prose.
+
+ | Field | Required | Meaning |
+ |---|---|---|
+ | `id` / `kind` | yes | `kind` must be `"simplification"`. |
+ | `text` | yes | The source clinical text (synthetic). |
+ | `audience` / `tone` / `length` | no | Passed to the adapter (defaults: Adult / Informative / Standard). |
+ | `anchors[]` | yes | `{ value, aliases? }` — critical facts that MUST survive simplification (drug names, doses, key values, timeframes). Matched case-insensitively, whitespace-tolerant. A missing anchor is a **dropped fact** and fails the run. |
+ | `allowedNewNumbers[]` | no | Numbers legitimately introduced by rephrasing (e.g. "twice daily" → "2 times a day"). Any other output number absent from the source is a **fabricated number** and fails the run. |
+ | `maxBullets` / `maxWordsPerBullet` | no | Length-contract gates (e.g. Patiently Brief: 3 / 20). Only checked when present. |
+
+ ## Retrieval cases (`kind: "retrieval"`)
+
+ Cases for the `retrieval` scorer, exercising an adapter's `fetchRecord()` capability against a deterministic retrieval system (e.g. PubCrawl). The failure that matters is a parser regression: a field dropped, collapsed, or garbled. Anchors are **independent ground truth** — copied from the source record (the PubMed page, the paper), never from the system's own output. Bootstrap with `scripts/capture-retrieval-case.mjs`, then verify.
+
+ | Field | Required | Meaning |
+ |---|---|---|
+ | `id` / `kind` | yes | `kind` must be `"retrieval"`. |
+ | `recordId` | yes | The stable identifier to fetch (PMID, NCT id). |
+ | `recordType` | no | `pubmed` (default) or `trial`. |
+ | `requireFields[]` | no | Field names that must be present and non-empty in the record (e.g. `title`, `authors`, `year`). |
+ | `anchors[]` | no | Per-field ground-truth checks, one of: `{ field, contains }` (verbatim substring survived), `{ field, equals }` (exact value), `{ field, minCount }` (array didn't collapse). At least one of `requireFields`/`anchors` is needed. |
+
+ The scorer also always applies structural invariants: `authors` must be a non-empty array of strings (not a collapsed single-author object), `title` a non-empty string, and no field may serialise to `"[object Object]"`.
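
As a rough illustration of the structural invariants listed above, a check along the following lines would flag a collapsed author object and `[object Object]` leakage. This is a hypothetical sketch, not the bundled scorer: the messages, the sample records, and the choice to inspect `authors` and `title` only when the key is present are all assumptions.

```javascript
// Hypothetical sketch of the invariants described in the schema text.
// Not the package's implementation; messages and sample records are made up.
function structuralProblems(record) {
  const problems = [];

  // Any value that was string-coerced from an object shows up as this literal.
  if (JSON.stringify(record).includes('[object Object]')) {
    problems.push('a field serialised to "[object Object]"');
  }

  // Assumption: only check a field when the record actually carries it.
  if ('authors' in record) {
    const a = record.authors;
    const ok = Array.isArray(a) && a.length > 0 &&
      a.every((x) => typeof x === 'string' && x.trim() !== '');
    if (!ok) problems.push('authors is not a non-empty array of strings');
  }
  if ('title' in record) {
    const t = record.title;
    if (typeof t !== 'string' || t.trim() === '') {
      problems.push('title is not a non-empty string');
    }
  }
  return problems;
}

// Illustrative records (invented, not real PubCrawl output).
const good = { title: 'A study', authors: ['Smith Jane', 'Jones Ann'], year: '2020' };
const collapsed = { title: 'A study', authors: { name: 'Smith Jane' } };
const leaked = { title: 'A study', authors: ['[object Object]'] };

console.log(structuralProblems(good));
console.log(structuralProblems(collapsed));
console.log(structuralProblems(leaked));
```

The point of keeping these checks separate from the per-case anchors is that they need no gold data, so they can run on every retrieval case.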
@@ -0,0 +1,14 @@
+ {
+   "id": "retrieval-EXAMPLE",
+   "kind": "retrieval",
+   "_note": "Template (ignored by the runner — filename starts with _). Copy to a real filename and populate from scripts/capture-retrieval-case.mjs, then VERIFY every anchor against the source paper. Anchors are independent ground truth — copied from the paper, never from PubCrawl's own output.",
+   "recordId": "PMID or NCT id here",
+   "recordType": "pubmed",
+   "requireFields": ["title", "authors", "year"],
+   "anchors": [
+     { "field": "authors", "contains": "FirstAuthorSurname" },
+     { "field": "authors", "minCount": 5 },
+     { "field": "year", "equals": "2020" },
+     { "field": "abstract_sections", "contains": "a distinctive verbatim phrase from the abstract" }
+   ]
+ }
@@ -0,0 +1,56 @@
+ {
+   "id": "case-authoryear-harvard",
+   "title": "Author-year (Harvard) citation style — thyroid eye disease",
+   "notes": "SYNTHETIC manuscript exercising pure author-year citations. Unlike numeric [N] markers, author-year mentions are grammatical prose and are never stripped: text === originalText.",
+   "manuscript": "Teprotumumab reduced proptosis by 2.82 mm versus placebo (Douglas et al., 2020). Smith and Jones (2019) reported durable responses at 52 weeks in an open-label extension. The safety profile was consistent with earlier findings (Kim, 2023). This study aimed to characterise long-term outcomes in routine practice.",
+   "goldClaims": [
+     {
+       "originalText": "Teprotumumab reduced proptosis by 2.82 mm versus placebo (Douglas et al., 2020).",
+       "text": "Teprotumumab reduced proptosis by 2.82 mm versus placebo (Douglas et al., 2020).",
+       "citations": [
+         "Douglas 2020"
+       ]
+     },
+     {
+       "originalText": "Smith and Jones (2019) reported durable responses at 52 weeks in an open-label extension.",
+       "text": "Smith and Jones (2019) reported durable responses at 52 weeks in an open-label extension.",
+       "citations": [
+         "Smith 2019"
+       ]
+     },
+     {
+       "originalText": "The safety profile was consistent with earlier findings (Kim, 2023).",
+       "text": "The safety profile was consistent with earlier findings (Kim, 2023).",
+       "citations": [
+         "Kim 2023"
+       ]
+     }
+   ],
+   "goldNonClaims": [
+     "This study aimed to characterise long-term outcomes in routine practice."
+   ],
+   "references": {
+     "Douglas 2020": {
+       "name": "douglas-2020-nejm.txt",
+       "text": "In this randomised trial of teprotumumab for thyroid eye disease, the least-squares mean reduction in proptosis was 2.82 mm greater with teprotumumab than with placebo at week 24."
+     },
+     "Smith 2019": {
+       "name": "smith-2019-extension.txt",
+       "text": "In the 52-week open-label extension, proptosis responses were maintained in the majority of participants, indicating durable treatment effect."
+     }
+   },
+   "goldVerdicts": [
+     {
+       "_requires": "online",
+       "claimText": "Teprotumumab reduced proptosis by 2.82 mm versus placebo (Douglas et al., 2020).",
+       "citation": "Douglas 2020",
+       "verdict": "strong_support"
+     },
+     {
+       "_requires": "online",
+       "claimText": "Smith and Jones (2019) reported durable responses at 52 weeks in an open-label extension.",
+       "citation": "Smith 2019",
+       "verdict": "strong_support"
+     }
+   ]
+ }
@@ -0,0 +1,48 @@
+ {
+   "id": "case-authoryear-mixed",
+   "title": "Mixed numeric + author-year citations in one manuscript",
+   "notes": "SYNTHETIC. Exercises numeric superscript and author-year styles side by side, an apostrophe surname (O'Connor), and a 2020a/2020b suffix pair that must stay distinct.",
+   "manuscript": "Adalimumab improved ACR20 response rates versus placebo.1,2 O'Connor et al. (2021) confirmed the effect in a real-world cohort. Retention was higher with combination therapy (Meyer 2020a; Meyer 2020b). Baseline characteristics are summarised in Table 1.",
+   "goldClaims": [
+     {
+       "originalText": "Adalimumab improved ACR20 response rates versus placebo.1,2",
+       "text": "Adalimumab improved ACR20 response rates versus placebo.",
+       "citations": [
+         1,
+         2
+       ]
+     },
+     {
+       "originalText": "O'Connor et al. (2021) confirmed the effect in a real-world cohort.",
+       "text": "O'Connor et al. (2021) confirmed the effect in a real-world cohort.",
+       "citations": [
+         "O'Connor 2021"
+       ]
+     },
+     {
+       "originalText": "Retention was higher with combination therapy (Meyer 2020a; Meyer 2020b).",
+       "text": "Retention was higher with combination therapy (Meyer 2020a; Meyer 2020b).",
+       "citations": [
+         "Meyer 2020a",
+         "Meyer 2020b"
+       ]
+     }
+   ],
+   "goldNonClaims": [
+     "Baseline characteristics are summarised in Table 1."
+   ],
+   "references": {
+     "O'Connor 2021": {
+       "name": "oconnor-2021-cohort.txt",
+       "text": "In this prospective real-world cohort of 412 patients with rheumatoid arthritis, adalimumab was associated with significantly improved ACR20 response rates compared with historical placebo controls."
+     }
+   },
+   "goldVerdicts": [
+     {
+       "_requires": "online",
+       "claimText": "O'Connor et al. (2021) confirmed the effect in a real-world cohort.",
+       "citation": "O'Connor 2021",
+       "verdict": "strong_support"
+     }
+   ]
+ }
@@ -0,0 +1,15 @@
+ {
+   "id": "retrieval-macra-vascular",
+   "kind": "retrieval",
+   "notes": "PMID 31904519 — Ann Vasc Surg 2020. A four-author paper: anchors are chosen to fail exactly where PubCrawl's known-risky XML paths would regress — author-array collapse (minCount), surname parsing (contains, first token of PubCrawl's 'Surname ForeName' format), and abstract-section truncation (distinctive opening phrase). Verified against the PubMed record.",
+   "recordId": "31904519",
+   "recordType": "pubmed",
+   "requireFields": ["title", "authors", "year"],
+   "anchors": [
+     { "field": "authors", "contains": "Haurani" },
+     { "field": "authors", "contains": "Satiani" },
+     { "field": "authors", "minCount": 4 },
+     { "field": "year", "equals": "2020" },
+     { "field": "abstract_sections", "contains": "The Medicare Access and CHIP Reauthorization Act" }
+   ]
+ }
@@ -0,0 +1,45 @@
+ {
+   "id": "simplify-discharge-brief",
+   "kind": "simplification",
+   "description": "SYNTHETIC discharge note. Brief mode exercises Patiently's documented contract: 2-3 bullets, under 20 words each. Anchors are the clinically critical facts a patient must not lose.",
+   "audience": "Adult",
+   "tone": "Informative",
+   "length": "Brief",
+   "text": "Discharge summary: You were admitted with community-acquired pneumonia. You have been started on amoxicillin 500 mg three times daily for 5 days. Please see your GP in 2 weeks for review. Return to hospital if you develop worsening breathlessness or a fever above 38 degrees.",
+   "anchors": [
+     {
+       "value": "amoxicillin"
+     },
+     {
+       "value": "500 mg",
+       "aliases": [
+         "500mg",
+         "500 milligrams"
+       ]
+     },
+     {
+       "value": "5 days",
+       "aliases": [
+         "five days"
+       ]
+     },
+     {
+       "value": "2 weeks",
+       "aliases": [
+         "two weeks"
+       ]
+     },
+     {
+       "value": "38",
+       "aliases": [
+         "38C",
+         "38 degrees"
+       ]
+     }
+   ],
+   "allowedNewNumbers": [
+     "3"
+   ],
+   "maxBullets": 3,
+   "maxWordsPerBullet": 20
+ }
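
To make the three gates concrete, here is a minimal sketch of how a case like the one above might be scored. It is not the bundled `simplification` scorer: the function name, the number-extraction regex, and the rule that a bullet is a line starting with `-`, `*` or `•` are assumptions, and the two model outputs are invented for illustration.

```javascript
// Hypothetical sketch, not the package's scorer. Parsing rules are assumptions.
const norm = (s) => String(s).toLowerCase().replace(/\s+/g, ' ').trim();
const numbersIn = (s) => String(s).match(/\d+(?:\.\d+)?/g) || [];

function scoreSimplification(goldCase, output) {
  const out = norm(output);

  // Anchor recall: an anchor survives if its value or any alias appears
  // (case-insensitive, whitespace-tolerant) in the output.
  const dropped = goldCase.anchors
    .filter((a) => ![a.value, ...(a.aliases || [])].some((v) => out.includes(norm(v))))
    .map((a) => a.value);

  // Fabricated numbers: in the output, absent from the source, not allow-listed.
  const known = new Set([...numbersIn(goldCase.text), ...(goldCase.allowedNewNumbers || [])]);
  const fabricated = numbersIn(output).filter((n) => !known.has(n));

  // Length contract: only checked when the case declares the limits.
  const bullets = output.split('\n').map((l) => l.trim()).filter((l) => /^[-*•]\s+/.test(l));
  const violations = [];
  if (goldCase.maxBullets != null && bullets.length > goldCase.maxBullets) {
    violations.push(`${bullets.length} bullets, max ${goldCase.maxBullets}`);
  }
  if (goldCase.maxWordsPerBullet != null) {
    for (const b of bullets) {
      const words = b.replace(/^[-*•]\s+/, '').split(/\s+/).length;
      if (words > goldCase.maxWordsPerBullet) violations.push(`bullet has ${words} words`);
    }
  }
  return { dropped, fabricated, violations };
}

// Abridged copy of the discharge case above.
const goldCase = {
  text: 'You have been started on amoxicillin 500 mg three times daily for 5 days. Please see your GP in 2 weeks for review. Return to hospital if you develop a fever above 38 degrees.',
  anchors: [
    { value: 'amoxicillin' },
    { value: '500 mg', aliases: ['500mg', '500 milligrams'] },
    { value: '5 days', aliases: ['five days'] },
    { value: '2 weeks', aliases: ['two weeks'] },
    { value: '38', aliases: ['38C', '38 degrees'] },
  ],
  allowedNewNumbers: ['3'],
  maxBullets: 3,
  maxWordsPerBullet: 20,
};

// Invented outputs: one faithful, one that drops the dose and invents a duration.
const faithful = '- Take amoxicillin 500mg 3 times a day for five days.\n- See your GP in 2 weeks.\n- Come back if your fever goes above 38C.';
const unfaithful = '- Take your antibiotic for 7 days.\n- See your GP in 2 weeks.';

const okResult = scoreSimplification(goldCase, faithful);
const badResult = scoreSimplification(goldCase, unfaithful);
console.log(okResult);
console.log(badResult);
```

The aliases carry most of the weight here: without `"500mg"` and `"five days"` the faithful paraphrase would be reported as dropping two facts.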
@@ -0,0 +1,41 @@
+ {
+   "id": "simplify-lab-results",
+   "kind": "simplification",
+   "description": "SYNTHETIC lab letter. Standard length, Reassuring tone. Values and the new medication must survive; no invented numbers.",
+   "audience": "Adult",
+   "tone": "Reassuring",
+   "length": "Standard",
+   "text": "Blood test results: your haemoglobin was 9.8 g/dL (below the normal range) and your ferritin was 8 micrograms/L (low), consistent with iron-deficiency anaemia. We recommend starting ferrous sulfate 200 mg twice daily. We will repeat the blood count in 8 weeks.",
+   "anchors": [
+     {
+       "value": "9.8"
+     },
+     {
+       "value": "ferritin"
+     },
+     {
+       "value": "ferrous sulfate",
+       "aliases": [
+         "ferrous sulphate",
+         "iron tablets",
+         "iron supplement"
+       ]
+     },
+     {
+       "value": "200 mg",
+       "aliases": [
+         "200mg",
+         "200 milligrams"
+       ]
+     },
+     {
+       "value": "8 weeks",
+       "aliases": [
+         "eight weeks"
+       ]
+     }
+   ],
+   "allowedNewNumbers": [
+     "2"
+   ]
+ }
@@ -0,0 +1,34 @@
+ {
+   "id": "simplify-medication-change",
+   "kind": "simplification",
+   "description": "SYNTHETIC dose-change letter. The new dose is the fact that must never be lost or altered.",
+   "audience": "Adult",
+   "tone": "Informative",
+   "length": "Standard",
+   "text": "Your levothyroxine dose has been increased from 50 micrograms to 100 micrograms once daily, following your recent TSH result of 8.2 mU/L. Please take the new dose each morning before breakfast. We will repeat your thyroid function tests in 6 to 8 weeks.",
+   "anchors": [
+     {
+       "value": "levothyroxine"
+     },
+     {
+       "value": "100 micrograms",
+       "aliases": [
+         "100 mcg",
+         "100mcg",
+         "100 µg"
+       ]
+     },
+     {
+       "value": "morning"
+     },
+     {
+       "value": "6 to 8 weeks",
+       "aliases": [
+         "6-8 weeks",
+         "six to eight weeks",
+         "6–8 weeks"
+       ]
+     }
+   ],
+   "allowedNewNumbers": []
+ }
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@pharmatools/opengate",
-   "version": "0.2.1",
+   "version": "0.4.0",
    "type": "module",
    "description": "OpenGATE — Open Grounded AI Testing & Evaluation. An open-source framework for evaluating evidence-grounded AI systems.",
    "license": "MIT",
@@ -0,0 +1,64 @@
+ // Patiently AI adapter — third bundled implementation, exercising the
+ // framework's simplify capability. Patiently AI (getpatiently.ai) converts
+ // clinical text into patient-friendly language via a Firebase Cloud Function.
+ //
+ // Config via env (the endpoint is public; no token needed):
+ //   PATIENTLY_API_URL     override the translate endpoint
+ //   PATIENTLY_EVAL_MODEL  optional label recorded in the scorecard
+ //
+ // Request contract:  POST { text, language, audience, tone, length }
+ // Response contract: { optimisedText }
+ // Patiently-flavoured keys: audience Child|Teenager|Adult|Carer,
+ // tone Friendly|Reassuring|Informative, length Brief|Standard|Detailed.
+
+ export const meta = { name: 'patiently' };
+
+ const DEFAULT_URL = 'https://us-central1-medicaltextoptimiser.cloudfunctions.net/translate';
+ const URL = process.env.PATIENTLY_API_URL || DEFAULT_URL;
+
+ export function onlineAvailable() {
+   return Boolean(URL);
+ }
+
+ export function onlineConfigHint() {
+   return 'Set PATIENTLY_API_URL (or unset it to use the production endpoint).';
+ }
+
+ export function runModel() {
+   return process.env.PATIENTLY_EVAL_MODEL || null;
+ }
+
+ // ── Timing capture ──────────────────────────────────────────────────────
+ const _calls = [];
+ export function resetTiming() { _calls.length = 0; }
+ export function callLatencies() { return _calls.map(c => c.ms); }
+
+ /**
+  * Simplify capability.
+  * @param {object} req — { text, audience?, tone?, length?, language? }
+  * @returns {Promise<{ text: string }>}
+  */
+ export async function simplify(req) {
+   const t0 = performance.now();
+   try {
+     const res = await fetch(URL, {
+       method: 'POST',
+       headers: { 'Content-Type': 'application/json' },
+       body: JSON.stringify({
+         text: req.text,
+         language: req.language || 'en',
+         audience: req.audience || 'Adult',
+         tone: req.tone || 'Informative',
+         length: req.length || 'Standard',
+       }),
+     });
+     const data = await res.json().catch(() => ({}));
+     if (!res.ok) throw new Error(data.error || `translate → HTTP ${res.status}`);
+     if (typeof data.optimisedText !== 'string') {
+       throw new Error('translate response missing optimisedText');
+     }
+     return { text: data.optimisedText };
+   } finally {
+     _calls.push({ ms: performance.now() - t0 });
+   }
+ }
@@ -0,0 +1,107 @@
+ // PubCrawl adapter — fourth bundled implementation, exercising the framework's
+ // retrieval capability. PubCrawl (@pharmatools/pubcrawl) is an MCP server that
+ // gives AI clients access to PubMed, ClinicalTrials.gov, and drug labelling.
+ //
+ // It is NOT an AI system — it's deterministic wrappers around public APIs. The
+ // eval measures RETRIEVAL FIDELITY: does the record a client receives match the
+ // authority (right title, all authors, intact abstract)? A silent XML-parser
+ // regression here would poison every downstream grounding claim, so this is the
+ // foundation the QA/simplify capabilities build on.
+ //
+ // The adapter talks to PubCrawl through its real MCP interface (no
+ // re-implementation of the tool handlers), so the full production parse path is
+ // under test. The MCP SDK is imported dynamically — install it alongside
+ // PubCrawl; OpenGATE core stays zero-dependency.
+ //
+ // Config via env:
+ //   PUBCRAWL_MCP_URL  HTTP MCP endpoint, e.g. http://localhost:8080/mcp
+ //                     (start PubCrawl with `npm run start:http`)
+ //   PUBCRAWL_CMD      stdio spawn command (default: "npx")
+ //   PUBCRAWL_ARGS     stdio args, space-separated (default: "-y @pharmatools/pubcrawl")
+ //   NCBI_API_KEY      forwarded to the server (higher NCBI rate limit)
+ // If PUBCRAWL_MCP_URL is set it wins; otherwise the server is spawned over stdio.
+
+ export const meta = { name: 'pubcrawl' };
+
+ let sdk = null;
+ let loadError = null;
+ try {
+   const [{ Client }, stdio, http] = await Promise.all([
+     import('@modelcontextprotocol/sdk/client/index.js'),
+     import('@modelcontextprotocol/sdk/client/stdio.js'),
+     import('@modelcontextprotocol/sdk/client/streamableHttp.js').catch(() => ({})),
+   ]);
+   sdk = { Client, StdioClientTransport: stdio.StdioClientTransport, StreamableHTTPClientTransport: http.StreamableHTTPClientTransport };
+ } catch (err) {
+   loadError = err;
+ }
+
+ const HTTP_URL = process.env.PUBCRAWL_MCP_URL;
+
+ export function onlineAvailable() {
+   return Boolean(sdk);
+ }
+
+ export function onlineConfigHint() {
+   if (!sdk) {
+     return 'MCP SDK not installed — run: npm install --no-save @modelcontextprotocol/sdk' +
+       (loadError ? ` (${loadError.code || loadError.message})` : '');
+   }
+   return 'PubCrawl adapter ready (set PUBCRAWL_MCP_URL for HTTP, or it spawns the server over stdio).';
+ }
+
+ // ── Timing ──────────────────────────────────────────────────────────────
+ const _calls = [];
+ export function resetTiming() { _calls.length = 0; }
+ export function callLatencies() { return _calls.map(c => c.ms); }
+
+ function newTransport() {
+   if (HTTP_URL) {
+     if (!sdk.StreamableHTTPClientTransport) throw new Error('HTTP transport unavailable in this SDK build');
+     return new sdk.StreamableHTTPClientTransport(new URL(HTTP_URL));
+   }
+   const command = process.env.PUBCRAWL_CMD || 'npx';
+   const args = (process.env.PUBCRAWL_ARGS || '-y @pharmatools/pubcrawl').split(/\s+/).filter(Boolean);
+   const env = { ...process.env };
+   return new sdk.StdioClientTransport({ command, args, env });
+ }
+
+ // PubCrawl tool per record type. Extend as retrieval gold cases grow.
+ const TOOL_FOR = {
+   pubmed: (id) => ({ name: 'get_abstract', arguments: { pmid: String(id) } }),
+   trial: (id) => ({ name: 'get_trial', arguments: { nctId: String(id) } }),
+ };
+
+ /**
+  * Retrieval capability. Fetches one record through PubCrawl's MCP interface and
+  * returns its parsed JSON as { record }.
+  * @param {object} req — { id, type? } (type default: 'pubmed')
+  */
+ export async function fetchRecord(req) {
+   const type = req.type || 'pubmed';
+   const build = TOOL_FOR[type];
+   if (!build) throw new Error(`unsupported record type "${type}"`);
+
+   const t0 = performance.now();
+   const client = new sdk.Client({ name: 'opengate', version: '0' }, { capabilities: {} });
+   const transport = newTransport();
+   try {
+     await client.connect(transport);
+     const res = await client.callTool(build(req.id));
+     if (res.isError) {
+       const msg = res.content?.map(c => c.text).join(' ') || 'tool error';
+       throw new Error(msg);
+     }
+     const text = (res.content || []).find(c => c.type === 'text')?.text ?? '';
+     let record;
+     try {
+       record = JSON.parse(text);
+     } catch {
+       throw new Error(`non-JSON tool response: ${text.slice(0, 120)}`);
+     }
+     return { record };
+   } finally {
+     _calls.push({ ms: performance.now() - t0 });
+     await client.close().catch(() => {});
+   }
+ }
@@ -24,6 +24,8 @@ const REQUIRED_BASE = ['onlineAvailable', 'onlineConfigHint'];
  const CAPABILITIES = {
    qa: ['splitClaims', 'analyzeBatch'], // claim extraction + verdicts against references
    redaction: ['redact'], // identifier removal from text
+   simplify: ['simplify'], // faithful simplification of source text
+   retrieval: ['fetchRecord'], // fidelity of retrieved records vs the authority
  };
 
  // Optional: validated if present; no-op defaults are supplied if absent, so
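
For readers unfamiliar with the contract, the sketch below shows one way a map like `CAPABILITIES` could be turned into per-adapter capability flags and skip reasons. Only the map's contents come from the hunk above; `detectCapabilities`, `skipReason`, the wording of the reason, and the stand-in adapter are hypothetical and should not be read as OpenGATE's actual API.

```javascript
// Hypothetical sketch. The map mirrors the hunk above; the helpers are illustrative.
const CAPABILITIES = {
  qa: ['splitClaims', 'analyzeBatch'],
  redaction: ['redact'],
  simplify: ['simplify'],
  retrieval: ['fetchRecord'],
};

/** A capability counts as present only when every one of its exports is a function. */
function detectCapabilities(adapter) {
  const found = {};
  for (const [name, needed] of Object.entries(CAPABILITIES)) {
    found[name] = needed.every((fn) => typeof adapter[fn] === 'function');
  }
  return found;
}

/** Reason a scorer could report when skipping, or null when it can run. */
function skipReason(capabilities, name) {
  return capabilities[name] ? null : `adapter lacks the "${name}" capability`;
}

// Stand-in adapter exposing only retrieval, similar in shape to the PubCrawl adapter.
const fakeAdapter = { fetchRecord: async () => ({ record: {} }) };
const caps = detectCapabilities(fakeAdapter);

console.log(caps);
console.log(skipReason(caps, 'qa'));
```

Requiring every export of a capability, not just one, matches the "complete capability" wording in ADAPTERS.md: an adapter with `splitClaims` but no `analyzeBatch` would not count as `qa`.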
@@ -69,7 +69,9 @@ const ETAL = String.raw`(?:et al\.?|and colleagues|and coworkers)`;
  /** Detect author-year citations in raw text; returns sorted unique keys. */
  export function detectAuthorYear(text) {
    const found = new Set();
-   const add = (name, year) => found.add(`${name} ${year.slice(0, 4)}`);
+   // Year suffixes are kept: "Smith 2020a" and "Smith 2020b" are distinct
+   // references in author-year styles.
+   const add = (name, year) => found.add(`${name} ${year}`);
 
    // 1. Narrative with parenthetical year: "Smith et al. (2020)",
    // "Smith and Jones (2019)", "Smith (2020)".
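
The effect of dropping `.slice(0, 4)` can be seen with a toy example. The snippet below is only an illustration: the regex is a simplified stand-in that handles the bare `Name 2020a` form, not the package's `detectAuthorYear()` patterns, and the variable names are invented.

```javascript
// Illustration only: a simplified stand-in for the key-building step.
const sample = 'Retention was higher with combination therapy (Meyer 2020a; Meyer 2020b).';

// Capitalised surname, a space, then a four-digit year with an optional letter suffix.
const hits = [...sample.matchAll(/([A-Z][A-Za-z']+) (\d{4}[a-z]?)/g)];

// Key built as in the removed line: the suffix is cut off, so both keys collide.
const truncatedKeys = new Set(hits.map(([, name, year]) => `${name} ${year.slice(0, 4)}`));

// Key built as in the added line: the suffix is kept, so the references stay distinct.
const preservedKeys = new Set(hits.map(([, name, year]) => `${name} ${year}`));

console.log([...truncatedKeys]);
console.log([...preservedKeys]);
```

Because the keys live in a `Set`, the truncated form silently merges two references into one, which is why the mixed-citation gold case lists `Meyer 2020a` and `Meyer 2020b` separately.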
package/src/runner.mjs CHANGED
@@ -99,6 +99,8 @@ const SCORERS = [
    './scorers/claim-extraction.mjs',
    './scorers/verdict-accuracy.mjs',
    './scorers/redaction.mjs',
+   './scorers/simplification.mjs',
+   './scorers/retrieval.mjs',
  ];
 
  async function loadJsonDir(dir, { skipPrefix } = {}) {
@@ -0,0 +1,143 @@
1
+ // ONLINE scorer — retrieval fidelity.
2
+ //
3
+ // Exercises the adapter's fetchRecord() capability against gold cases of kind
4
+ // "retrieval": a stable record ID plus a few HAND-VERIFIED anchor fields.
5
+ // Unlike the model-scoring scorers, the system under test is deterministic
6
+ // (PubCrawl wraps NCBI / ClinicalTrials.gov). The failure that matters here is
7
+ // a PARSE REGRESSION: a record comes back but a field is dropped, collapsed,
8
+ // or garbled — e.g. an authors array that collapses to a single object when a
9
+ // paper has one author, or an abstract whose sections merge or truncate.
10
+ // Because everything downstream (RefCheckr, any RAG) grounds on these records,
11
+ // a silent parse bug poisons every citation built on it.
12
+ //
13
+ // Ground truth is independent of the system: anchors are copied from the
14
+ // paper/record itself, not from PubCrawl's output (see datasets/SCHEMA.md and
15
+ // scripts/capture-retrieval-case.mjs to bootstrap them). The scorer checks:
16
+ //
17
+ // • field presence (gate) — every field named in `requireFields` is present
18
+ // and non-empty (title, authors, year, …). Catches dropped/blank fields.
19
+ // • anchor fidelity (gate) — each anchor must be satisfied by the record:
20
+ // { field: "authors", contains: "Douglas" } surname survived parsing
21
+ // { field: "year", equals: "2020" }
22
+ // { field: "abstract", contains: "proptosis" } distinctive phrase intact
23
+ // { field: "authors", minCount: 5 } array didn't collapse
24
+ // • structural invariants (gate) — authors is a non-empty array of strings
25
+ // (not "[object Object]" from a collapsed single-author record); title is a
26
+ // non-empty string; no field serialises to "[object Object]".
27
+
28
+ export const meta = { id: 'retrieval', mode: 'online' };
29
+
30
+ const norm = (s) => String(s).toLowerCase().replace(/\s+/g, ' ').trim();
31
+
32
+ /** Read a top-level field from a record. Arrays are returned as-is; asText() joins them for text ops. */
33
+ function fieldValue(record, field) {
34
+ const v = record?.[field];
35
+ if (Array.isArray(v)) return v;
36
+ return v;
37
+ }
38
+ function asText(v) {
39
+ return Array.isArray(v) ? v.map(x => (typeof x === 'string' ? x : JSON.stringify(x))).join(' ') : String(v ?? '');
40
+ }
41
+
42
+ function checkAnchor(record, a) {
43
+ const v = fieldValue(record, a.field);
44
+ if (a.minCount != null) {
45
+ return Array.isArray(v) && v.length >= a.minCount
46
+ ? null : `${a.field} has ${Array.isArray(v) ? v.length : 0} items, expected ≥ ${a.minCount}`;
47
+ }
48
+ if (a.equals != null) {
49
+ return norm(asText(v)) === norm(a.equals) ? null : `${a.field} = "${asText(v)}", expected "${a.equals}"`;
50
+ }
51
+ if (a.contains != null) {
52
+ return norm(asText(v)).includes(norm(a.contains)) ? null : `${a.field} missing "${a.contains}"`;
53
+ }
54
+ return `anchor on ${a.field} has no check (equals/contains/minCount)`;
55
+ }
56
+
57
+ function structuralProblems(record) {
58
+ const problems = [];
59
+ const flat = JSON.stringify(record);
60
+ if (flat.includes('[object Object]')) problems.push('record contains "[object Object]" (a value failed to serialise — likely a parser shape bug)');
61
+ if ('authors' in record) {
62
+ const a = record.authors;
63
+ if (!Array.isArray(a)) problems.push('authors is not an array (single-author collapse?)');
64
+ else if (a.some(x => typeof x !== 'string' || !x.trim())) problems.push('authors contains a non-string / empty entry');
65
+ }
66
+ if ('title' in record && (typeof record.title !== 'string' || !record.title.trim())) {
67
+ problems.push('title is missing or not a non-empty string');
68
+ }
69
+ return problems;
70
+ }
71
+
72
+ export async function run({ cases, adapter }) {
73
+ if (!adapter.capabilities.retrieval) {
74
+ return { meta, skipped: true, reason: `adapter "${adapter.name}" has no retrieval capability` };
75
+ }
76
+ if (!adapter.onlineAvailable()) {
77
+ return { meta, skipped: true, reason: adapter.onlineConfigHint() };
78
+ }
79
+ const goldCases = cases.filter(c => c.kind === 'retrieval' && c.id && (c.anchors || c.requireFields));
80
+ if (goldCases.length === 0) {
81
+ return { meta, skipped: true, reason: 'No cases of kind "retrieval" with anchors/requireFields.' };
82
+ }
83
+
84
+ adapter.resetTiming();
85
+
86
+ const perCase = [];
87
+ const failures = [];
88
+
89
+ for (const c of goldCases) {
90
+ let record;
91
+ try {
92
+ const res = await adapter.fetchRecord({ id: c.recordId, type: c.recordType });
93
+ record = res?.record ?? res;
94
+ } catch (err) {
95
+ failures.push(`case ${c.id}: ${err.message}`);
96
+ continue;
97
+ }
98
+ if (!record || typeof record !== 'object') {
99
+ failures.push(`NO RECORD in ${c.id}: fetchRecord returned nothing usable`);
100
+ continue;
101
+ }
102
+
103
+ const caseFailures = [];
104
+ for (const f of c.requireFields || []) {
105
+ const v = fieldValue(record, f);
106
+ const empty = v == null || (Array.isArray(v) ? v.length === 0 : String(v).trim() === '');
107
+ if (empty) caseFailures.push(`missing field "${f}"`);
108
+ }
109
+ for (const a of c.anchors || []) {
110
+ const problem = checkAnchor(record, a);
111
+ if (problem) caseFailures.push(`anchor ${JSON.stringify(a)} — ${problem}`);
112
+ }
113
+ for (const s of structuralProblems(record)) caseFailures.push(s);
114
+
115
+ for (const cf of caseFailures) failures.push(`FIDELITY ${c.id}: ${cf}`);
116
+ perCase.push({
117
+ case: c.id, recordId: c.recordId,
118
+ anchors: (c.anchors || []).length,
119
+ requireFields: (c.requireFields || []).length,
120
+ problems: caseFailures,
121
+ });
122
+ }
123
+
124
+ const totalChecks = perCase.reduce((a, p) => a + p.anchors + p.requireFields, 0);
125
+ const failedChecks = perCase.reduce((a, p) => a + p.problems.length, 0);
126
+ const latencies = adapter.callLatencies();
127
+
128
+ const metrics = {
129
+ n_cases: perCase.length,
130
+ n_checks: totalChecks,
131
+ fidelity: round(totalChecks ? 1 - Math.min(failedChecks, totalChecks) / totalChecks : (perCase.length && !failedChecks ? 1 : 0)),
132
+ failed_checks: failedChecks,
133
+ ...(latencies.length ? { latency_p50_ms: Math.round(percentileOf(latencies, 50)) } : {}),
134
+ };
135
+
136
+ return { meta, metrics, detail: { perCase }, failures, passed: failures.length === 0 };
137
+ }
138
+
139
+ function percentileOf(values, p) {
140
+ const xs = values.filter(Number.isFinite).sort((a, b) => a - b);
141
+ return xs.length ? xs[Math.min(xs.length - 1, Math.floor((p / 100) * xs.length))] : 0;
142
+ }
143
+ function round(x) { return Math.round(x * 1000) / 1000; }
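To illustrate how the anchor checks surface a single-author collapse, here is a simplified sketch. The record is made up for this example, `asText` is reduced to a plain join, and the problem messages are shortened; only the anchor shapes (`equals`, `contains`, `minCount`) follow the scorer above.

```javascript
const norm = (s) => String(s).toLowerCase().replace(/\s+/g, ' ').trim();
// Simplified: the scorer also JSON-stringifies non-string array entries.
const asText = (v) => (Array.isArray(v) ? v.join(' ') : String(v ?? ''));

// Returns null when the anchor is satisfied, otherwise a short problem string.
function checkAnchor(record, a) {
  const v = record[a.field];
  if (a.minCount != null) {
    return Array.isArray(v) && v.length >= a.minCount ? null : `${a.field} too short`;
  }
  if (a.equals != null) {
    return norm(asText(v)) === norm(a.equals) ? null : `${a.field} mismatch`;
  }
  if (a.contains != null) {
    return norm(asText(v)).includes(norm(a.contains)) ? null : `${a.field} missing "${a.contains}"`;
  }
  return `anchor on ${a.field} has no check`;
}

// Made-up record in which `authors` collapsed from an array to a single object.
const collapsed = { title: 'Example trial', year: '2020', authors: { name: 'Douglas' } };
const anchors = [
  { field: 'year', equals: '2020' },
  { field: 'authors', contains: 'Douglas' },
  { field: 'authors', minCount: 5 },
];

const problems = anchors.map((a) => checkAnchor(collapsed, a)).filter(Boolean);
const fidelity = Math.round((1 - problems.length / anchors.length) * 1000) / 1000;

console.log(problems); // [ 'authors missing "Douglas"', 'authors too short' ]
console.log(fidelity); // 0.333
```

The collapsed object stringifies to `[object Object]`, so the surname anchor and the count anchor both fail while the year anchor still passes.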
@@ -0,0 +1,163 @@
1
+ // ONLINE scorer — simplification faithfulness.
2
+ //
3
+ // Exercises the adapter's simplify() capability against gold cases of kind
4
+ // "simplification": a source clinical text plus the facts that must survive
5
+ // simplification. Simplified text is paraphrase BY DESIGN, so verbatim checks
6
+ // on prose are meaningless — instead the scorer measures what is
7
+ // deterministically checkable and clinically dangerous to get wrong:
8
+ //
9
+ // • anchor recall (gate) — critical facts (drug names, doses, key values)
10
+ // must appear in the output, matched case-insensitively with per-anchor
11
+ // aliases ("5 mg" ≈ "5mg"). A dropped dose is the failure that matters.
12
+ // • fabricated numbers (gate) — every number in the output must exist in
13
+ // the source, an anchor/alias, or the case's allowedNewNumbers list.
14
+ // An invented value in patient-facing text is the worst failure there is.
15
+ // • length contract (gate, when the case specifies) — e.g. Patiently's
16
+ // Brief mode: ≤3 bullets, ≤20 words per bullet. Contract drift here
17
+ // previously hid silent prompt-fallback bugs.
18
+ // • readability (info) — Flesch-Kincaid grade of the output; reported,
19
+ // not gated, in v1.
20
+ //
21
+ // Case schema (datasets/SCHEMA.md):
22
+ // { "id", "kind": "simplification", "text", "audience", "tone", "length",
23
+ // "anchors": [{ "value", "aliases": [] }],
24
+ // "allowedNewNumbers": ["2"], "maxBullets": 3, "maxWordsPerBullet": 20 }
25
+
26
+ import { mean } from '../lib/metrics.mjs';
27
+
28
+ export const meta = { id: 'simplification', mode: 'online' };
29
+
30
+ const norm = (s) => String(s).toLowerCase().replace(/\s+/g, ' ');
31
+ /** Whitespace-tolerant, case-insensitive containment. */
32
+ function contains(haystack, needle) {
33
+ const h = norm(haystack).replace(/\s/g, '');
34
+ const n = norm(needle).replace(/\s/g, '');
35
+ return n.length > 0 && h.includes(n);
36
+ }
37
+
38
+ const NUM_RE = /\d+(?:\.\d+)?/g;
39
+ const numbersIn = (s) => new Set((String(s).match(NUM_RE) || []).map(n => n.replace(/^0+(?=\d)/, '')));
40
+
41
+ /** Lines that look like bullets: -, •, *, or numbered ("1." or "1)") style. */
42
+ function bulletLines(text) {
43
+ return String(text).split(/\r?\n/).map(l => l.trim())
44
+ .filter(l => /^([-•*]|\d+[.)])\s+/.test(l));
45
+ }
46
+
47
+ // Flesch-Kincaid grade with a vowel-group syllable heuristic. Approximate,
48
+ // which is why readability is reported rather than gated.
49
+ function fleschKincaidGrade(text) {
50
+ const words = String(text).toLowerCase().match(/[a-z]+/g) || [];
51
+ const sentences = Math.max(1, (String(text).match(/[.!?]+/g) || []).length);
52
+ if (!words.length) return null;
53
+ let syllables = 0;
54
+ for (const w of words) {
55
+ const groups = (w.replace(/e$/, '').match(/[aeiouy]+/g) || []).length;
56
+ syllables += Math.max(1, groups);
57
+ }
58
+ return 0.39 * (words.length / sentences) + 11.8 * (syllables / words.length) - 15.59;
59
+ }
60
+
61
+ export async function run({ cases, adapter }) {
62
+ if (!adapter.capabilities.simplify) {
63
+ return { meta, skipped: true, reason: `adapter "${adapter.name}" has no simplify capability` };
64
+ }
65
+ if (!adapter.onlineAvailable()) {
66
+ return { meta, skipped: true, reason: adapter.onlineConfigHint() };
67
+ }
68
+ const goldCases = cases.filter(c => c.kind === 'simplification' && (c.anchors || []).length);
69
+ if (goldCases.length === 0) {
70
+ return { meta, skipped: true, reason: 'No cases of kind "simplification" with anchors.' };
71
+ }
72
+
73
+ adapter.resetTiming();
74
+
75
+ const perCase = [];
76
+ const failures = [];
77
+
78
+ for (const c of goldCases) {
79
+ let out;
80
+ try {
81
+ const res = await adapter.simplify({
82
+ text: c.text, audience: c.audience, tone: c.tone, length: c.length, language: c.language,
83
+ });
84
+ out = res.text ?? '';
85
+ } catch (err) {
86
+ failures.push(`case ${c.id}: ${err.message}`);
87
+ continue;
88
+ }
89
+
90
+ // Anchor recall: value or any alias must survive.
91
+ const missed = (c.anchors || []).filter(a =>
92
+ ![a.value, ...(a.aliases || [])].some(v => contains(out, v)));
93
+ for (const a of missed) {
94
+ failures.push(`DROPPED FACT in ${c.id}: anchor "${a.value}" absent from simplified output`);
95
+ }
96
+
97
+ // Fabricated numbers: output numbers must come from somewhere legitimate.
98
+ const legitimate = new Set([
99
+ ...numbersIn(c.text),
100
+ ...(c.anchors || []).flatMap(a => [...numbersIn(a.value), ...(a.aliases || []).flatMap(x => [...numbersIn(x)])]),
101
+ ...(c.allowedNewNumbers || []).map(String),
102
+ ]);
103
+ const fabricated = [...numbersIn(out)].filter(n => !legitimate.has(n));
104
+ for (const n of fabricated) {
105
+ failures.push(`FABRICATED NUMBER in ${c.id}: "${n}" appears in output but not in source`);
106
+ }
107
+
108
+ // Length contract (only when the case declares one).
109
+ const bullets = bulletLines(out);
110
+ const contractViolations = [];
111
+ if (c.maxBullets != null && bullets.length > c.maxBullets) {
112
+ contractViolations.push(`${bullets.length} bullets > max ${c.maxBullets}`);
113
+ }
114
+ if (c.maxWordsPerBullet != null) {
115
+ for (const b of bullets) {
116
+ const words = b.replace(/^([-•*]|\d+[.)])\s+/, '').split(/\s+/).filter(Boolean).length;
117
+ if (words > c.maxWordsPerBullet) contractViolations.push(`bullet has ${words} words > max ${c.maxWordsPerBullet}`);
118
+ }
119
+ }
120
+ if (c.maxBullets != null && bullets.length === 0) {
121
+ contractViolations.push('bullet output expected, none found');
122
+ }
123
+ for (const v of contractViolations) failures.push(`CONTRACT in ${c.id}: ${v}`);
124
+
125
+ perCase.push({
126
+ case: c.id,
127
+ anchors: (c.anchors || []).length,
128
+ anchorsMissed: missed.map(a => a.value),
129
+ fabricated,
130
+ contractViolations,
131
+ grade: round(fleschKincaidGrade(out) ?? -1),
132
+ outputChars: out.length,
133
+ bullets: bullets.length,
134
+ // The output itself, so a dropped-fact failure can be diagnosed from the
135
+ // snapshot (was it omitted, reworded past the aliases, or replaced?).
136
+ output: out.slice(0, 600),
137
+ });
138
+ }
139
+
140
+ const totalAnchors = perCase.reduce((a, p) => a + p.anchors, 0);
141
+ const totalMissed = perCase.reduce((a, p) => a + p.anchorsMissed.length, 0);
142
+ const latencies = adapter.callLatencies();
143
+
144
+ const metrics = {
145
+ n_cases: perCase.length,
146
+ n_anchors: totalAnchors,
147
+ anchor_recall: round(totalAnchors ? 1 - totalMissed / totalAnchors : 0),
148
+ dropped_facts: totalMissed,
149
+ fabricated_numbers: perCase.reduce((a, p) => a + p.fabricated.length, 0),
150
+ contract_violations: perCase.reduce((a, p) => a + p.contractViolations.length, 0),
151
+ mean_grade: round(mean(perCase.map(p => p.grade).filter(g => g >= 0))),
152
+ ...(latencies.length ? { latency_p50_ms: Math.round(percentileOf(latencies, 50)) } : {}),
153
+ ...(adapter.runModel() ? { run_model: adapter.runModel() } : {}),
154
+ };
155
+
156
+ return { meta, metrics, detail: { perCase }, failures, passed: failures.length === 0 };
157
+ }
158
+
159
+ function percentileOf(values, p) {
160
+ const xs = values.filter(Number.isFinite).sort((a, b) => a - b);
161
+ return xs.length ? xs[Math.min(xs.length - 1, Math.floor((p / 100) * xs.length))] : 0;
162
+ }
163
+ function round(x) { return Math.round(x * 100) / 100; }
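The three gated checks can be traced on a small made-up case. The source text, output, and `allowedNewNumbers` below are invented for illustration; the regular expressions are the ones used in the scorer above, and `squash` is a hypothetical name for its whitespace-stripping normalisation.

```javascript
const NUM_RE = /\d+(?:\.\d+)?/g;
// Set of numeric tokens in a string, with leading zeros stripped ("05" -> "5").
const numbersIn = (s) =>
  new Set((String(s).match(NUM_RE) || []).map((n) => n.replace(/^0+(?=\d)/, '')));

// Whitespace-tolerant, case-insensitive containment ("5 mg" matches "5mg").
const squash = (s) => String(s).toLowerCase().replace(/\s+/g, '');
const contains = (haystack, needle) =>
  squash(needle).length > 0 && squash(haystack).includes(squash(needle));

// Made-up case and output, for illustration only.
const source = 'Take the tablet 5 mg once daily for 14 days.';
const output = [
  '- Take the tablet 5mg every day',
  '- Keep going for 2 weeks',
  '- Call us after 10 days if you feel dizzy',
].join('\n');
const allowedNewNumbers = ['2'];

// Anchor recall: the dose survives despite the changed spacing.
const anchorKept = contains(output, '5 mg');

// Fabricated numbers: anything not in the source or the allow-list.
const legitimate = new Set([...numbersIn(source), ...allowedNewNumbers]);
const fabricated = [...numbersIn(output)].filter((n) => !legitimate.has(n));

// Length contract: count bullets and the words in each one.
const BULLET_RE = /^([-•*]|\d+[.)])\s+/;
const bullets = output.split(/\r?\n/).map((l) => l.trim()).filter((l) => BULLET_RE.test(l));
const wordCounts = bullets.map((b) => b.replace(BULLET_RE, '').split(/\s+/).filter(Boolean).length);

console.log(anchorKept); // true
console.log(fabricated); // [ '10' ]
console.log(wordCounts); // [ 6, 5, 9 ]
```

The "2" in "2 weeks" passes only because the case allow-lists it, while "10" is reported as fabricated because it appears in neither the source nor the allow-list.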