healthcare-agents 1.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/commands/eval.md +350 -0
- package/AGENTS.md +26 -0
- package/CHANGELOG.md +105 -0
- package/INSTALL.md +274 -0
- package/LICENSE +189 -0
- package/README.md +479 -0
- package/VERSION +1 -0
- package/agents/clinical-care-management-specialist.md +432 -0
- package/agents/clinical-case-manager.md +439 -0
- package/agents/clinical-documentation-improvement-specialist.md +442 -0
- package/agents/clinical-infection-prevention-specialist.md +434 -0
- package/agents/clinical-prior-authorization-specialist.md +436 -0
- package/agents/clinical-referral-specialist.md +455 -0
- package/agents/clinical-research-coordinator.md +445 -0
- package/agents/clinical-utilization-management-specialist.md +442 -0
- package/agents/emergency-preparedness-coordinator.md +468 -0
- package/agents/healthit-clinical-data-analyst.md +469 -0
- package/agents/healthit-epic-applications-analyst.md +466 -0
- package/agents/healthit-informatics-manager.md +582 -0
- package/agents/healthit-information-manager.md +461 -0
- package/agents/healthit-interoperability-engineer.md +518 -0
- package/agents/healthit-telehealth-program-manager.md +462 -0
- package/agents/operations-ambulatory-manager.md +432 -0
- package/agents/operations-home-health-administrator.md +433 -0
- package/agents/operations-hospital-administrator.md +448 -0
- package/agents/operations-long-term-care-administrator.md +445 -0
- package/agents/operations-physician-practice-manager.md +438 -0
- package/agents/operations-supply-chain-manager.md +435 -0
- package/agents/operations-workforce-manager.md +475 -0
- package/agents/payer-credentialing-enrollment-coordinator.md +470 -0
- package/agents/payer-managed-care-analyst.md +471 -0
- package/agents/payer-medicare-medicaid-specialist.md +437 -0
- package/agents/payer-medicare-outreach-coordinator.md +469 -0
- package/agents/payer-relations-specialist.md +443 -0
- package/agents/payer-value-based-care-manager.md +436 -0
- package/agents/pharmacy-benefits-specialist.md +441 -0
- package/agents/pharmacy-medication-safety-specialist.md +455 -0
- package/agents/pophealth-community-health-coordinator.md +573 -0
- package/agents/pophealth-population-health-manager.md +507 -0
- package/agents/pophealth-surveillance-coordinator.md +484 -0
- package/agents/quality-accreditation-specialist.md +439 -0
- package/agents/quality-compliance-officer.md +490 -0
- package/agents/quality-improvement-specialist.md +441 -0
- package/agents/quality-patient-experience-coordinator.md +439 -0
- package/agents/quality-patient-safety-officer.md +456 -0
- package/agents/quality-process-improvement-analyst.md +532 -0
- package/agents/quality-risk-manager.md +453 -0
- package/agents/registry.json +2896 -0
- package/agents/revenue-340b-program-manager.md +520 -0
- package/agents/revenue-chargemaster-analyst.md +459 -0
- package/agents/revenue-contract-analyst.md +534 -0
- package/agents/revenue-cycle-specialist.md +476 -0
- package/agents/revenue-finance-manager.md +559 -0
- package/agents/revenue-medical-coding-specialist.md +530 -0
- package/agents/strategy-actuarial-advisor.md +467 -0
- package/agents/strategy-clinical-operations-consultant.md +440 -0
- package/agents/strategy-healthcare-consultant.md +423 -0
- package/agents/strategy-operations-consultant.md +481 -0
- package/agents/strategy-structural-improvement-consultant.md +468 -0
- package/bin/cli.js +476 -0
- package/docs/assets/healthcare-agents-hero.png +0 -0
- package/docs/eval/exam-architect-playbook.md +350 -0
- package/docs/eval/model-tuning.md +99 -0
- package/docs/eval/scorecard.json +519 -0
- package/docs/eval/scorecard.md +69 -0
- package/docs/eval/usability-release-check.md +33 -0
- package/docs/release-notes/2026-04-09-eval-loop-milestone.md +88 -0
- package/docs/release-notes/2026-04-23-agent-stack-optimization.md +68 -0
- package/docs/release-notes/2026-04-23-github-npx-install.md +22 -0
- package/docs/release-notes/2026-04-23-install-compatibility.md +20 -0
- package/docs/release-notes/2026-05-05-usability-release.md +54 -0
- package/docs/trust-and-safety.md +55 -0
- package/docs/usage/agent-selection-guide.md +84 -0
- package/docs/usage/handoff-map.md +41 -0
- package/docs/usage/starter-prompts.md +65 -0
- package/eval/meta/README.md +41 -0
- package/eval/meta/judge-calibration-cases.md +124 -0
- package/eval/meta/prompt-overfitting-check.md +92 -0
- package/eval/meta/scorer-consistency-check.md +110 -0
- package/eval/results.tsv +20 -0
- package/eval/role-baselines/INDEX.md +67 -0
- package/eval/role-baselines/clinical-care-management-specialist.md +49 -0
- package/eval/role-baselines/clinical-case-manager.md +49 -0
- package/eval/role-baselines/clinical-documentation-improvement-specialist.md +49 -0
- package/eval/role-baselines/clinical-infection-prevention-specialist.md +49 -0
- package/eval/role-baselines/clinical-prior-authorization-specialist.md +49 -0
- package/eval/role-baselines/clinical-referral-specialist.md +49 -0
- package/eval/role-baselines/clinical-research-coordinator.md +49 -0
- package/eval/role-baselines/clinical-utilization-management-specialist.md +49 -0
- package/eval/role-baselines/emergency-preparedness-coordinator.md +49 -0
- package/eval/role-baselines/healthit-clinical-data-analyst.md +49 -0
- package/eval/role-baselines/healthit-epic-applications-analyst.md +49 -0
- package/eval/role-baselines/healthit-informatics-manager.md +49 -0
- package/eval/role-baselines/healthit-information-manager.md +49 -0
- package/eval/role-baselines/healthit-interoperability-engineer.md +49 -0
- package/eval/role-baselines/healthit-telehealth-program-manager.md +49 -0
- package/eval/role-baselines/operations-ambulatory-manager.md +58 -0
- package/eval/role-baselines/operations-home-health-administrator.md +58 -0
- package/eval/role-baselines/operations-hospital-administrator.md +55 -0
- package/eval/role-baselines/operations-long-term-care-administrator.md +57 -0
- package/eval/role-baselines/operations-physician-practice-manager.md +54 -0
- package/eval/role-baselines/operations-supply-chain-manager.md +55 -0
- package/eval/role-baselines/operations-workforce-manager.md +55 -0
- package/eval/role-baselines/payer-credentialing-enrollment-coordinator.md +56 -0
- package/eval/role-baselines/payer-managed-care-analyst.md +54 -0
- package/eval/role-baselines/payer-medicare-medicaid-specialist.md +54 -0
- package/eval/role-baselines/payer-medicare-outreach-coordinator.md +55 -0
- package/eval/role-baselines/payer-relations-specialist.md +54 -0
- package/eval/role-baselines/payer-value-based-care-manager.md +55 -0
- package/eval/role-baselines/pharmacy-benefits-specialist.md +54 -0
- package/eval/role-baselines/pharmacy-medication-safety-specialist.md +55 -0
- package/eval/role-baselines/pophealth-community-health-coordinator.md +55 -0
- package/eval/role-baselines/pophealth-population-health-manager.md +55 -0
- package/eval/role-baselines/pophealth-surveillance-coordinator.md +55 -0
- package/eval/role-baselines/quality-accreditation-specialist.md +50 -0
- package/eval/role-baselines/quality-compliance-officer.md +50 -0
- package/eval/role-baselines/quality-improvement-specialist.md +50 -0
- package/eval/role-baselines/quality-patient-experience-coordinator.md +50 -0
- package/eval/role-baselines/quality-patient-safety-officer.md +50 -0
- package/eval/role-baselines/quality-process-improvement-analyst.md +50 -0
- package/eval/role-baselines/quality-risk-manager.md +50 -0
- package/eval/role-baselines/revenue-340b-program-manager.md +50 -0
- package/eval/role-baselines/revenue-chargemaster-analyst.md +50 -0
- package/eval/role-baselines/revenue-contract-analyst.md +50 -0
- package/eval/role-baselines/revenue-cycle-specialist.md +50 -0
- package/eval/role-baselines/revenue-finance-manager.md +50 -0
- package/eval/role-baselines/revenue-medical-coding-specialist.md +62 -0
- package/eval/role-baselines/strategy-actuarial-advisor.md +50 -0
- package/eval/role-baselines/strategy-clinical-operations-consultant.md +50 -0
- package/eval/role-baselines/strategy-healthcare-consultant.md +50 -0
- package/eval/role-baselines/strategy-operations-consultant.md +50 -0
- package/eval/role-baselines/strategy-structural-improvement-consultant.md +50 -0
- package/eval/rubric.md +75 -0
- package/eval/run-logs/README.md +36 -0
- package/install.sh +555 -0
- package/package.json +61 -0
- package/scripts/audit-agents.py +219 -0
- package/scripts/generate-scorecard.js +108 -0
- package/scripts/install-self-improvement-kit.sh +154 -0
- package/scripts/lint-agents.sh +127 -0
@@ -0,0 +1,350 @@
Evaluate and improve one healthcare agent's system prompt. Run up to 5 iterations of: prepare fixed questions -> answer -> judge -> improve -> re-score -> commit if better.

**Target agent:** agents/$ARGUMENTS.md

---

## Support Docs

Read these as operating instructions for the eval run:

- `eval/rubric.md` — frozen scoring metric.
- `eval/role-baselines/$ARGUMENTS.md` — frozen expected-capability baseline for this agent.
- `eval/role-baselines/INDEX.md` — confirms baseline coverage for all installable agents.
- `docs/eval/exam-architect-playbook.md` — scorer and question-writing guidance.
- `docs/eval/model-tuning.md` — model-role and manifest guidance for current SOTA models.
- `eval/meta/README.md` and linked meta-eval docs — optional calibration checks for release or close-call scoring.

Do not use or recreate the retired Python/DSPy harness. The active system is this command, the frozen rubric, role baselines, optional local run logs, and git.

---

## Preferred Execution Mode

When the runtime supports native subagents, model routing, or specialist workers, prefer a four-role loop:

- **Parent orchestrator** — owns preflight checks, fixed-question persistence within an iteration, line-cap enforcement, local run logs, `eval/results.tsv` append, and commit/revert.
- **Scorer/judge** — strongest available reasoning model; read-only; generates the exam, scores answers, checks calibration risk, and returns structured critique.
- **Editor** — faster strong model; edits only `agents/$ARGUMENTS.md` using the scorer's brief.
- **Adjudicator** — optional different model family for close deltas, high-risk roles, suspicious scoring, or release scoring.

Do not recursively invoke a full agent CLI inside itself when native subagents are available. If no subagents are available, a single agent may run the same workflow end-to-end.

Use `docs/eval/model-tuning.md` for model-role guidance. Pin exact model IDs in run logs when the runtime exposes them; do not rely only on UI or marketing names.

---

## Rules

- Never modify `eval/rubric.md`.
- Never modify files in `eval/role-baselines/` during an eval run.
- Never modify `eval/results.tsv` except to append rows.
- The editor may edit only `agents/$ARGUMENTS.md`.
- The parent may write local run artifacts under `eval/run-logs/`.
- One agent per session. After 5 iterations, print the summary and stop.
- Each iteration is independent. Read the agent file fresh each time.
- Scores across iterations are not comparable because questions differ. Only the same-question pre/post delta within an iteration is meaningful.
- Any before/after or score-only baseline run must persist the full question set before answers are generated. Focus rows, weak-area labels, or paraphrases are not enough for retesting.
- Do not broaden the role into generic healthcare administration. Preserve the agent's specialty, practitioner voice, and strongest differentiators.
- Prefer sharpening missing mechanics, formulas, workflows, citations, and deliverables over adding broad best-practices boilerplate.
- Do not log API keys, secrets, PHI, patient data, or private operational credentials.

---

## Preflight Checks

Run these checks before entering the loop. If any fail, print the error message and stop immediately.

1. Verify not on main:
   Run: `git branch --show-current`
   If the result is `main` or `master`, stop with: "Switch to a feature branch first: `git checkout -b eval/$ARGUMENTS`"

2. Verify clean index:
   Run: `git diff --cached --name-only`
   If there is any output, stop with: "You have staged changes. Commit or unstage them first: `git reset HEAD`"

3. Verify clean target file:
   Run: `git diff --name-only -- agents/$ARGUMENTS.md`
   If there is any output, stop with: "agents/$ARGUMENTS.md has uncommitted changes. Commit or stash first."

4. Verify agent file exists:
   Check that `agents/$ARGUMENTS.md` exists. If not, stop with: "No agent file found at agents/$ARGUMENTS.md"

5. Verify required eval files exist:
   Required:
   - `eval/rubric.md`
   - `eval/role-baselines/$ARGUMENTS.md`
   - `docs/eval/exam-architect-playbook.md`
   - `docs/eval/model-tuning.md`

   If the role baseline is missing, stop and create the baseline first. Baselines are now expected for all 51 installable agents.

6. Record session-start line count:
   Run: `wc -l < agents/$ARGUMENTS.md`
   Store as `BASELINE_LINES` for the whole session. Do not recompute it in later iterations.
   Compute `LINE_CAP = max(BASELINE_LINES * 1.2, BASELINE_LINES + 50)`, rounded up.
   Print: "Preflight passed. Baseline: {BASELINE_LINES} lines. Cap: {LINE_CAP} lines."

7. Create a local run-log directory:
   Use `eval/run-logs/<timestamp>-<agent-slug>/`.
   Raw run logs are ignored by git by default. See `eval/run-logs/README.md`.
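
The preflight checks and the line-cap rule above can be sketched as two pure helpers. This is an illustration under stated assumptions, not part of the package: `preflightError`, `lineCap`, and the `facts` object are hypothetical names, the caller is assumed to have already collected the git and filesystem facts, and the "Missing required eval files" wording is invented because the command does not give a message for that case. For agent files above 250 lines, which covers every agent in this package, the 1.2 factor sets the cap; for example, a 432-line agent gets a cap of 519 lines.

```javascript
// Sketch only: pure helpers that mirror the preflight rules above.
// All names here are hypothetical; nothing in the package defines them.

const REQUIRED_EVAL_FILES = (slug) => [
  "eval/rubric.md",
  `eval/role-baselines/${slug}.md`,
  "docs/eval/exam-architect-playbook.md",
  "docs/eval/model-tuning.md",
];

// Returns the first failing check's message, or null when all checks pass.
// `facts` is assumed to hold: branch (string), stagedFiles (string[]),
// targetDirty (boolean), existingFiles (string[] of repo-relative paths).
function preflightError(slug, facts) {
  if (facts.branch === "main" || facts.branch === "master") {
    return `Switch to a feature branch first: git checkout -b eval/${slug}`;
  }
  if (facts.stagedFiles.length > 0) {
    return "You have staged changes. Commit or unstage them first: git reset HEAD";
  }
  if (facts.targetDirty) {
    return `agents/${slug}.md has uncommitted changes. Commit or stash first.`;
  }
  if (!facts.existingFiles.includes(`agents/${slug}.md`)) {
    return `No agent file found at agents/${slug}.md`;
  }
  const missing = REQUIRED_EVAL_FILES(slug).filter(
    (path) => !facts.existingFiles.includes(path)
  );
  if (missing.length > 0) {
    // Message wording is an assumption; the command only says to stop.
    return `Missing required eval files: ${missing.join(", ")}`;
  }
  return null;
}

// LINE_CAP = max(BASELINE_LINES * 1.2, BASELINE_LINES + 50), rounded up.
// Integer arithmetic (x * 6 / 5) avoids floating-point drift in the 1.2 factor.
function lineCap(baselineLines) {
  const twentyPercentMore = Math.ceil((baselineLines * 6) / 5);
  return Math.max(twentyPercentMore, baselineLines + 50);
}
```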

---

## The Loop

Repeat the following up to 5 times. Number each iteration starting from 1.

### Step 1: Read Inputs

Read fresh:

- `agents/$ARGUMENTS.md`
- `eval/role-baselines/$ARGUMENTS.md`
- `eval/rubric.md`
- `docs/eval/exam-architect-playbook.md`
- `docs/eval/model-tuning.md`

For release scoring, close deltas, or high-risk agents, also read:

- `eval/meta/judge-calibration-cases.md`
- `eval/meta/scorer-consistency-check.md`
- `eval/meta/prompt-overfitting-check.md`

For usability release checks, also read `docs/eval/usability-release-check.md` and run its scenarios as smoke tests. These checks are release-only confidence checks; do not modify `eval/rubric.md`, `eval/role-baselines/`, or append `eval/results.tsv` for a usability smoke run.

### Step 2: Prepare 25 Questions

Choose the question source in this order:

1. If a deliberate train bank exists at `eval/question-banks/train/$ARGUMENTS.md`, use it for weakness discovery.
2. Otherwise generate 25 fresh questions from the agent prompt, role baseline, and `docs/eval/exam-architect-playbook.md`.
3. Use validation-bank questions only after a retained edit, and only for confidence checking.
4. Never use holdout-bank questions to guide edits.

Draw questions from both:

- The agent prompt: what it claims to know.
- The role baseline: what the role should know, including omitted responsibilities.

Default mix:

- 5 factual mechanics
- 8 applied reasoning
- 5 edge cases
- 4 cross-domain scenarios
- 3 deliverable-production prompts

Persist the exact questions for the iteration in the run log before answering them. Reuse these same questions for the post-edit re-score within the same iteration.

The question artifact is mandatory. Write `questions.md` and, when practical, `questions.json` with all 25 full questions. Each question must include:

- stable question ID, `Q001` through `Q025`
- full prompt text exactly as shown to the answerer
- question type: factual mechanics, applied reasoning, edge case, cross-domain scenario, or deliverable production
- source basis: agent prompt, role baseline, playbook blueprint, train bank, validation bank, holdout bank, or calibration case
- expected coverage: the capabilities, source families, or deliverable elements the answer should address
- scoring emphasis: accuracy, completeness, specificity, or mixed

Do not substitute short focus labels such as "network adequacy reporting" for the full prompt. If the full question text is not recoverable later, any retest must be labeled `fresh-comparable`, not `baseline-exact` or `same-question`.
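
A minimal sketch of how the question artifact could be checked before answering begins. The command lists what each question must contain but does not define a `questions.json` schema, so the field names `id`, `prompt`, and `type` are assumptions, as are the helper names `checkQuestionSet` and `countByType`. The check covers only the parts that are mechanical: 25 entries, sequential IDs, non-empty prompt text, and a recognized question type.

```javascript
// Sketch only: field names (id, prompt, type) are assumed, not defined by the package.

// Default mix from the step above; the counts sum to 25.
const DEFAULT_MIX = {
  "factual mechanics": 5,
  "applied reasoning": 8,
  "edge case": 5,
  "cross-domain scenario": 4,
  "deliverable production": 3,
};

// Returns a list of problems; an empty list means the artifact is usable
// for a same-question re-score.
function checkQuestionSet(questions) {
  const problems = [];
  if (questions.length !== 25) {
    problems.push(`expected 25 questions, found ${questions.length}`);
  }
  questions.forEach((q, index) => {
    const expectedId = `Q${String(index + 1).padStart(3, "0")}`;
    if (q.id !== expectedId) {
      problems.push(`position ${index + 1}: expected ${expectedId}, found ${q.id}`);
    }
    if (typeof q.prompt !== "string" || q.prompt.trim() === "") {
      problems.push(`${expectedId}: full prompt text is missing`);
    }
    if (!Object.prototype.hasOwnProperty.call(DEFAULT_MIX, q.type)) {
      problems.push(`${expectedId}: unknown question type "${q.type}"`);
    }
  });
  return problems;
}

// Counts questions per type so the mix can be compared against DEFAULT_MIX.
function countByType(questions) {
  const counts = {};
  for (const q of questions) {
    counts[q.type] = (counts[q.type] || 0) + 1;
  }
  return counts;
}
```

A check like this cannot tell a full prompt from a short focus label, so that judgement still belongs to the parent orchestrator or the scorer.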

### Step 3: Answer All Questions

Answer each question as if you are the target agent, with `agents/$ARGUMENTS.md` as the system prompt.

Use only the agent prompt as the answerer's authority. If the prompt does not cover a topic, acknowledge the gap instead of fabricating detail.

### Step 4: Judge Each Answer

The scorer reads `eval/rubric.md` and scores each question-answer pair independently:

- Accuracy: 0-4
- Completeness: 0-4
- Specificity: 0-4

Apply the rubric strictly:

- Accuracy 3+ requires specific codes, sections, standards, source families, or concrete authority references when the role calls for them.
- Accuracy 4 requires the most specific authority level the prompt makes available.
- Do not grant high accuracy for invented citations.
- Do not reward verbosity when the answer is vague or unsafe.
- Cite evidence from the answer and prompt; do not score from preference alone.

Compute weighted score per question:

```text
(Accuracy * 0.40) + (Completeness * 0.35) + (Specificity * 0.25)
```

Average across 25 questions and multiply by 25 for a 0-100 score. This is `score_pre_edit`.
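
The arithmetic above can be illustrated with a short sketch. The helper names `weightedScore` and `scoreExam` are hypothetical; the weights and the multiply-by-25 scaling come from the formula above. Because each criterion tops out at 4 and the weights sum to 1, a perfect answer has a weighted score of 4, and 4 × 25 gives the 100-point ceiling.

```javascript
// Sketch only: helper names are hypothetical. Each entry holds the three
// 0-4 criterion scores the judge assigned to one question.

const WEIGHTS = { accuracy: 0.4, completeness: 0.35, specificity: 0.25 };

// Weighted score for one question, on the same 0-4 scale as the criteria.
function weightedScore({ accuracy, completeness, specificity }) {
  return (
    accuracy * WEIGHTS.accuracy +
    completeness * WEIGHTS.completeness +
    specificity * WEIGHTS.specificity
  );
}

// Mean weighted score across all questions, scaled by 25 to reach 0-100.
function scoreExam(perQuestionScores) {
  const total = perQuestionScores.reduce((sum, s) => sum + weightedScore(s), 0);
  return (total / perQuestionScores.length) * 25;
}

// Example: 25 answers that each score accuracy 3, completeness 2, specificity 2.
// Per question: 3 * 0.40 + 2 * 0.35 + 2 * 0.25 = 2.4, and 2.4 * 25 = 60.
const exampleScores = Array.from({ length: 25 }, () => ({
  accuracy: 3,
  completeness: 2,
  specificity: 2,
}));
console.log(scoreExam(exampleScores).toFixed(1));
```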

The scorer output must include structured criterion detail:

```json
{
  "question_source": "generated | train-bank | validation-bank | holdout-bank | calibration",
  "score_pre_edit": 0,
  "criteria_summary": {
    "accuracy": {"score": 0, "evidence": "..."},
    "completeness": {"score": 0, "evidence": "..."},
    "specificity": {"score": 0, "evidence": "..."}
  },
  "weaknesses": [
    {
      "criterion": "accuracy | completeness | specificity",
      "question_ids": ["Q003"],
      "issue": "...",
      "proposed_prompt_change": "..."
    }
  ],
  "identity_to_preserve": [],
  "anti_patterns_to_avoid": []
}
```
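
A parent orchestrator might run a structural check on the scorer output before passing it to the editor. The following is a sketch under assumptions: `checkScorerOutput` and its error wording are hypothetical, and the check covers only the in-loop `question_source` values shown in the template, not the additional labels used for score-only retests.

```javascript
// Sketch only: a structural check for the scorer output template above.
// The function name and message wording are hypothetical.

const QUESTION_SOURCES = [
  "generated", "train-bank", "validation-bank", "holdout-bank", "calibration",
];
const CRITERIA = ["accuracy", "completeness", "specificity"];
// Matches Q001 through Q025 only.
const QUESTION_ID = /^Q0(0[1-9]|1\d|2[0-5])$/;

// Returns a list of problems; an empty list means the shape looks usable.
function checkScorerOutput(output) {
  const problems = [];
  if (!QUESTION_SOURCES.includes(output.question_source)) {
    problems.push("question_source is not one of the listed values");
  }
  const score = output.score_pre_edit;
  if (typeof score !== "number" || score < 0 || score > 100) {
    problems.push("score_pre_edit must be a number from 0 to 100");
  }
  for (const name of CRITERIA) {
    const entry = (output.criteria_summary || {})[name];
    if (!entry || typeof entry.score !== "number" || typeof entry.evidence !== "string") {
      problems.push(`criteria_summary.${name} needs a score and evidence`);
    }
  }
  for (const weakness of output.weaknesses || []) {
    if (!CRITERIA.includes(weakness.criterion)) {
      problems.push(`weakness has unknown criterion "${weakness.criterion}"`);
    }
    const ids = weakness.question_ids || [];
    if (ids.length === 0 || !ids.every((id) => QUESTION_ID.test(id))) {
      problems.push("weakness must cite question IDs between Q001 and Q025");
    }
  }
  for (const key of ["identity_to_preserve", "anti_patterns_to_avoid"]) {
    if (!Array.isArray(output[key])) problems.push(`${key} must be an array`);
  }
  return problems;
}
```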

### Step 5: Produce Improvement Brief

From the scores, identify the 2-3 weakest areas. The scorer gives the editor a narrow improvement brief containing:

- 2-4 representative low-scoring questions or paraphrased failure patterns.
- 2-4 targeted prompt changes with expected gain.
- The rubric criterion each change should improve.
- `identity_to_preserve`: role traits, sections, and differentiators that must survive editing.
- `anti_patterns_to_avoid`: generic broadening, duplicated boilerplate, bland executive-speak, or question-specific patching.

Do not reveal hidden holdout material or answer-key wording to the editor.

### Step 6: Edit The Agent

Edit `agents/$ARGUMENTS.md` to strengthen the weak areas identified in Step 5.

Editor constraints:

- Implement the highest-leverage 1-3 changes first.
- Prefer adding specific guidance to existing sections over rewriting or reorganizing entire sections.
- Preserve all items under `identity_to_preserve`.
- Avoid all listed anti-patterns.
- Do not edit eval files, baselines, docs, or other agents.

After editing, check line count:

```bash
wc -l < agents/$ARGUMENTS.md
```

If the count exceeds `LINE_CAP`:

1. Immediately run `git restore agents/$ARGUMENTS.md`.
2. Append a row to `eval/results.tsv` with status `capped`, `score_post_edit` as `N/A`, and `delta` as `N/A`.
3. Write a capped summary in the run log.
4. Run `git add eval/results.tsv && git commit -m "eval: $ARGUMENTS capped (exceeded line limit)"`.
5. Skip to the next iteration.

### Step 7: Re-Score

Re-answer the same 25 questions from Step 2 using the edited agent prompt.

Re-judge using the same rubric and scoring method. This is `score_post_edit`.

If the delta is small, the role is high-risk, or the scorer behavior looks suspect, run the adjudicator or meta-eval checks described in `eval/meta/`.

### Step 8: Log And Commit

Before any commit/revert decision:

1. Write run-log artifacts for the iteration, including:
   - `manifest.json`
   - `questions.md`
   - `questions.json` when practical
   - `scorer-output-pre.json`
   - `editor-brief.md`
   - `scorer-output-post.json`
   - `summary.md`
2. Verify `questions.md` contains all 25 complete question prompts. If not, stop and repair the artifact before editing or scoring.
3. Ensure the manifest records exact model IDs when available, git state, file hashes, rubric hash, baseline hash, question source, question artifact paths, line cap, status, and calibration status.
4. Append a tab-separated row to `eval/results.tsv`:

```text
{iteration}\t{$ARGUMENTS}\t{score_pre_edit}\t{score_post_edit}\t{delta}\t{status}\t{weak_areas}\t{description}
```

The existing TSV schema remains unchanged. Put the run-log directory path in `description` when one exists.

Commit decision:

- If `score_post_edit > score_pre_edit`:

  ```bash
  git add agents/$ARGUMENTS.md eval/results.tsv
  git commit -m "eval: $ARGUMENTS {score_pre_edit}->{score_post_edit} (+{delta})"
  ```

- If `score_post_edit <= score_pre_edit`:

  ```bash
  git restore agents/$ARGUMENTS.md
  git add eval/results.tsv
  git commit -m "eval: $ARGUMENTS reverted ({delta})"
  ```

Do not commit raw `eval/run-logs/` artifacts unless a human explicitly asks to promote a run log for review.
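
The commit decision and the results row above can be sketched as follows. The helper names `decideIteration` and `resultsRow` are hypothetical. The status labels `improved` and `reverted` are assumptions borrowed from the completion summary, since only `capped` is named explicitly as a status value, and rounding the delta to one decimal place is also an assumption. Note that a tie is treated as no improvement, so the edit is reverted.

```javascript
// Sketch only: helper names, the "improved"/"reverted" labels, and the
// one-decimal rounding are assumptions, not defined by the package.

// Returns the status plus the git commands the parent would run.
function decideIteration(slug, scorePre, scorePost) {
  const delta = Math.round((scorePost - scorePre) * 10) / 10;
  if (scorePost > scorePre) {
    return {
      status: "improved",
      delta,
      commands: [
        `git add agents/${slug}.md eval/results.tsv`,
        `git commit -m "eval: ${slug} ${scorePre}->${scorePost} (+${delta})"`,
      ],
    };
  }
  // score_post_edit <= score_pre_edit: restore the agent file, keep the log row.
  return {
    status: "reverted",
    delta,
    commands: [
      `git restore agents/${slug}.md`,
      "git add eval/results.tsv",
      `git commit -m "eval: ${slug} reverted (${delta})"`,
    ],
  };
}

// Builds one row in the column order of the TSV schema above. Tabs and
// newlines inside a field would break the row, so they collapse to a space.
function resultsRow(fields) {
  const clean = (value) => String(value).replace(/[\t\r\n]+/g, " ");
  return [
    fields.iteration,
    fields.agent,
    fields.scorePre,
    fields.scorePost,
    fields.delta,
    fields.status,
    fields.weakAreas,
    fields.description,
  ].map(clean).join("\t");
}
```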

---

## Score-Only And Before/After Runs

For score-only baselines, multi-agent scorecards, or before/after experiments:

1. Create a run directory under `eval/run-logs/<timestamp>-<experiment-name>/`.
2. For each agent, write the full 25-question set before answering:
   - `baseline/<batch-or-agent>/questions.md`
   - `baseline/<batch-or-agent>/questions.json` when practical
3. Write baseline answers and scores separately from the question artifact.
4. Give editors only the improvement brief, not hidden holdout material or answer keys.
5. During retest, first load the baseline `questions.md`.
6. If all 25 full prompts are present, retest with exactly those prompts and mark `question_source=baseline-exact`.
7. If only focus rows, weak areas, or paraphrases are present, do not reconstruct them as exact. Generate comparable questions if useful and mark `question_source=fresh-comparable`.
8. The final scorecard must include `question_source` for every row.

This rule is required because before/after deltas are only defensible when the post-edit score uses the same questions as the pre-edit score.
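
The labeling rule in steps 6 and 7 can be sketched as a small decision function. This is an illustration, not the package's implementation: `retestQuestionSource` is a hypothetical name, and it assumes each stored question carries a `fullPrompt` flag set when the artifact was written, because code alone cannot reliably tell a full prompt from a focus label or paraphrase.

```javascript
// Sketch only: the function name and the `fullPrompt` flag are assumptions.
// Each stored question is assumed to look like:
//   { id: "Q001", prompt: "...", fullPrompt: true }

// Returns the label to record in the scorecard for a retest.
function retestQuestionSource(baselineQuestions) {
  const allFullPromptsPresent =
    Array.isArray(baselineQuestions) &&
    baselineQuestions.length === 25 &&
    baselineQuestions.every(
      (q) =>
        q.fullPrompt === true &&
        typeof q.prompt === "string" &&
        q.prompt.trim() !== ""
    );
  // Anything short of 25 complete prompts is never reconstructed as exact.
  return allFullPromptsPresent ? "baseline-exact" : "fresh-comparable";
}
```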

---

## Completion

After 5 iterations, or if interrupted, print:

```text
=== Eval Complete: $ARGUMENTS ===
Iterations: {N}
Results: {improved} improved, {reverted} reverted, {capped} capped
Starting score: {iteration 1 score_pre_edit}
Retained score on disk: {score_post_edit from most recent improved iteration, or starting score if none improved}
Last attempted score: {score_post_edit from the last non-capped iteration, or N/A}
Results log: eval/results.tsv
Run log: eval/run-logs/<timestamp>-<agent-slug>/
Models: parent={id}, scorer={id}, editor={id}, adjudicator={id or none}
Calibration: {not-run | passed | warning | failed}
```

Remember: retained score reflects the actual agent prompt on disk. Last attempted score may include a reverted edit.
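
The difference between the retained score and the last attempted score can be made concrete with a sketch. The function name `summarize` and the iteration record shape are hypothetical; the selection rules follow the summary template above.

```javascript
// Sketch only: `summarize` and the record shape are assumptions. Each record:
//   { status: "improved" | "reverted" | "capped", scorePre, scorePost }
// with scorePost set to null for capped iterations.

function summarize(iterations) {
  const count = (status) =>
    iterations.filter((it) => it.status === status).length;
  const startingScore = iterations.length > 0 ? iterations[0].scorePre : null;

  // Retained score: post-edit score of the most recent kept edit, falling
  // back to the starting score when no iteration improved.
  const improved = iterations.filter((it) => it.status === "improved");
  const retainedScore =
    improved.length > 0 ? improved[improved.length - 1].scorePost : startingScore;

  // Last attempted score: post-edit score of the last non-capped iteration,
  // which may belong to an edit that was reverted.
  const attempted = iterations.filter((it) => it.status !== "capped");
  const lastAttemptedScore =
    attempted.length > 0 ? attempted[attempted.length - 1].scorePost : "N/A";

  return {
    iterations: iterations.length,
    improved: count("improved"),
    reverted: count("reverted"),
    capped: count("capped"),
    startingScore,
    retainedScore,
    lastAttemptedScore,
  };
}
```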

---

## Scaling Pattern

When improving many agents:

- Run one agent per branch or worktree to avoid file and commit collisions.
- Keep scorers and adjudicators read-only.
- Keep editors single-file.
- Let the parent orchestrator own git writes and `eval/results.tsv`.
- Stop early when an agent clears the target score or recent deltas are too small to justify another pass.
- Use role baselines for every agent; missing baselines are now a setup failure.

For Codex and Claude Code under newer models:

- Scorer/judge: strongest available reasoning model.
- Editor: faster strong model.
- Parent: reliable tool-using orchestrator.
- Adjudicator: different strong model family when available.

package/AGENTS.md
ADDED
@@ -0,0 +1,26 @@
# Healthcare Agents Repository Instructions

## Repository Map

- Agent prompts live in `agents/*.md`.
- The simple self-improvement kit lives in `.claude/commands/eval.md`, `eval/rubric.md`, `eval/results.tsv`, and `eval/role-baselines/`.
- The old Python eval harness has been removed. The active eval path is the simple self-improvement kit above.

## Git Workflow

- Do not push directly to `main`.
- For requested edits, create a short-lived branch, commit there, push the branch, open a PR, and merge the PR with `gh pr merge`.
- Do not merge the feature branch into local `main` before opening or merging the PR. That creates duplicate local merge commits and makes `main` appear ahead/behind after GitHub merges the PR.
- After a PR is merged, run `git fetch origin --prune` and align local `main` to `origin/main` before continuing work.
- For docs-only or metadata-only changes, the streamlined path is: branch -> commit -> push branch -> `gh pr create` -> `gh pr merge --merge --delete-branch` -> sync local `main`.

## Self-Improvement Loop

- When asked to run the healthcare self-improvement loop for an agent, first read `.claude/commands/eval.md` and execute that procedure as a normal task, substituting `$ARGUMENTS` with the requested agent slug.
- Treat `.claude/commands/eval.md` as the canonical workflow for both Claude Code and Codex.
- If the runtime supports native subagents or model specialization, prefer a strongest scorer/judge plus a faster editor, with the parent agent owning git writes and `eval/results.tsv`.
- Avoid recursive CLI invocation when native subagents are available.
- Never modify `eval/rubric.md` or any file under `eval/role-baselines/`.
- Never modify `eval/results.tsv` except to append rows.
- Preserve the agent's distinctive role identity; do not flatten prompts into generic "best practices" boilerplate.
- During a normal eval run, only edit the requested `agents/<slug>.md`, append `eval/results.tsv`, and write local ignored artifacts under `eval/run-logs/`.

package/CHANGELOG.md
ADDED
|
@@ -0,0 +1,105 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to this project will be documented in this file.
|
|
4
|
+
|
|
5
|
+
## Unreleased
|
|
6
|
+
|
|
7
|
+
### Added
|
|
8
|
+
|
|
9
|
+
- Added `agents/registry.json` with discovery, routing, handoff, trust, and provenance metadata for all 51 agents.
|
|
10
|
+
- Added CLI commands for `list`, `show`, `choose`, `prompt`, and `doctor`, plus slug-validated single-agent installs.
|
|
11
|
+
- Added CI gates for installer syntax, agent linting/audit, package dry-run, CLI smoke tests, and installer dry-run.
|
|
12
|
+
- Added public eval scorecard generation under `docs/eval/scorecard.md` and `docs/eval/scorecard.json`.
|
|
13
|
+
- Added `docs/trust-and-safety.md` for scope, PHI, human escalation, source freshness, and eval limitations.
|
|
14
|
+
|
|
15
|
+
### Changed
|
|
16
|
+
|
|
17
|
+
- Improved installer dry-run output to show exact planned writes and avoid dry-run directory creation.
|
|
18
|
+
- Added npm package scripts and Node engine metadata for distribution readiness.
|
|
19
|
+
|
|
20
|
+
## [1.2.0] - 2026-05-05

Usability release for task-based agent selection, output modes, and handoffs.

### Added

- Added `docs/usage/agent-selection-guide.md` with task-to-agent routing for common healthcare administration jobs.
- Added `docs/usage/starter-prompts.md` with copy-ready prompts across all 10 domains.
- Added `docs/usage/handoff-map.md` for cross-functional workflows and human escalation owners.
- Added release-only usability smoke scenarios in `docs/eval/usability-release-check.md`.

### Changed

- Added role-tailored `Best Inputs`, `Output Modes`, and `Collaboration & Handoffs` sections to all 51 agents.
- Updated README discovery flow, installer-managed Codex guidance, contribution template, lint checks, and audit scoring for the new usability contract.
- Bumped package and installer metadata to v1.2.0 for GitHub-backed installs.

See [docs/release-notes/2026-05-05-usability-release.md](docs/release-notes/2026-05-05-usability-release.md) for full details.

## [1.1.2] - 2026-04-23

Documentation correction for npm-backed install commands.

### Changed

- Updated README and INSTALL examples to use `npx --yes github:ajhcs/healthcare-agents` because the package has not yet been published to npm from this environment.
- Bumped installer/package metadata to v1.1.2 so GitHub-backed npx installs report the latest compatibility release.

## [1.1.1] - 2026-04-23

Installer and documentation compatibility release.

### Changed

- Normalized all agent frontmatter `name` fields to lowercase hyphen slugs for Claude Code and OpenCode compatibility.
- Added `display_name` frontmatter to preserve human-readable agent names.
- Expanded the installer with aliases for Codex App, Claude Desktop, Claude Cowork, OpenCode, and portable SKILL.md targets.
- Generated valid per-agent `SKILL.md` folders for Claude Skills, OpenCode skills, and the open `.agents/skills` convention.
- Updated Codex install behavior to write a managed `~/.codex/AGENTS.md` discovery block.
- Refreshed `README.md` and `INSTALL.md` for v1.1.x, current eval status, and cross-tool install paths.
- Updated the self-improvement kit installer to copy all role baselines, not only the medical coding baseline.

## [1.1.0] - 2026-04-23

Agent-stack optimization release for the full 51-agent healthcare administration library.

### Changed

- Improved all 51 healthcare agent prompts through same-question before/after eval passes.
- Raised the first 15 evaluated agents from an 85.0 average score to 93.9.
- Raised the remaining 36 evaluated agents from an 85.11 average score to 95.50.
- Added more role-specific mechanics, compliance boundaries, source hierarchies, workflow handoffs, and deliverable requirements across clinical, operations, payer, quality, health IT, population health, pharmacy, revenue, and strategy agents.
- Reworked the eval workflow for modern SOTA model routing with parent orchestrator, scorer/judge, editor, and optional adjudicator roles.
- Required reusable full-question artifacts for before/after evals so score deltas can be audited against exact Q001-Q025 prompts.

### Added

- Role baselines for all 51 installable healthcare agents.
- Eval scorer guidance in `docs/eval/exam-architect-playbook.md`.
- Current model-routing guidance in `docs/eval/model-tuning.md`.
- Meta-eval checks for judge calibration, scorer consistency, and prompt-overfitting risk.
- Local run-log documentation for retained questions, scorer outputs, editor briefs, and summaries.

### Removed

- Retired the unused Python/DSPy eval harness and related schema, rubric, test, and runner files.
- Removed the standalone eval exam architect agent prompt in favor of the documented eval skill/playbook workflow.

See [docs/release-notes/2026-04-23-agent-stack-optimization.md](docs/release-notes/2026-04-23-agent-stack-optimization.md) for full details.

## [1.0.0] - 2026-04-09

Initial release of the healthcare-agents repository.

### Added

- 51 healthcare administration agent system prompts across 10 categories: Revenue, Clinical, Quality, Payer, Operations, Health IT, Population Health, Pharmacy, Strategy, and Emergency Preparedness.
- Karpathy-style automated eval loop (`/eval` command) with frozen rubric (Accuracy 0.40, Completeness 0.35, Specificity 0.25).
- Split-role scoring architecture: strong judge model generates exams and critiques, fast editor model patches prompts, parent orchestrator owns git writes.
- Identity-preservation constraints to prevent prompt drift during automated improvement.
- Git-ratcheted commit strategy: improvements commit atomically, regressions revert automatically.
- Append-only results log at `eval/results.tsv`.
- 10 agents improved to 80+ scores through the eval loop, including Revenue Medical Coding Specialist (82.15), Revenue Finance Manager (81.55), and Revenue 340B Program Manager (81.20).
- Cross-tool self-improvement kit and agent quality review infrastructure.

See [docs/release-notes/2026-04-09-eval-loop-milestone.md](docs/release-notes/2026-04-09-eval-loop-milestone.md) for full details.
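
As a rough illustration of the 1.0.0 entries on the frozen rubric and the git-ratcheted commit strategy, the sketch below combines per-dimension scores using the stated weights (Accuracy 0.40, Completeness 0.35, Specificity 0.25) and then decides whether a prompt edit would be kept. Only the weights come from this changelog. The function names, the 0-100 score scale, and the choice to treat a tie as a revert are assumptions made for the example, and the real `/eval` command may work differently.

```python
# Weights as listed in the 1.0.0 changelog entry. Everything else in this
# sketch (names, 0-100 scale, tie handling) is hypothetical.
RUBRIC_WEIGHTS = {"accuracy": 0.40, "completeness": 0.35, "specificity": 0.25}


def weighted_score(scores):
    """Combine per-dimension scores (assumed 0-100) into one weighted score."""
    missing = set(RUBRIC_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[dim] * scores[dim] for dim in RUBRIC_WEIGHTS)


def ratchet_decision(before, after):
    """Keep an edit only if the score strictly improved (assumed tie rule)."""
    return "commit" if after > before else "revert"


if __name__ == "__main__":
    baseline = weighted_score({"accuracy": 80, "completeness": 78, "specificity": 75})
    candidate = weighted_score({"accuracy": 84, "completeness": 80, "specificity": 79})
    print(round(baseline, 2), round(candidate, 2), ratchet_decision(baseline, candidate))
```

Because the weights sum to 1.0, the combined score stays on the same scale as the per-dimension scores, which is what makes a simple before/after comparison meaningful.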