@trohde/earos 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (135)
  1. package/README.md +156 -0
  2. package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
  3. package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
  4. package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
  5. package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
  6. package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
  7. package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
  8. package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
  9. package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
  10. package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
  11. package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
  12. package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
  13. package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
  14. package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
  15. package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
  16. package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
  17. package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
  18. package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
  19. package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
  20. package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
  21. package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
  22. package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
  23. package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
  24. package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
  25. package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
  26. package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
  27. package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
  28. package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
  29. package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
  30. package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
  31. package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
  32. package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
  33. package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
  34. package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
  35. package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
  36. package/assets/init/.claude/CLAUDE.md +4 -0
  37. package/assets/init/AGENTS.md +293 -0
  38. package/assets/init/CLAUDE.md +635 -0
  39. package/assets/init/README.md +507 -0
  40. package/assets/init/calibration/gold-set/.gitkeep +0 -0
  41. package/assets/init/calibration/results/.gitkeep +0 -0
  42. package/assets/init/core/core-meta-rubric.yaml +643 -0
  43. package/assets/init/docs/consistency-report.md +325 -0
  44. package/assets/init/docs/getting-started.md +194 -0
  45. package/assets/init/docs/profile-authoring-guide.md +51 -0
  46. package/assets/init/docs/terminology.md +126 -0
  47. package/assets/init/earos.manifest.yaml +104 -0
  48. package/assets/init/evaluations/.gitkeep +0 -0
  49. package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
  50. package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
  51. package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
  52. package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
  53. package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
  54. package/assets/init/overlays/data-governance.yaml +94 -0
  55. package/assets/init/overlays/regulatory.yaml +154 -0
  56. package/assets/init/overlays/security.yaml +92 -0
  57. package/assets/init/profiles/adr.yaml +225 -0
  58. package/assets/init/profiles/capability-map.yaml +223 -0
  59. package/assets/init/profiles/reference-architecture.yaml +426 -0
  60. package/assets/init/profiles/roadmap.yaml +205 -0
  61. package/assets/init/profiles/solution-architecture.yaml +227 -0
  62. package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
  63. package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
  64. package/assets/init/research/reference-architecture-research.md +751 -0
  65. package/assets/init/standard/EAROS.md +1426 -0
  66. package/assets/init/standard/schemas/artifact.schema.json +1295 -0
  67. package/assets/init/standard/schemas/artifact.uischema.json +65 -0
  68. package/assets/init/standard/schemas/evaluation.schema.json +284 -0
  69. package/assets/init/standard/schemas/rubric.schema.json +383 -0
  70. package/assets/init/templates/evaluation-record.template.yaml +58 -0
  71. package/assets/init/templates/new-profile.template.yaml +65 -0
  72. package/bin.js +188 -0
  73. package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
  74. package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
  75. package/dist/assets/arc-CyDBhtDM.js +1 -0
  76. package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
  77. package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
  78. package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
  79. package/dist/assets/channel-CiySTNoJ.js +1 -0
  80. package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
  81. package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
  82. package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
  83. package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
  84. package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
  85. package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
  86. package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
  87. package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
  88. package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
  89. package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
  90. package/dist/assets/clone-ylgRbd3D.js +1 -0
  91. package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
  92. package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
  93. package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
  94. package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
  95. package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
  96. package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
  97. package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
  98. package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
  99. package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
  100. package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
  101. package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
  102. package/dist/assets/graph-DLQn37b-.js +1 -0
  103. package/dist/assets/index-BFFITMT8.js +650 -0
  104. package/dist/assets/index-H7f6VTz1.css +1 -0
  105. package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
  106. package/dist/assets/init-Gi6I4Gst.js +1 -0
  107. package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
  108. package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
  109. package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
  110. package/dist/assets/katex-B1X10hvy.js +261 -0
  111. package/dist/assets/layout-C0dvb42R.js +1 -0
  112. package/dist/assets/linear-j4a8mGj7.js +1 -0
  113. package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
  114. package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
  115. package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
  116. package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
  117. package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
  118. package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
  119. package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
  120. package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
  121. package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
  122. package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
  123. package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
  124. package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
  125. package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
  126. package/dist/index.html +23 -0
  127. package/export-docx.js +1583 -0
  128. package/init.js +353 -0
  129. package/manifest-cli.mjs +207 -0
  130. package/package.json +83 -0
  131. package/schemas/artifact.schema.json +1295 -0
  132. package/schemas/artifact.uischema.json +65 -0
  133. package/schemas/evaluation.schema.json +284 -0
  134. package/schemas/rubric.schema.json +383 -0
  135. package/serve.js +238 -0
`package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md`
@@ -0,0 +1,281 @@
# Scoring Protocol — EAROS Assessment

This file contains the detailed evidence and scoring guidance for EAROS evaluators. Read it before Step 2 (content extraction), and return to it any time scoring feels ambiguous.

---

## The RULERS Protocol — Evidence-Anchored Scoring

RULERS names the core principle: **every score must be anchored to a retrievable unit of evidence from the artifact**. This prevents the single most common failure in architecture assessment — scoring from overall impression rather than from specific evidence.

### Why it matters

An agent (or human) that scores from impression tends to:
- Over-score well-written artifacts (fluent prose feels like good architecture)
- Under-score technical artifacts (dense notation looks like poor communication)
- Produce unreproducible scores (two reviewers get different results on the same artifact)

RULERS scoring is reproducible because it ties every score to a specific excerpt. If two reviewers disagree, they can compare evidence anchors rather than arguing from impressions.

### The three evidence classes

Before assigning any score, classify the evidence you found:

| Class | Definition | Credibility |
|-------|------------|-------------|
| `observed` | Directly supported by a quote or excerpt from the artifact | Highest — the artifact says it |
| `inferred` | Reasonable interpretation not directly stated, but logically implied | Medium — the artifact implies it |
| `external` | Judgment based on a standard, policy, or pattern outside the artifact | Lowest — you brought the knowledge |

**The discipline:** If you cannot find at least `inferred` evidence, you cannot score the criterion above 1. A score of 0 or 1 with `evidence_class: none` is legitimate — it means the artifact does not address the criterion.

### Evidence extraction steps (for each criterion)

1. Read the criterion's `required_evidence` list in the rubric YAML
2. Search the artifact specifically for each piece of required evidence
3. Record:
   - `evidence_anchor`: where in the artifact (section heading, page number, diagram label, appendix name)
   - `excerpt`: direct quote or very close paraphrase — must be retrievable, not invented
   - `evidence_class`: observed / inferred / external / none

---

## How to Use scoring_guide and decision_tree

The rubric YAML for each criterion contains two fields that tell you exactly what each score means.

### scoring_guide

The `scoring_guide` gives one-sentence level descriptors for scores 0–4. These are the authoritative definitions. Examples from the core rubric:

**STK-01 (stakeholder identification):**
```
"0": Absent or contradicted
"1": Implied only
"2": Explicit but incomplete
"3": Explicit and mostly complete
"4": Explicit, complete, and used consistently
```

**SCP-01 (scope and boundaries):**
```
"0": No scope or boundary
"1": Scope is ambiguous
"2": Basic scope exists but is incomplete
"3": Scope and boundaries are clear
"4": Scope and boundaries are clear, tested, and internally consistent
```

Use these as your first reference. If the artifact clearly matches one level, assign that score.

### decision_tree

The `decision_tree` translates the scoring guide into a sequence of observable conditions. Use it when the scoring guide alone doesn't resolve the case. Example from SCP-01:

```
IF no scope section THEN score 0.
IF scope exists but no exclusions listed THEN max score 2.
IF assumptions not stated THEN max score 3.
```

The decision tree is a ceiling, not a floor. It tells you the maximum score given observable conditions.
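To make the ceiling behaviour concrete, here is a minimal Python sketch of the SCP-01 tree above (the boolean flag names are illustrative, not rubric fields):

```python
def scp01_max_score(has_scope_section: bool,
                    lists_exclusions: bool,
                    states_assumptions: bool) -> int:
    """Apply the SCP-01 decision tree as a ceiling on the score.

    Each unmet condition caps the maximum; the actual score is then
    justified against the scoring_guide within that ceiling.
    """
    if not has_scope_section:
        return 0
    if not lists_exclusions:
        return 2
    if not states_assumptions:
        return 3
    return 4  # no ceiling triggered

# Scope section with exclusions but no stated assumptions: ceiling is 3
print(scp01_max_score(True, True, False))  # 3
```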

**Working together:** Match the artifact against the scoring guide first. If the case is ambiguous, walk the decision tree as a tiebreaker.

---

## Examples of GOOD Scoring

A good criterion result is specific, grounded, and honest about evidence quality.

### Example 1 — Observed evidence, high confidence

```yaml
criterion_id: SCP-01
score: 3
evidence_class: observed
confidence: high
evidence_anchor: "Section 1.2 Scope"
excerpt: >
  "In scope: Payments service, Notification service, upstream Banking Core API.
  Out of scope: Authentication (handled by IAM platform — see IAM-2024-001),
  analytics pipeline, reporting layer. Assumptions: Banking Core API versioned
  contract stable for 12 months."
rationale: >
  Scope is explicitly bounded with named in-scope and out-of-scope elements,
  and a key assumption is stated with a consequence. Assumptions are present,
  but no in-scope/out-of-scope boundary diagram or consistency test is
  provided, which would be required for a score of 4.
```

Why this is good:
- The excerpt is a direct quote from a specific section
- The rationale explains both why the score is 3 and why it is not 4
- The evidence class is correctly `observed` (directly stated)

### Example 2 — Inferred evidence, medium confidence

```yaml
criterion_id: TRC-01
score: 2
evidence_class: inferred
confidence: medium
evidence_anchor: "Section 2.1 Drivers and Section 4 Architecture Decisions"
excerpt: >
  "Section 2.1 lists availability and time-to-market as drivers. Section 4
  adopts an event-driven pattern described as 'enabling independent scaling'."
rationale: >
  The connection between the availability driver and the event-driven decision
  is implied by the phrase 'enabling independent scaling', but no explicit
  traceability link is made. The decision does not reference the driver ID
  or requirement number. This is inferred, not observed — the link exists,
  but the artifact doesn't draw it explicitly. Score 2 (partial) is appropriate;
  score 3 would require explicit driver-to-decision links.
```

Why this is good:
- Honestly classified as `inferred` rather than `observed`
- The rationale explains the gap clearly and cites the decision tree logic
- Medium confidence correctly reflects the inferential nature of the evidence

### Example 3 — Absent evidence, scoring 0

```yaml
criterion_id: RAT-01
score: 0
evidence_class: none
confidence: high
evidence_anchor: "Entire document reviewed"
excerpt: "No risks, assumptions, constraints, or trade-offs section found."
rationale: >
  The artifact contains no RAID log, no risk section, and no trade-off discussion.
  The word 'risk' appears once in passing ('we need to consider operational risk')
  but with no specific identification, owner, or mitigation. This does not meet
  the scoring_guide level 1 threshold ('mentioned superficially'). Score 0.
```

Why this is good:
- Confidence is high because the absence is certain
- The one occurrence of "risk" is acknowledged but correctly scored as below level 1
- The evidence anchor notes that the whole document was reviewed

---

## Examples of BAD Scoring

### Bad Example 1 — Impression-based scoring

```yaml
criterion_id: CMP-01
score: 3
evidence_class: observed
confidence: high
evidence_anchor: "Section 5"
excerpt: "The architecture appears to be well-aligned with enterprise standards."
rationale: "The document seems to address compliance comprehensively."
```

Problems:
- "appears to be well-aligned" is the evaluator's impression, not a quote from the artifact
- The excerpt is a vague paraphrase, not specific evidence
- No specific standard or control is named
- `evidence_class: observed` is wrong — nothing is directly quoted
- This scores CMP-01 as 3 without evidence of any control mapping; it should score 1 or 2 at most

### Bad Example 2 — Evidence class mismatch

```yaml
criterion_id: STK-01
score: 4
evidence_class: observed
confidence: high
evidence_anchor: "Section 1"
excerpt: "Section 1 contains a list of stakeholders."
rationale: "Stakeholders are identified and the document is clearly structured for them."
```

Problems:
- Score 4 requires stakeholders to be explicit, complete, and used consistently throughout
- "clearly structured for them" is an inference, not an observation — the evidence class should be `inferred`
- The excerpt doesn't quote which stakeholders are listed or how they're used consistently
- A correct score here is probably 2 or 3, depending on the actual content

### Bad Example 3 — Conflating confidence with score

```yaml
criterion_id: ACT-01
score: 1
evidence_class: observed
confidence: low
rationale: "I'm not confident about this score, so I've scored it lower to be safe."
```

Problems:
- Confidence must not modify the score — this is an explicit rule in EAROS
- If the evidence supports score 3 but you're uncertain about interpretation, score 3 with `confidence: low`
- Low confidence tells the human reviewer to apply independent judgment; it doesn't change the number

---

## Challenge Pass Methodology

The challenge pass is Step 6. It exists because agents (and humans) are systematically biased in specific ways:

**Common over-scoring patterns:**
- Rewarding presence over quality (something is mentioned → score 2, regardless of depth)
- Treating fluent prose as evidence of good architecture
- Accepting assertions without checking for evidence backing
- Scoring the intent rather than the artifact as written

**Common under-scoring patterns:**
- Penalizing technically dense artifacts that are well-reasoned
- Requiring explicit labels for things that are clearly present
- Applying a standard that was not in the rubric

### How to run the challenge pass

For each of your three highest and three lowest scores:

1. **Higher challenge:** "What in the artifact would justify scoring this one level higher? Does that evidence exist?"
   - If yes and you missed it → revise up, flag as `revised: true`
   - If no → score confirmed

2. **Lower challenge:** "What would need to be true for this score to be one level lower? Is that condition actually met?"
   - If yes and you overclaimed → revise down, flag as `revised: true`
   - If no → score confirmed

3. **Evidence class check:** "Is my evidence classification honest?"
   - `observed` should be a direct quote or a highly specific reference
   - If you're paraphrasing liberally → downgrade to `inferred`
   - If you're applying outside knowledge → classify as `external`

Mark any score that changes as `revised: true` in the output. This is not a sign of weakness — it's the protocol working correctly.
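Selecting the challenge candidates can be done mechanically. A sketch, assuming criterion results are held in a plain score map (the dict layout is illustrative, not the EAROS output schema):

```python
def challenge_candidates(results: dict, n: int = 3):
    """Pick the n highest- and n lowest-scored criteria for the challenge pass.

    `results` maps criterion_id -> score (0-4).
    Returns (highest, lowest) as lists of criterion IDs.
    """
    ranked = sorted(results, key=results.get, reverse=True)
    return ranked[:n], ranked[-n:]

# Hypothetical assessment results, not real rubric output:
scores = {"STK-01": 4, "SCP-01": 3, "TRC-01": 2, "RAT-01": 0,
          "CON-01": 1, "CMP-01": 3, "ACT-01": 2}
high, low = challenge_candidates(scores)  # challenge these six first
```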

---

## Cross-Reference Validation

Cross-reference validation (Step 4) finds inconsistencies that reduce the reliability of the artifact and directly affect `CON-01` (the internal consistency criterion in the core rubric).

### What to check

**Component naming:** Does the context diagram call it "Payment Service" while the sequence diagram calls it "PaymentProcessor" and the deployment view shows "payment-svc"? These may or may not be the same thing — the artifact should make this explicit.

**Interface consistency:** Does the API spec show a `POST /payments/initiate` endpoint, but the sequence diagram shows `createPayment()` with a different parameter shape? Either one of them is wrong, or they are different things.

**Scope boundary drift:** Does the context diagram show the mobile app as out of scope, but the data flow section describes data flowing through the mobile app? The scope boundary shifted between sections without notice.

**Narrative-diagram agreement:** Does the text say "all traffic passes through the API gateway" but the deployment diagram shows a direct database connection from a service?
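As an illustration of the naming check, a small Python sketch that flags names appearing in only one view (the view names and contents are hypothetical; in practice they come from parsing the artifact's diagrams and narrative):

```python
# Component names as they appear in each view of the artifact.
views = {
    "context_diagram": {"Payment Service", "Notification Service"},
    "sequence_diagram": {"PaymentProcessor", "Notification Service"},
    "deployment_view": {"payment-svc", "notification-svc"},
}

all_names = set().union(*views.values())
suspects = [
    name for name in sorted(all_names)
    if sum(name in components for components in views.values()) == 1
]
# Each suspect appears in only one view: either naming drift
# (record it as CON-01 evidence) or a genuinely view-specific element.
```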

### How to score inconsistencies

Each inconsistency found is evidence for `CON-01`. Use the `decision_tree` for CON-01:

```
IF direct contradictions between sections THEN score 0.
IF same entity has different names or interfaces in multiple views THEN score 1.
IF minor naming inconsistencies only THEN score 2.
IF terminology and interfaces consistent across main views THEN score 3.
IF glossary present AND entities reconciled AND cross-references verified THEN score 4.
```

Note each specific inconsistency as a separate evidence item in the CON-01 result.
`package/assets/init/.agents/skills/earos-calibrate/SKILL.md`
@@ -0,0 +1,153 @@
---
name: earos-calibrate
description: "Run EAROS calibration exercises to validate rubric reliability before production use. Use this skill whenever someone wants to calibrate a rubric, validate inter-rater reliability, compare scores against gold-standard artifacts, measure scoring consistency, or says \"calibrate this rubric\", \"run calibration\", \"check if the rubric is reliable\", \"compare my scores to the gold set\", \"test this profile against examples\", \"is this rubric ready for production\", \"what is our kappa\", \"measure agreement between reviewers\", \"validate a new profile\", or \"how well does the rubric score consistently\". Calibration is required before any new profile can move from draft to candidate status."
---

# EAROS Calibrate Skill

You are running an EAROS calibration exercise. Calibration validates that a rubric produces consistent, reliable scores across reviewers and artifacts before it enters a governance process.

**Why calibration matters:** A rubric that produces inconsistent scores is not a quality gate — it is noise. Without calibration, two reviewers applying the same rubric will score the same artifact differently, governance decisions will be arbitrary, and the framework loses credibility. Calibration makes the rubric trustworthy by measuring and improving its reproducibility.

**Target reliability metrics:**
- Binary agreement (exact match): > 95%
- Ordinal Cohen's κ: > 0.70 for well-defined criteria; > 0.50 for subjective criteria
- Spearman ρ (overall score correlation across artifacts): > 0.80

**Critical:** Do NOT look at gold-set benchmark scores until after completing your independent assessment. True calibration requires independent scoring first.

---

## Step 0 — Load Calibration Inputs

Read these files:
1. `core/core-meta-rubric.yaml`
2. The profile or overlay being calibrated (ask if not specified; scan `profiles/` and `overlays/`)
3. `calibration/gold-set/` — scan for existing reference artifacts and their benchmark scores
4. `calibration/results/` — scan for prior calibration runs (to understand trends)

Ask the user:
- Which rubric/profile is being calibrated? (if not specified)
- Are there artifacts to calibrate against, or should I use the gold-set?
- Solo calibration (agent vs. gold-set) or multi-evaluator reconciliation?

---

## Step 1 — Artifact Inventory

List the available calibration artifacts. For each:
- Artifact ID, title, type
- Expected quality category: strong (≥3.2), adequate (2.4–3.19), weak (<2.4), or borderline
- Known benchmark scores (if prior calibration exists)
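The quality thresholds above can be expressed directly; a minimal sketch (borderline cases remain a judgment call near the cut-offs, so they are not computed here):

```python
def quality_category(overall: float) -> str:
    """Map an overall 0-4 score to the expected quality category."""
    if overall >= 3.2:
        return "strong"
    if overall >= 2.4:
        return "adequate"
    return "weak"

# quality_category(3.4) -> "strong"; quality_category(2.1) -> "weak"
```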

If no gold-set artifacts exist, stop and tell the user:
> "Calibration requires at least 3 artifacts: 1 strong (should score ≥3.2), 1 weak (should score <2.4), and 1 ambiguous (borderline case). The spread across quality levels is important — calibration against only strong artifacts doesn't test whether the rubric correctly identifies weaknesses. Please provide these artifacts or their paths."

---

## Step 2 — Independent Scoring

For each calibration artifact, run a full EAROS assessment using the `earos-assess` skill protocol:
- Follow the full 8-step DAG
- Score every criterion independently **before** looking at gold-set benchmark scores
- Record evidence anchors, evidence classes, confidence, and rationale for every score

**This step cannot be skipped or abbreviated.** Independent scoring is the entire point of calibration. If you score after seeing the benchmark, you measure nothing.

> **For the full assessment protocol**, see `.claude/skills/earos-assess/SKILL.md`.

---

## Step 3 — Score Comparison

After completing independent scoring for all artifacts, compare against the gold-set:

```yaml
artifact_id: [ID]
criterion_id: [ID]
gold_score: [benchmark]
agent_score: [your score]
delta: [gold - agent]  # positive = agent under-scored; negative = agent over-scored
delta_abs: [abs(delta)]
agreement: exact | within_1 | disagreement  # disagreement = delta_abs >= 2
evidence_quality_match: yes | partial | no
```
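The numeric fields of this record follow mechanically from the two scores; a minimal Python sketch:

```python
def compare(gold: int, agent: int) -> dict:
    """Derive delta, delta_abs and the agreement category for one criterion."""
    delta = gold - agent  # positive = agent under-scored
    d = abs(delta)
    if d == 0:
        agreement = "exact"
    elif d == 1:
        agreement = "within_1"
    else:
        agreement = "disagreement"
    return {"delta": delta, "delta_abs": d, "agreement": agreement}
```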

> **Read `references/calibration-protocol.md`** for the full comparison procedure and for how to handle cases where you believe the gold-set benchmark may itself be wrong.

---

## Step 4 — Agreement Metric Computation

> **Read `references/agreement-metrics.md`** before this step. It contains the formulas, computation steps, and interpretation guidance.

Key metrics to compute:

**Binary agreement:** (exact matches) / (total scored criteria)

**Per-criterion reliability flag:**
- `reliable`: max_delta ≤ 1 across all artifacts
- `moderate`: max_delta = 2 in isolated cases
- `unreliable`: max_delta ≥ 2 systematically
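A sketch of the flag logic, reading "systematically" as more than one artifact with a 2+ point miss (that reading is an assumption; adjust it to your calibration protocol):

```python
def reliability_flag(deltas_abs) -> str:
    """Classify one criterion from its |delta| values across all artifacts."""
    big = sum(d >= 2 for d in deltas_abs)  # count of 2+-point misses
    if big == 0:
        return "reliable"    # never off by more than 1
    if big == 1:
        return "moderate"    # an isolated 2-point miss
    return "unreliable"      # repeated large misses
```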

**Overall Spearman ρ:** rank correlation of overall scores across artifacts

**Verdict:** `pass_for_production` / `borderline` / `not_ready`

---

## Step 5 — Root Cause Analysis

For each `disagreement` (delta ≥ 2) or `unreliable` criterion, investigate:

1. **Ambiguous level descriptor?** — Does the `scoring_guide` clearly distinguish adjacent levels?
2. **Missing decision tree?** — Does the criterion have a `decision_tree`?
3. **Evidence classification issue?** — Is the disagreement about observed vs. inferred evidence?
4. **Anti-pattern match?** — Did the artifact exhibit an anti-pattern that the gold-set scored differently?
5. **Context sensitivity?** — Does this criterion mean something different for different artifact sub-types?

For each root cause, recommend a specific rubric improvement (which field to change, and how).

---

## Step 6 — Calibration Report

Save the report to `calibration/results/[rubric-id]-calibration-[YYYY-MM-DD].yaml`

> **Read `references/calibration-protocol.md#report-format`** for the full YAML and markdown report templates.

---

## Step 7 — Recalibration Triggers

After saving results, check whether recalibration is needed sooner than the standard 6-month cycle. Recalibrate when:
- Profile criteria change materially (version bump)
- A new overlay is introduced
- Agreement drops below targets on any criterion
- New artifact formats appear (new diagramming tools, document formats)
- The agent model changes materially
- Governance expectations change

Tell the user: "Schedule next calibration check: [6 months from today or at next profile revision]."

---

## Non-Negotiable Rules

1. **Score independently first.** Never look at the gold-set before producing your own assessment.
2. **Don't calibrate to pass.** If you systematically disagree with the gold-set, flag it — the gold-set may need review too.
3. **Unreliable criteria must be fixed before production.** κ < 0.50 = not usable in governance.
4. **Calibration is ongoing.** A profile that passes today must be recalibrated after any material change.

---

## When to Read Which Reference File

| When | Read |
|------|------|
| Before Step 3 (always) | `references/calibration-protocol.md` |
| Computing agreement metrics (Step 4) | `references/agreement-metrics.md` |
| Interpreting κ values | `references/agreement-metrics.md#interpretation` |
| Writing the calibration report | `references/calibration-protocol.md#report-format` |
| Investigating disagreements (Step 5) | `references/calibration-protocol.md#root-cause-analysis` |
| Unsure if a gold-set benchmark is correct | `references/calibration-protocol.md#gold-set-disagreement` |
`package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md`
@@ -0,0 +1,188 @@
+ # Agreement Metrics — EAROS Calibration
+
+ This file explains the agreement metrics used in EAROS calibration, how to compute them, and how to interpret the results. Read this before Step 4 (agreement metric computation).
+
+ ---
+
+ ## Why These Metrics?
+
+ EAROS uses three complementary metrics because each answers a different question:
+
+ - **Binary agreement** answers: "How often do we get the exact same score?"
+ - **Cohen's κ (ordinal/weighted)** answers: "Is our agreement better than random chance, accounting for the ordinal scale?"
+ - **Spearman ρ** answers: "Do we agree on which artifacts are better than which others?"
+
+ A rubric that passes all three is reliable. A rubric that passes only one may have a specific problem (e.g., good rank-ordering but systematic score inflation).
+
+ ---
+
+ ## Metric 1 — Binary Agreement (Exact Match Rate)
+
+ **What it measures:** Percentage of criteria where agent score = gold-set score exactly.
+
+ **Formula:**
+ ```
+ binary_agreement = (count of exact matches) / (total scored criteria) × 100%
+ ```
+
+ **Target:** > 95%
+
+ **Computation:**
+ 1. For each criterion in each artifact, compare `agent_score` to `gold_score`
+ 2. Count exact matches (delta = 0)
+ 3. Divide by total scored criteria (exclude N/A criteria from both sets)
+
+ **Example:**
+ ```
+ Criteria scored:   30 (3 artifacts × 10 criteria)
+ Exact matches:     24
+ Binary agreement:  24/30 = 80% ← BELOW TARGET — investigate
+ ```
+
+ **Interpretation:**
+ - Above 95%: Excellent — rubric is highly reproducible
+ - 85–95%: Acceptable — minor rubric refinements may help
+ - Below 85%: Concerning — systematic disagreements; investigate root causes
+
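Under the definitions above, Metric 1 reduces to a few lines. A minimal Python sketch (function and data names are illustrative; `None` stands for an N/A criterion):

```python
# Exact-match rate over paired criterion scores.
# None marks an N/A criterion; N/A entries are excluded from both sets.
def binary_agreement(agent_scores, gold_scores):
    pairs = [(a, g) for a, g in zip(agent_scores, gold_scores)
             if a is not None and g is not None]
    matches = sum(1 for a, g in pairs if a == g)
    return 100.0 * matches / len(pairs)

# Mirrors the worked example: 30 scored criteria, 24 exact matches
agent = [2] * 30
gold = [2] * 24 + [3] * 6
print(binary_agreement(agent, gold))  # 80.0
```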
+ ---
+
+ ## Metric 2 — Ordinal Cohen's κ (Weighted Kappa) {#interpretation}
+
+ **What it measures:** Agreement corrected for chance, using linear weights that give partial credit for near-misses (delta = 1 counts as partial agreement).
+
+ **Why weighted (not simple) kappa:** EAROS uses a 0–4 ordinal scale. Disagreeing by 1 point (scoring 2 vs. 3) is less serious than disagreeing by 2 points (scoring 1 vs. 3). Weighted kappa captures this.
+
+ **Linear weight matrix (5-point scale):**
+ ```
+           Gold 0  Gold 1  Gold 2  Gold 3  Gold 4
+ Agent 0:   1.00    0.75    0.50    0.25    0.00
+ Agent 1:   0.75    1.00    0.75    0.50    0.25
+ Agent 2:   0.50    0.75    1.00    0.75    0.50
+ Agent 3:   0.25    0.50    0.75    1.00    0.75
+ Agent 4:   0.00    0.25    0.50    0.75    1.00
+ ```
+
+ **Computing kappa without a statistics library (simplified):**
+
+ Step 1: Build the confusion matrix (rows = agent scores, columns = gold scores, cells = counts).
+
+ Step 2: Compute observed weighted agreement:
+ ```
+ P_obs = sum(weight[i,j] × n[i,j]) / N
+ where n[i,j] = count at row i, col j; N = total observations
+ ```
+
+ Step 3: Compute expected weighted agreement (under chance, based on the marginals):
+ ```
+ P_exp = sum(weight[i,j] × row_marginal[i] × col_marginal[j]) / N²
+ ```
+
+ Step 4: κ = (P_obs - P_exp) / (1 - P_exp)
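The four steps can be collapsed into one short function. A Python sketch, assuming integer scores on the 0–4 scale and two lists aligned per criterion (names are illustrative):

```python
# Linearly weighted Cohen's kappa, following Steps 1-4 above.
# Undefined when P_exp == 1 (all scores fall in a single category).
def weighted_kappa(agent_scores, gold_scores, k=5):
    N = len(agent_scores)
    # Step 1: confusion matrix (rows = agent, cols = gold)
    n = [[0] * k for _ in range(k)]
    for a, g in zip(agent_scores, gold_scores):
        n[a][g] += 1
    # Linear weights: w[i][j] = 1 - |i - j| / (k - 1)
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    row = [sum(n[i][j] for j in range(k)) for i in range(k)]  # agent marginals
    col = [sum(n[i][j] for i in range(k)) for j in range(k)]  # gold marginals
    # Step 2: observed weighted agreement
    p_obs = sum(w[i][j] * n[i][j] for i in range(k) for j in range(k)) / N
    # Step 3: expected weighted agreement under chance
    p_exp = sum(w[i][j] * row[i] * col[j]
                for i in range(k) for j in range(k)) / N ** 2
    # Step 4
    return (p_obs - p_exp) / (1 - p_exp)
```

Perfect agreement on varied scores gives κ = 1.0; perfectly reversed scores give a negative κ, as expected for worse-than-chance agreement.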
+
+ **Simplified approximation for small samples (3 artifacts):**
+ - All deltas ≤ 1: κ likely > 0.70 (reliable)
+ - Some deltas = 2: κ likely 0.50–0.70 (moderate)
+ - Any delta = 3–4: κ likely < 0.50 (unreliable)
+
+ **Target thresholds:**
+ - κ > 0.70: Substantial agreement — criterion reliable for governance use
+ - κ 0.50–0.70: Moderate — acceptable for subjective criteria; sharpen level descriptors
+ - κ < 0.50: Poor — criterion needs revision before production use
+
+ **EAROS-specific expectations:**
+ - Criteria with clear observable conditions (SCP-01, MNT-01): typically κ > 0.75
+ - Criteria requiring judgment (RAT-01 trade-off quality): typically κ 0.55–0.70
+ - Criteria evaluating depth (CMP-01 compliance treatment): hardest — target κ > 0.50
+
+ ---
+
+ ## Metric 3 — Spearman Rank Correlation (ρ)
+
+ **What it measures:** Whether agent and gold-set agree on the rank ordering of artifacts by quality.
+
+ **Why it matters:** Even if absolute scores differ, a rubric is useful if it reliably distinguishes strong artifacts from weak ones. Good rank correlation means the rubric correctly identifies which artifacts need more work.
+
+ **Formula:**
+ ```
+ ρ = 1 - (6 × Σd²) / (n × (n² - 1))
+ where d = rank difference for each artifact; n = number of artifacts
+ ```
+
+ **Example (3 artifacts, perfect ordering):**
+ ```
+ Artifact   Gold_rank  Agent_rank   d   d²
+ Strong         1          1        0   0
+ Adequate       2          2        0   0
+ Weak           3          3        0   0
+ ρ = 1 - (6×0)/(3×8) = 1.00
+ ```
+
+ **Example with rank reversal:**
+ ```
+ Artifact   Gold_rank  Agent_rank   d   d²
+ Strong         1          2       -1   1
+ Adequate       2          1        1   1
+ Weak           3          3        0   0
+ ρ = 1 - (6×2)/(3×8) = 1 - 0.5 = 0.50 ← Moderate
+ ```
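Both worked examples can be checked with a direct transcription of the formula. A sketch, assuming precomputed ranks with no ties (names are illustrative):

```python
# Spearman rho from rank lists (no ties), per the formula above.
def spearman_rho(gold_ranks, agent_ranks):
    n = len(gold_ranks)
    d2 = sum((g - a) ** 2 for g, a in zip(gold_ranks, agent_ranks))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

print(spearman_rho([1, 2, 3], [1, 2, 3]))  # perfect ordering: 1.0
print(spearman_rho([1, 2, 3], [2, 1, 3]))  # adjacent swap: 0.5
```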
+
+ **Target:** ρ > 0.80
+
+ **Interpretation:**
+ - ρ > 0.80: Good ordering — rubric reliably distinguishes quality levels
+ - ρ 0.60–0.80: Moderate — some rank reversals; investigate the reversed artifacts
+ - ρ < 0.60: Concerning — systematic misclassification of quality levels
+
+ ---
+
+ ## Per-Criterion Reliability Flags
+
+ After computing metrics, flag each criterion:
+
+ ```
+ criterion_id | mean_delta | max_delta | binary_agreement | reliability_flag
+ CRITERION-01 |    0.2     |     1     |       93%        | reliable
+ CRITERION-02 |    0.8     |     2     |       73%        | moderate
+ CRITERION-03 |    1.5     |     3     |       33%        | unreliable
+ ```
+
+ **Flag definitions:**
+ - `reliable`: max_delta ≤ 1 across all artifacts; binary agreement ≥ 90%
+ - `moderate`: max_delta = 2 in some cases; binary agreement 70–90%
+ - `unreliable`: max_delta ≥ 2 systematically; binary agreement < 70%
+
+ Criteria flagged `unreliable` must be revised before production use.
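The flag definitions can be applied mechanically. A sketch, assuming per-artifact deltas (|agent − gold|) for one criterion; interpreting "systematically" as a majority of artifacts is an assumption for illustration, not EAROS policy:

```python
# Assigns a reliability flag from per-artifact deltas for one criterion.
def reliability_flag(deltas):
    agreement = 100.0 * sum(1 for d in deltas if d == 0) / len(deltas)
    big = sum(1 for d in deltas if d >= 2)
    if max(deltas) <= 1 and agreement >= 90:
        return "reliable"
    if big > len(deltas) / 2 and agreement < 70:  # delta >= 2 "systematically"
        return "unreliable"
    return "moderate"
```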
+
+ ---
+
+ ## Overall Calibration Verdict
+
+ | Metric | Result | Target | Status |
+ |--------|--------|--------|--------|
+ | Binary agreement | X% | > 95% | pass/warn/fail |
+ | Criteria with κ > 0.70 | X of N | > 80% | pass/warn/fail |
+ | Criteria with κ < 0.50 | N | 0 | pass/warn/fail |
+ | Spearman ρ | X.XX | > 0.80 | pass/warn/fail |
+
+ **Verdict rules:**
+ - `pass_for_production`: All metrics pass; no unreliable criteria
+ - `borderline`: One metric borderline or 1–2 unreliable criteria; review before production
+ - `not_ready`: Multiple metrics fail or 3+ unreliable criteria; rubric revision required
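The verdict rules can be sketched as a function. The per-metric status strings follow the table above; treating a single `fail` as `borderline` is an assumption for illustration:

```python
# Maps per-metric statuses plus the unreliable-criteria count to a verdict.
def calibration_verdict(metric_statuses, unreliable_count):
    fails = metric_statuses.count("fail")
    warns = metric_statuses.count("warn")
    if fails >= 2 or unreliable_count >= 3:
        return "not_ready"            # multiple failures: revise the rubric
    if fails == 1 or warns >= 1 or 1 <= unreliable_count <= 2:
        return "borderline"           # review before production
    return "pass_for_production"
```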
+
+ ---
+
+ ## What to Do with Results
+
+ ### Pass
+ Profile ready for candidate status. Update `status: draft` → `status: candidate`.
+
+ ### Borderline
+ Review the specific criteria that missed targets:
+ 1. Is the `decision_tree` present, and are its branches clear?
+ 2. Does the `scoring_guide` specifically distinguish between levels 2 and 3?
+ 3. Are `examples.good` and `examples.bad` realistic and quotable?
+
+ Update flagged criteria and re-run calibration on those criteria only.
+
+ ### Not Ready
+ Do not advance to candidate. Revise the rubric per the root-cause analysis in `references/calibration-protocol.md#root-cause-analysis`, then re-run full calibration.