@trohde/earos 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (135)
  1. package/README.md +156 -0
  2. package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
  3. package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
  4. package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
  5. package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
  6. package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
  7. package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
  8. package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
  9. package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
  10. package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
  11. package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
  12. package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
  13. package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
  14. package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
  15. package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
  16. package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
  17. package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
  18. package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
  19. package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
  20. package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
  21. package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
  22. package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
  23. package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
  24. package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
  25. package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
  26. package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
  27. package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
  28. package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
  29. package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
  30. package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
  31. package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
  32. package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
  33. package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
  34. package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
  35. package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
  36. package/assets/init/.claude/CLAUDE.md +4 -0
  37. package/assets/init/AGENTS.md +293 -0
  38. package/assets/init/CLAUDE.md +635 -0
  39. package/assets/init/README.md +507 -0
  40. package/assets/init/calibration/gold-set/.gitkeep +0 -0
  41. package/assets/init/calibration/results/.gitkeep +0 -0
  42. package/assets/init/core/core-meta-rubric.yaml +643 -0
  43. package/assets/init/docs/consistency-report.md +325 -0
  44. package/assets/init/docs/getting-started.md +194 -0
  45. package/assets/init/docs/profile-authoring-guide.md +51 -0
  46. package/assets/init/docs/terminology.md +126 -0
  47. package/assets/init/earos.manifest.yaml +104 -0
  48. package/assets/init/evaluations/.gitkeep +0 -0
  49. package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
  50. package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
  51. package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
  52. package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
  53. package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
  54. package/assets/init/overlays/data-governance.yaml +94 -0
  55. package/assets/init/overlays/regulatory.yaml +154 -0
  56. package/assets/init/overlays/security.yaml +92 -0
  57. package/assets/init/profiles/adr.yaml +225 -0
  58. package/assets/init/profiles/capability-map.yaml +223 -0
  59. package/assets/init/profiles/reference-architecture.yaml +426 -0
  60. package/assets/init/profiles/roadmap.yaml +205 -0
  61. package/assets/init/profiles/solution-architecture.yaml +227 -0
  62. package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
  63. package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
  64. package/assets/init/research/reference-architecture-research.md +751 -0
  65. package/assets/init/standard/EAROS.md +1426 -0
  66. package/assets/init/standard/schemas/artifact.schema.json +1295 -0
  67. package/assets/init/standard/schemas/artifact.uischema.json +65 -0
  68. package/assets/init/standard/schemas/evaluation.schema.json +284 -0
  69. package/assets/init/standard/schemas/rubric.schema.json +383 -0
  70. package/assets/init/templates/evaluation-record.template.yaml +58 -0
  71. package/assets/init/templates/new-profile.template.yaml +65 -0
  72. package/bin.js +188 -0
  73. package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
  74. package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
  75. package/dist/assets/arc-CyDBhtDM.js +1 -0
  76. package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
  77. package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
  78. package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
  79. package/dist/assets/channel-CiySTNoJ.js +1 -0
  80. package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
  81. package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
  82. package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
  83. package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
  84. package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
  85. package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
  86. package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
  87. package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
  88. package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
  89. package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
  90. package/dist/assets/clone-ylgRbd3D.js +1 -0
  91. package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
  92. package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
  93. package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
  94. package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
  95. package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
  96. package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
  97. package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
  98. package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
  99. package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
  100. package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
  101. package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
  102. package/dist/assets/graph-DLQn37b-.js +1 -0
  103. package/dist/assets/index-BFFITMT8.js +650 -0
  104. package/dist/assets/index-H7f6VTz1.css +1 -0
  105. package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
  106. package/dist/assets/init-Gi6I4Gst.js +1 -0
  107. package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
  108. package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
  109. package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
  110. package/dist/assets/katex-B1X10hvy.js +261 -0
  111. package/dist/assets/layout-C0dvb42R.js +1 -0
  112. package/dist/assets/linear-j4a8mGj7.js +1 -0
  113. package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
  114. package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
  115. package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
  116. package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
  117. package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
  118. package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
  119. package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
  120. package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
  121. package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
  122. package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
  123. package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
  124. package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
  125. package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
  126. package/dist/index.html +23 -0
  127. package/export-docx.js +1583 -0
  128. package/init.js +353 -0
  129. package/manifest-cli.mjs +207 -0
  130. package/package.json +83 -0
  131. package/schemas/artifact.schema.json +1295 -0
  132. package/schemas/artifact.uischema.json +65 -0
  133. package/schemas/evaluation.schema.json +284 -0
  134. package/schemas/rubric.schema.json +383 -0
  135. package/serve.js +238 -0
package/assets/init/standard/EAROS.md @@ -0,0 +1,1426 @@
# Enterprise Architecture Rubric Operational Standard (EAROS)

*Version: 2.0
Status: Draft for review
Audience: Enterprise architects, architecture boards, design authorities, reviewers, and LLM/automation teams*

> See [docs/terminology.md](../docs/terminology.md) for definitions of all technical terms used in this standard.

## 1. Purpose

This document defines an operational standard for creating, governing, extending, and applying rubrics to enterprise architecture artifacts and related architecture work products.

The goal is to make architecture evaluation more consistent, more explainable, and more automatable. The standard is designed for both human review and agentic application. It standardizes:

* what a rubric is
* how a rubric is structured
* how criteria are scored
* how evidence is recorded
* how artifact-specific profiles are added
* how results are documented and governed
* how agents can help develop and apply rubrics safely

This standard rests on a simple but important three-way distinction:

1. **Artifact quality** — whether the artifact is complete, coherent, clear, traceable, and fit for its stated purpose
2. **Architectural fitness** — whether the architecture described appears sound relative to business drivers, quality attributes, risks, and tradeoffs
3. **Governance fit** — whether the artifact and/or proposed design complies with mandatory principles, standards, controls, and review expectations

These must not be collapsed into one opaque score. They are related but distinct judgments.

### What is new in version 2.0

Version 2.0 incorporates findings from a comprehensive research programme conducted in March 2026, covering 63 sources across academic literature, industry frameworks, and emerging AI evaluation methods. Key enhancements include:

1. **Strengthened design foundations** with explicit alignment to Microsoft's LLM-Rubric (ACL 2024), the RULERS framework (2026), and the CARO confusion-aware rubric optimization approach.
2. **New principle on artifact machine-readability** (Principle 8), establishing requirements for structured, parseable artifact formats.
3. **Expanded scoring guidance** with discussion of the 0–3 versus 0–4 scale tradeoff based on empirical evidence.
4. **Enhanced agentic operating pattern** with DAG-based evaluation flow, the RULERS evidence-anchoring protocol, and calibrated aggregation.
5. **New section on artifact format requirements** covering the ArchiMate exchange format, machine-readable ADR formats, diagram-as-code, and metadata conventions.
6. **Expanded calibration methodology** with explicit inter-rater reliability targets and weighted kappa benchmarks.
7. **New section on rubric design for AI agents** with seven empirically grounded design principles.
8. **Updated references** with 63 additional sources.

## 2. Design foundations

This standard is intentionally aligned to widely used architecture and evaluation concepts:

* ISO/IEC/IEEE 42010 frames architecture descriptions around stakeholders, concerns, viewpoints, and views rather than generic documentation alone. [R1]
* The Open Group describes architecture compliance review as a scrutiny of a project against established architectural criteria, spirit, and business objectives. [R2]
* SEI's ATAM treats architecture evaluation as an explicit examination of quality attribute goals, tradeoffs, and risks that may inhibit business goals. [R3]
* NIST's AI RMF Playbook is useful as an operational pattern because it combines governance, mapping, measurement, and management in an iterative, non-checklist way and is also available in structured formats such as JSON and CSV. [R4]
* ISO/IEC 25010:2023 provides a reference model in which quality characteristics are defined so they can be specified, measured, and evaluated. [R5]
* Microsoft's LLM-Rubric (ACL 2024) demonstrates that multidimensional, calibrated rubric evaluation using LLMs can achieve a 2× improvement in alignment with human judges when scoring is decomposed per dimension and combined through a learned aggregation function. [R7]
* The RULERS framework (2026) addresses three failure modes of LLM judges — instability under prompt variations, unverifiable reasoning, and distributional misalignment — through rubric locking, evidence-anchored scoring, and Wasserstein-based calibration. [R8]
* Snorkel AI's rubric engineering methodology treats rubrics as models that should be co-developed with domain experts, calibrated until inter-rater reliability stabilizes, and measured using the same agreement statistics (Cohen's kappa, QWK, ICC) as human annotators. [R9]
* The CARO framework (Confusion-Aware Rubric Optimization, 2026) demonstrates that rubric ambiguity can be systematically identified and resolved through targeted rule patches that address specific misclassification patterns without degrading global performance. [R10]
* AWS's Well-Architected AI assessment using Amazon Bedrock provides a production-validated pattern for automated architecture review against structured pillar-based frameworks. [R11]
* A 2024 ACM Computing Surveys systematic literature review of 109 EA evaluation methods confirms that automation of assessment and architecture modeling processes is the critical factor for widespread adoption. [R12]

The standard below is therefore not just a scorecard template. It is an operating model for architecture evaluation.

## 3. Core principles

### 3.1 Principle 1 — Concern-driven, not document-driven

Rubrics evaluate whether an artifact answers the stakeholder concerns it exists to address. A beautiful artifact that does not answer the decision at hand should not score well.

### 3.2 Principle 2 — Evidence first

Every score must point to evidence in the artifact or explicitly state that evidence is missing.

### 3.3 Principle 3 — Gates before averages

Mandatory failures must not be hidden by weighted averages.

### 3.4 Principle 4 — Explainability over fake precision

Prefer a disciplined ordinal scale with clear guidance over false numerical granularity.

### 3.5 Principle 5 — Separate observation from inference

Reviewers and agents must distinguish:

* **Observed**: directly supported by artifact evidence
* **Inferred**: reasonable interpretation not directly stated
* **External**: judgment based on a standard, policy, or source outside the artifact

### 3.6 Principle 6 — Rubrics are governed assets

Rubrics, profiles, and overlays are versioned, reviewed, calibrated, and changed under governance.

### 3.7 Principle 7 — Agentic use must remain auditable

An agent may propose, score, summarize, and challenge, but the evaluation record must remain inspectable by a human reviewer.

### 3.8 Principle 8 — Machine-readable where possible

Architecture artifacts should be structured in machine-parseable formats to enable reliable automated extraction, validation, and assessment. This includes structured metadata (YAML frontmatter), schema-validated models (ArchiMate exchange format, OpenAPI), diagram-as-code representations, and consistent section markers. Machine-readability is not an end in itself but a prerequisite for consistent, scalable, and reproducible evaluation.

## 4. Scope

This standard can be applied to any architecture artifact, including but not limited to:

* enterprise capability maps
* business architecture views
* application and integration architecture diagrams
* target-state architectures
* transition architectures
* solution architectures
* architecture decision records (ADRs)
* principle sets
* technology standards profiles
* roadmaps
* data architecture artifacts
* security architecture artifacts
* operating model and platform strategy documents

## 5. Operating model

The standard uses a three-layer composition model.

### 5.1 Layer 1 — Core meta-rubric

The core meta-rubric applies to **all** architecture artifacts. It standardizes the base dimensions and scoring model.

### 5.2 Layer 2 — Artifact profile

Each artifact type has a profile that adds or specializes criteria for that artifact class.

Examples:

* Capability Map Profile
* Solution Architecture Profile
* Target State Profile
* ADR Profile
* Roadmap Profile

### 5.3 Layer 3 — Context overlay

Overlays add criteria or gates required by a specific context.

Examples:

* Security Overlay
* Data Governance Overlay
* Regulatory Overlay
* Cloud Platform Overlay
* Critical Production Change Overlay

This layered model is important because one global rubric becomes too generic, while fully bespoke rubrics become ungovernable.

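As a sketch, the three layers might be composed declaratively per evaluation. The manifest shape, field names, and version tags below are illustrative assumptions, not defined by this standard:

```yaml
# Illustrative layered composition for one evaluation (all IDs are examples)
evaluation_setup:
  core: core-meta-rubric@2.0            # Layer 1: applies to every artifact
  profile: solution-architecture@1.3    # Layer 2: artifact-class criteria
  overlays:                             # Layer 3: context-specific criteria/gates
    - security@1.1
    - regulatory@2.0
```
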
## 6. Standard rubric anatomy

Every rubric, profile, and overlay must include the following fields.

### 6.1 Mandatory metadata

* rubric_id
* title
* version
* status (draft, approved, deprecated)
* owner
* artifact_type
* review_purpose
* intended_decision_type
* applicable_stakeholders
* applicable_viewpoints
* scale_definition
* status_rules
* change_log
* effective_date

### 6.2 Mandatory criterion fields

Every criterion must define:

* criterion_id
* name
* dimension
* question
* description
* metric_type
* score_scale
* weight
* gate_type
* required_evidence
* scoring_guidance
* anti_patterns
* dependencies
* applicability_rule
* rationale

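A minimal criterion instantiating these fields might look as follows; all values are illustrative, not normative:

```yaml
# Illustrative criterion using the mandatory fields above (values are examples)
criterion_id: TRC-01
name: Capability traceability
dimension: traceability
question: Can every major component be traced to a business capability?
description: Checks that components map to capabilities in the capability model.
metric_type: ordinal
score_scale: 0-4
weight: 2
gate_type: major
required_evidence:
  - traceability table rows
  - diagram identifiers
scoring_guidance: >
  Score 4 only when every major component has an explicit capability mapping.
anti_patterns:
  - mappings asserted in prose with no table or diagram reference
dependencies: []
applicability_rule: always
rationale: Untraceable components cannot be governed against business drivers.
```
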
### 6.3 Mandatory evaluation output fields

Every evaluation result must capture:

* artifact_id
* artifact_type
* rubric_id
* rubric_version
* evaluated_by
* evaluation_mode (human, agent, hybrid)
* evaluation_date
* criterion_results
* dimension_results
* gate_failures
* overall_status
* overall_score
* confidence
* evidence_gaps
* recommended_actions
* decision_summary

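A skeleton evaluation record carrying these fields might serialize as follows (all values are illustrative; the package's own `templates/evaluation-record.template.yaml` is authoritative):

```yaml
# Illustrative evaluation record skeleton (values are examples only)
artifact_id: SA-2041
artifact_type: solution-architecture
rubric_id: solution-architecture-profile
rubric_version: "1.3"
evaluated_by: jane.doe
evaluation_mode: hybrid
evaluation_date: 2026-03-14
criterion_results:
  - criterion_id: TRC-01
    score: 3
    confidence: high
dimension_results:
  traceability: 3.0
gate_failures: []
overall_status: conditional-pass
overall_score: 2.9
confidence: medium
evidence_gaps:
  - no risk register reference for integration risks
recommended_actions:
  - add risk records for the two unmanaged integration points
decision_summary: Conditionally approved pending risk documentation.
```
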
## 7. Scoring standard

### 7.1 Recommended scale

Use a **0–4 ordinal scale** plus N/A.

* **0 — Absent / contradicted**
  No meaningful evidence, or evidence directly contradicts the criterion
* **1 — Weak**
  Criterion is acknowledged or implied, but inadequate for decision support
* **2 — Partial**
  Criterion is explicitly addressed, but coverage is incomplete, inconsistent, or weakly evidenced
* **3 — Good**
  Criterion is clearly addressed with adequate evidence and only minor gaps
* **4 — Strong**
  Criterion is fully addressed, well evidenced, internally consistent, and decision-ready
* **N/A — Not applicable**
  Criterion genuinely does not apply in the stated scope and context

### 7.2 Why this scale

The 0–4 scale is fine-grained enough to distinguish quality levels but simple enough for humans and agents to apply consistently. A 1–10 scale creates false precision and lowers calibration quality.

### 7.3 Gate types

Use explicit gate categories:

* none — contributes to score only
* advisory — weak performance triggers recommendations
* major — significant weakness may cap status
* critical — failure blocks pass status regardless of average

### 7.4 Status model

A recommended default decision model:

* **Pass**
  + no critical gate failures
  + overall weighted score >= 3.2
  + no dimension score < 2.0
* **Conditional pass**
  + no critical gate failures
  + overall weighted score >= 2.4 and < 3.2
  + weaknesses are containable with named actions and owners
* **Rework required**
  + overall weighted score < 2.4
  + or repeated weak dimensions
  + or insufficient evidence for decision support
* **Not reviewable / reject**
  + critical gate failure
  + artifact does not match the declared type
  + artifact purpose is unclear
  + evidence is too incomplete to score responsibly

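The default status model above can be sketched in code. The input shapes are assumptions, and routing a high average with one weak dimension to conditional pass is one reasonable reading of the rules, not a normative ruling:

```python
def overall_status(weighted_score, dimension_scores, critical_gate_failed,
                   evidence_sufficient=True, type_matches=True):
    """Sketch of the default EAROS status model; input shapes are illustrative."""
    # Critical gate failures and unscorable artifacts block everything else.
    if critical_gate_failed or not type_matches or not evidence_sufficient:
        return "not-reviewable"
    # Pass needs both the overall threshold and no weak dimension.
    if weighted_score >= 3.2 and min(dimension_scores) >= 2.0:
        return "pass"
    # Conditional pass covers scores in [2.4, 3.2), and (by this sketch's
    # choice) high averages dragged down by a single weak dimension.
    if weighted_score >= 2.4:
        return "conditional-pass"
    return "rework-required"
```
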
### 7.5 Confidence standard

Confidence must be recorded separately from score.

Use:

* low
* medium
* high

Confidence is driven by:

* completeness of evidence
* clarity of scope
* ambiguity of criterion
* consistency across sections/views
* reviewer certainty

Do **not** multiply score by confidence. A low-confidence evaluation and a poor artifact are different problems.

### 7.6 Scale design tradeoff: 0–3 versus 0–4

Empirical research on LLM-based evaluation suggests that smaller scales (0–3) produce higher inter-rater reliability between AI agents and human reviewers. The 0–3 scale reduces the ambiguous middle zone that plagues wider scales.

However, the 0–4 scale used in EAROS v1 was chosen deliberately to distinguish between "good" (adequate with minor gaps) and "strong" (fully addressed and decision-ready). This distinction matters in architecture governance because the gap between "adequate" and "excellent" is precisely where reviewer judgment adds most value.

**Recommended approach:** Retain the 0–4 scale for human and hybrid evaluation modes. For fully agentic evaluation, organisations may optionally collapse to a 0–3 scale by merging levels 3 and 4, provided the rubric profiles define the mapping explicitly. Any scale collapse must be recorded in the evaluation metadata.

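The optional collapse can be made explicit as a declared mapping, which is what the profile would record. A minimal sketch:

```python
# Illustrative 0-4 to 0-3 collapse: levels 3 ("good") and 4 ("strong") merge.
# The mapping itself must be declared in the profile and noted in metadata.
COLLAPSE_4_TO_3 = {0: 0, 1: 1, 2: 2, 3: 3, 4: 3}

def collapse_score(score_0_4):
    """Map a 0-4 ordinal score onto the 0-3 scale for fully agentic runs."""
    return COLLAPSE_4_TO_3[score_0_4]
```
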
### 7.7 Inter-rater reliability targets

Based on empirical evidence from the RULERS framework, Microsoft's LLM-Rubric, and Snorkel AI's calibration methodology, the following inter-rater reliability targets apply:

* **Binary criteria** (present/absent): target agreement > 95% between agent and human reviewer
* **Ordinal criteria** (0–4 scale): target weighted Cohen's kappa > 0.70 (substantial agreement) for well-defined criteria; > 0.50 (moderate agreement) for subjective criteria
* **Overall assessment**: target Spearman's rho > 0.80 correlation with expert human reviewers
* **Adjacent-score tolerance**: for ordinal criteria, disagreements of exactly one level are treated as soft disagreements in calibration metrics

These targets should be validated through calibration exercises during profile development and monitored in production.

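Quadratic weighted kappa for ordinal criteria can be computed without external libraries. This is an illustrative pure-Python implementation for agent-versus-human calibration, not a mandated one:

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, levels=5):
    """Quadratic weighted Cohen's kappa for paired ordinal scores 0..levels-1."""
    n = len(rater_a)
    # Observed co-occurrence counts of (score_a, score_b) pairs.
    obs = [[0.0] * levels for _ in range(levels)]
    for a, b in zip(rater_a, rater_b):
        obs[a][b] += 1
    marg_a, marg_b = Counter(rater_a), Counter(rater_b)
    num = den = 0.0
    for i in range(levels):
        for j in range(levels):
            w = (i - j) ** 2 / (levels - 1) ** 2      # quadratic penalty
            num += w * obs[i][j]                       # observed disagreement
            den += w * marg_a[i] * marg_b[j] / n       # chance disagreement
    return 1.0 if den == 0 else 1.0 - num / den
```
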
## 8. Core meta-rubric dimensions

The core meta-rubric should apply across all architecture artifact types.

### 8.1 Stakeholder and purpose fit

Is the artifact explicit about who it serves, what decision it supports, and why it exists?

### 8.2 Scope and boundary clarity

Does it define scope, exclusions, assumptions, constraints, and boundaries clearly enough to avoid misreading?

### 8.3 Concern coverage and viewpoint appropriateness

Does the artifact address the relevant concerns using views or structures that fit the problem?

### 8.4 Traceability

Can the artifact be traced to business drivers, requirements, principles, policies, and key decisions?

### 8.5 Internal consistency and integrity

Are the claims, models, views, and narratives internally coherent, or do they conflict?

### 8.6 Risk, assumptions, constraints, and tradeoffs

Does it identify meaningful risks, assumptions, constraints, and architectural tradeoffs?

### 8.7 Standards and policy compliance

Does it demonstrate alignment with mandatory standards, controls, and governance expectations?

### 8.8 Actionability and implementation relevance

Can downstream teams act on it? Does it inform design, sequencing, funding, governance, or delivery choices?

### 8.9 Maintainability of the artifact

Is ownership clear? Is the artifact versioned, current enough, and structured so it can be kept useful over time?

## 9. Metric types

Criteria may use different measurement patterns, but they must be declared explicitly.

### 9.1 Ordinal

Used when maturity or adequacy is judged on a defined rubric scale.
Example: "How complete is traceability to business capabilities?"

### 9.2 Binary

Used for hard checks.
Example: "Is the artifact owner named?"

### 9.3 Enumerated categorical

Used when one of a small set of states is appropriate.
Example: none, partial, full, not-applicable

### 9.4 Quantified ratio or count

Used sparingly.
Example: "Percentage of diagrams with legend and scope annotation"

### 9.5 Derived metric

Computed from multiple criterion results.
Example: concern coverage index, evidence sufficiency index, reviewability score

Quantified metrics can be useful, but architecture evaluation is usually strongest when the quantitative element supports rather than replaces reasoned judgment.

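As a sketch of a derived metric, a concern coverage index could be computed from criterion results. The data shape and the >= 2 ("at least partially addressed") cutoff are illustrative assumptions:

```python
def concern_coverage_index(criterion_results):
    """Illustrative derived metric: share of applicable criteria scoring >= 2.

    criterion_results: list of (score, applicable) pairs -- an example shape,
    not one defined by the standard.
    """
    applicable = [score for score, ok in criterion_results if ok]
    if not applicable:
        return None  # nothing applicable: the index is undefined, not zero
    return sum(1 for score in applicable if score >= 2) / len(applicable)
```
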
## 10. Evidence standard

### 10.1 Evidence anchors

Each criterion must specify what counts as acceptable evidence.

Typical evidence anchors:

* section references
* page numbers
* diagram identifiers
* ADR identifiers
* traceability table rows
* standards mappings
* risk records
* roadmap milestones
* linked source documents

### 10.2 Evidence sufficiency

Record one of:

* sufficient
* partial
* insufficient
* none

### 10.3 Evidence classes

Record each finding as:

* observed
* inferred
* external

### 10.4 Missing evidence is a first-class result

Evaluators must not silently fill gaps. Missing evidence must be visible in the record.

## 11. Documentation standard

Every rubric package must contain five documentation assets.

### 11.1 Rubric specification

The normative definition of dimensions, criteria, scale, gates, weights, applicability, and outputs.

### 11.2 Scoring guidance

A reviewer guide with:

* interpretation notes
* examples
* counterexamples
* anti-patterns
* boundary cases
* tie-break guidance

### 11.3 Calibration pack

A set of sample artifacts with benchmark evaluations and rationale.

### 11.4 Evaluation record template

The format for storing applied scores, evidence, rationale, gaps, and actions.

### 11.5 Decision log template

The format for recording disposition, review outcome, waivers, owners, due dates, and exceptions.

452
+ ## 12. Agentic operating pattern
453
+
454
+ ### 12.1 Recommended evaluation pattern
455
+
456
+ Use a **two-pass or three-pass model**.
457
+
458
+ #### Pass 1 — Extractor
459
+
460
+ An agent extracts evidence from the artifact, identifies sections, diagrams, traceability elements, and candidate evidence for each criterion.
461
+
462
+ #### Pass 2 — Evaluator
463
+
464
+ A second agent applies the rubric and produces criterion scores, rationale, confidence, and gaps.
465
+
466
+ #### Pass 3 — Challenger
467
+
468
+ A third agent challenges the evaluation by looking for:
469
+
470
+ * unsupported claims
471
+ * contradictory evidence
472
+ * rubric misuse
473
+ * over-scoring
474
+ * unacknowledged ambiguity
475
+ * missing gating issues
476
+
477
+ This pattern is safer than single-agent scoring because architecture review is vulnerable to confident but weak inference.
478
+
479
+ ### 12.2 Human review threshold
480
+
481
+ Require human review when:
482
+
483
+ * any critical gate fails
484
+ * confidence is low on a major criterion
485
+ * overall status changes because of inference rather than direct evidence
486
+ * the challenger disagrees materially with the evaluator
487
+ * the artifact is high impact or high risk
488
+
489
+ ### 12.3 Agent output standard
490
+
491
+ Agent-generated outputs must contain:
492
+
493
+ * criterion-by-criterion result
494
+ * evidence references
495
+ * rationale
496
+ * confidence
497
+ * explicit gaps
498
+ * recommended actions
499
+ * model/version metadata if required by internal policy
500
+
501
+
502
+ ### 12.4 DAG-based evaluation flow
503
+
504
+ For complex artifacts, structure the agentic evaluation as a Directed Acyclic Graph (DAG) rather than a linear three-pass model. Each node in the DAG is a specialised evaluation step that produces typed outputs consumed by downstream nodes.
505
+
506
+ Recommended DAG structure:
507
+
508
+ 1. **Structural validation** (binary checks): metadata completeness, section presence, schema conformance, naming conventions
509
+ 2. **Content extraction**: extract evidence for each criterion from identified sections
510
+ 3. **Criterion scoring** (parallel): each criterion scored independently with evidence anchoring
511
+ 4. **Cross-reference validation**: check consistency between criterion scores and between artifact sections
512
+ 5. **Dimension aggregation**: compute dimension-level scores from criterion results
513
+ 6. **Challenge pass**: identify unsupported claims, contradictions, over-scoring, and unacknowledged ambiguity
514
+ 7. **Calibration**: apply post-hoc calibration using the RULERS Wasserstein-based method or LLM-Rubric aggregation network
515
+ 8. **Status determination**: apply gate logic and threshold rules to determine overall status
516
+
517
+ This DAG structure is safer than a monolithic three-pass model because each node has a narrow, well-defined scope. It also enables partial re-evaluation when an artifact is updated.
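The flow above can be sketched as a pipeline of narrow-scope nodes. This is a minimal illustrative sketch, not a normative implementation: node names follow the list above, the `ctx` dictionary stands in for the typed outputs each node would produce, and the placeholder scoring rule is invented for the example.

```python
# Minimal sketch of the evaluation DAG as a pipeline of narrow-scope nodes.
# Only the first three nodes are shown; each reads and extends a shared context.

def structural_validation(ctx):
    # Binary check: are all required sections present in the artifact?
    ctx["structure_ok"] = all(s in ctx["artifact"] for s in ctx["required_sections"])
    return ctx

def content_extraction(ctx):
    # Pull the evidence text for each criterion from the artifact sections.
    ctx["evidence"] = {c: ctx["artifact"].get(c, "") for c in ctx["criteria"]}
    return ctx

def criterion_scoring(ctx):
    # Placeholder rule for illustration: non-empty evidence scores 3, else 0.
    ctx["scores"] = {c: (3 if ev else 0) for c, ev in ctx["evidence"].items()}
    return ctx

PIPELINE = [structural_validation, content_extraction, criterion_scoring]

def evaluate(artifact, criteria, required_sections):
    ctx = {"artifact": artifact, "criteria": criteria,
           "required_sections": required_sections}
    for node in PIPELINE:  # linear here; independent nodes could run in parallel
        ctx = node(ctx)
    return ctx
```

Because each node consumes only declared inputs, an updated artifact can re-run just the affected nodes rather than the whole evaluation.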
518
+
519
+ ### 12.5 Evidence-anchoring protocol
520
+
521
+ Following the RULERS framework, every agent-generated score must include:
522
+
523
+ * **Citation**: a specific reference to the artifact (section, page, paragraph, diagram ID, table row)
524
+ * **Quotation**: the relevant text or description from the artifact that supports the score
525
+ * **Evidence class**: observed, inferred, or external
526
+ * **Sufficiency**: whether the evidence is sufficient, partial, or insufficient for the criterion
527
+ * **Confidence**: low, medium, or high, with explicit reasons for anything below high
528
+
529
+ Scores without evidence anchors must be flagged as unverified and excluded from automated status determination.
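A sketch of how the anchoring rule above could be enforced in code. The record type and function names are illustrative, not part of the standard.

```python
from dataclasses import dataclass

@dataclass
class ScoredCriterion:
    # Illustrative record mirroring the anchor fields listed above.
    criterion_id: str
    score: int
    citation: str = ""        # e.g. "Section 3.2, deployment diagram"
    quotation: str = ""
    evidence_class: str = ""  # observed | inferred | external
    sufficiency: str = ""     # sufficient | partial | insufficient
    confidence: str = ""      # low | medium | high

def is_anchored(r: ScoredCriterion) -> bool:
    return bool(r.citation and r.quotation
                and r.evidence_class in {"observed", "inferred", "external"}
                and r.sufficiency in {"sufficient", "partial", "insufficient"})

def split_verified(results):
    """Separate anchored scores from unverified ones, which must be flagged
    and excluded from automated status determination."""
    verified = [r for r in results if is_anchored(r)]
    unverified = [r for r in results if not is_anchored(r)]
    return verified, unverified
```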
530
+
531
+ ### 12.6 Rubric locking for agent evaluation
532
+
533
+ Rubrics used in agentic evaluation must be compiled into immutable, versioned specifications before use. This prevents interpretation drift across evaluation runs. Changes to rubric wording, examples, or scoring guidance require a new rubric version and recalibration.
534
+
535
+ The locked rubric specification should include the full criterion text, all level descriptors, all examples and anti-patterns, all gate rules, and the aggregation function. It must be stored in a machine-readable format (YAML or JSON) conforming to the rubric schema.
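One way to make a locked rubric verifiably immutable is to derive a content hash from a canonical serialisation of the spec; any change to wording, examples, or guidance then produces a different hash. A minimal sketch, assuming a JSON-serialisable spec fragment (the field names are illustrative):

```python
import hashlib
import json

def lock_rubric(spec: dict) -> dict:
    # Canonical serialisation: sorted keys, no whitespace, so the hash is
    # stable across dict orderings but changes on any content edit.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"rubric_id": spec["rubric_id"], "version": spec["version"],
            "lock_hash": digest}

spec_v1 = {"rubric_id": "EAROS-SOL-001", "version": "1.0.0",
           "criteria": [{"id": "STK-01", "text": "Stakeholders identified"}]}
lock_a = lock_rubric(spec_v1)
lock_b = lock_rubric(spec_v1)  # identical spec, identical lock hash
```

Storing the hash alongside the version number lets an evaluation record prove exactly which rubric wording it was scored against.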
536
+
537
+
538
+ ## 13. Calibration and quality control
539
+
540
+ ### 13.1 Why calibration matters
541
+
542
+ Without calibration, the same rubric will drift across teams, reviewers, and agents.
543
+
544
+ ### 13.2 Calibration method
545
+
546
+ For a new rubric or profile:
547
+
548
+ 1. select 10–20 representative artifacts
549
+ 2. score them independently with at least two reviewers or two review modes
550
+ 3. compare results
551
+ 4. identify ambiguous criteria and rewrite scoring guidance
552
+ 5. repeat until agreement is stable enough for operational use
553
+
554
+ ### 13.3 Agreement metrics
555
+
556
+ For ordinal scoring, weighted kappa is a useful agreement measure because it accounts for ordered disagreements rather than treating all disagreements equally. [R6]
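Weighted kappa can be computed without a statistics library. The sketch below implements the quadratic-weighted variant for two raters on a 0–4 ordinal scale; it is illustrative, not a normative EAROS algorithm.

```python
def quadratic_weighted_kappa(a, b, n_levels=5):
    """Quadratic-weighted Cohen's kappa for two raters on a 0..n_levels-1 scale."""
    assert len(a) == len(b) and a
    n = len(a)
    # Observed joint distribution of the two raters' scores.
    obs = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0
    # Marginal distributions give the chance-expected agreement.
    pa = [sum(1 for x in a if x == i) / n for i in range(n_levels)]
    pb = [sum(1 for y in b if y == i) / n for i in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            # Quadratic penalty: distant disagreements weigh more.
            w = (i - j) ** 2 / (n_levels - 1) ** 2
            num += w * obs[i][j] / n
            den += w * pa[i] * pb[j]
    return 1.0 - num / den
```

Perfect agreement yields 1.0; adjacent-level disagreements reduce the score far less than distant ones, which is exactly the behavior wanted for ordinal rubric scales.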
557
+
558
+ ### 13.4 What to calibrate
559
+
560
+ Calibrate:
561
+
562
+ * score distributions
563
+ * gate consistency
564
+ * treatment of N/A
565
+ * confidence usage
566
+ * evidence sufficiency judgments
567
+ * action recommendations
568
+
569
+ ### 13.5 Recalibration triggers
570
+
571
+ Recalibrate when:
572
+
573
+ * a profile changes materially
574
+ * a new overlay is introduced
575
+ * agreement drops
576
+ * new artifact formats appear
577
+ * agent behavior changes materially
578
+ * governance expectations change
579
+
580
+ ## 14. Profile model
581
+
582
+ Profiles are the main extension mechanism for different artifact types.
583
+
584
+ ### 14.1 What a profile is
585
+
586
+ A profile is a rubric extension for a specific artifact class. It inherits the core meta-rubric and adds, removes, constrains, or specializes criteria.
587
+
588
+ ### 14.2 What a profile is not
589
+
590
+ A profile is not:
591
+
592
+ * a totally independent scoring system
593
+ * a one-off project checklist
594
+ * a domain overlay masquerading as an artifact type
595
+ * a replacement for the core rubric
596
+
597
+ ### 14.3 Profile composition rules
598
+
599
+ A valid profile must:
600
+
601
+ * inherit the core scale and status model unless there is an approved exception
602
+ * map every added criterion to a dimension
603
+ * define applicability rules
604
+ * define evidence anchors
605
+ * define gate types explicitly
606
+ * explain any profile-specific weights
607
+ * include examples and anti-patterns
608
+ * include at least one calibration artifact before approval
609
+
610
+ ## 15. Methods for adding profiles
611
+
612
+ This is the part most teams get wrong. They add profiles by brainstorming criteria. That produces bloated and unstable rubrics. Profiles should be added through a controlled method.
613
+
614
+ ### Method A — Decision-centered profile design
615
+
616
+ Use this when the artifact exists to support a specific decision type.
617
+
618
+ **Best for:** ADRs, investment review artifacts, target-state decisions, exception requests
619
+
620
+ Steps:
621
+
622
+ 1. Identify the decision the artifact is supposed to support
623
+ 2. Identify the minimum questions decision-makers must answer
624
+ 3. Map those questions to the core dimensions
625
+ 4. Add only the profile criteria needed to make the decision responsibly
626
+ 5. Declare gates for mandatory questions
627
+ 6. Write examples of good and weak evidence
628
+ 7. Pilot on 3–5 real artifacts
629
+
630
+ **Description:**
631
+ This method keeps profiles lean and tightly linked to real governance choices. It is the best default method for architecture boards.
632
+
633
+ ### Method B — Viewpoint-centered profile design
634
+
635
+ Use this when the artifact is primarily a structured architecture description for a viewpoint or stakeholder group.
636
+
637
+ **Best for:** capability maps, business architecture views, platform reference architectures, integration landscape maps
638
+
639
+ Steps:
640
+
641
+ 1. Identify stakeholders and concerns
642
+ 2. Identify the viewpoint or modeling style used
643
+ 3. Define what good coverage looks like for that viewpoint
644
+ 4. Add profile criteria for completeness, modeling integrity, scope, and traceability
645
+ 5. Add profile anti-patterns typical of that view
646
+ 6. Validate against existing examples
647
+
648
+ **Description:**
649
+ This method is closest to the architecture-description logic behind ISO/IEC/IEEE 42010. It is useful when the question is less “Is this decision right?” and more “Is this architecture view fit for purpose?” [R1]
650
+
651
+ ### Method C — Lifecycle-centered profile design
652
+
653
+ Use this when the artifact sits at a particular stage of the architecture or delivery lifecycle.
654
+
655
+ **Best for:** current-state assessments, transition-state designs, roadmap artifacts, operational handover architecture
656
+
657
+ Steps:
658
+
659
+ 1. Place the artifact in the lifecycle
660
+ 2. Identify what downstream users need next
661
+ 3. Define criteria for readiness, sequencing, dependencies, and handoff quality
662
+ 4. Add stage-specific gates
663
+ 5. Add evidence rules that prove the artifact is actionable
664
+
665
+ **Description:**
666
+ This method prevents profiles from becoming abstract. It forces the profile to answer: what happens next if this artifact is accepted?
667
+
668
+ ### Method D — Risk-centered profile design
669
+
670
+ Use this when the artifact is only worth reviewing because it manages a class of risk.
671
+
672
+ **Best for:** security architecture, regulatory impact architecture, resilience architecture, critical data architecture
673
+
674
+ Steps:
675
+
676
+ 1. Identify the risk classes and failure modes
677
+ 2. Identify mandatory controls and review obligations
678
+ 3. Create hard gates for non-negotiables
679
+ 4. Add criteria for residual risk visibility, ownership, and tradeoffs
680
+ 5. Attach the resulting requirements as a profile or, more often, as an overlay
681
+
682
+ **Description:**
683
+ This method is often better implemented as an overlay rather than a pure artifact profile, because the same risk lens may need to apply to multiple artifact types.
684
+
685
+ ### Method E — Pattern-library profile design
686
+
687
+ Use this when many similar artifacts recur and you want a reusable quality pattern.
688
+
689
+ **Best for:** recurring reference architectures, recurring integration patterns, recurring platform service definitions
690
+
691
+ Steps:
692
+
693
+ 1. identify the recurring artifact family
694
+ 2. extract the recurring success criteria
695
+ 3. turn them into profile criteria with examples
696
+ 4. separate mandatory pattern integrity from optional optimization
697
+ 5. publish and calibrate as a reusable profile pack
698
+
699
+ **Description:**
700
+ This is the best method when you want architecture governance to scale.
701
+
702
+ ## 16. Profile creation workflow
703
+
704
+ Use this workflow whenever a new profile is proposed.
705
+
706
+ ### Step 1 — Create a profile proposal
707
+
708
+ The proposal must include:
709
+
710
+ * candidate profile name
711
+ * artifact type
712
+ * intended decisions supported
713
+ * stakeholders
714
+ * why the core rubric alone is insufficient
715
+ * candidate dimensions affected
716
+ * proposed criteria list
717
+ * estimated review frequency
718
+ * expected users
719
+
720
+ ### Step 2 — Classify the profile type
721
+
722
+ Classify whether the proposal is:
723
+
724
+ * an artifact profile
725
+ * a context overlay
726
+ * a project-specific checklist
727
+ * a one-off temporary review aid
728
+
729
+ Only the first two should normally become governed assets.
730
+
731
+ ### Step 3 — Choose the profile design method
732
+
733
+ Use one of Methods A–E above and record the chosen method explicitly.
734
+
735
+ ### Step 4 — Draft the profile
736
+
737
+ Draft:
738
+
739
+ * profile scope
740
+ * inherited dimensions
741
+ * added/specialized criteria
742
+ * gating rules
743
+ * evidence anchors
744
+ * anti-patterns
745
+ * examples
746
+ * weight rationale
747
+
748
+ ### Step 5 — Build the calibration pack
749
+
750
+ Select representative artifacts, including:
751
+
752
+ * at least one strong example
753
+ * at least one weak example
754
+ * at least one ambiguous example
755
+ * at least one incomplete example
756
+
757
+ ### Step 6 — Run a calibration round
758
+
759
+ Use at least:
760
+
761
+ * one experienced human reviewer
762
+ * one secondary reviewer or agentic reviewer
763
+
764
+ ### Step 7 — Revise
765
+
766
+ Tighten criteria that produce inconsistent judgments.
767
+
768
+ ### Step 8 — Approve and publish
769
+
770
+ Approval should include:
771
+
772
+ * owner assignment
773
+ * version number
774
+ * effective date
775
+ * next review date
776
+
777
+ ### Step 9 — Monitor in production
778
+
779
+ Track:
780
+
781
+ * usage frequency
782
+ * common failure criteria
783
+ * calibration drift
784
+ * waiver frequency
785
+ * reviewer feedback
786
+ * agent disagreement rate
787
+
788
+ ## 17. Profile design rules
789
+
790
+ A profile should normally add **no more than 12 specific criteria** (typically 5 to 12) beyond the core meta-rubric. If a team wants to add 20 more criteria, it usually means the profile is mixing concerns that should be overlays or guidance notes instead.
791
+
792
+ ### 17.1 Good profile behavior
793
+
794
+ A good profile:
795
+
796
+ * sharpens evaluation for a real artifact class
797
+ * improves decision quality
798
+ * reduces reviewer ambiguity
799
+ * adds explicit evidence requirements
800
+ * remains teachable and calibratable
801
+
802
+ ### 17.2 Bad profile behavior
803
+
804
+ A bad profile:
805
+
806
+ * duplicates the entire core rubric
807
+ * adds vague criteria like “high quality”
808
+ * mixes artifact type and domain policy without distinction
809
+ * creates project-specific trivia as enterprise standard
810
+ * has no examples
811
+ * cannot be applied consistently
812
+
813
+ ## 18. Overlays versus profiles
814
+
815
+ This distinction matters a lot.
816
+
817
+ ### Use a profile when:
818
+
819
+ * the artifact type itself has unique quality expectations
820
+
821
+ ### Use an overlay when:
822
+
823
+ * a cross-cutting concern must apply to many artifact types
824
+
825
+ Examples:
826
+
827
+ * **ADR** -> profile
828
+ * **Capability Map** -> profile
829
+ * **Security** -> overlay
830
+ * **Regulatory** -> overlay
831
+ * **Data retention** -> overlay
832
+ * **Cloud landing zone policy** -> overlay
833
+
834
+ A common failure mode is encoding security or regulatory expectations directly inside every profile. That creates duplication and drift. Keep cross-cutting requirements as overlays unless there is a compelling reason not to.
835
+
836
+ ## 19. Starter profile examples
837
+
838
+ ### 19.1 Capability Map Profile
839
+
840
+ Likely extra criteria:
841
+
842
+ * decomposition quality
843
+ * non-overlap / non-duplication
844
+ * stable capability naming
845
+ * ownership clarity
846
+ * business outcome linkage
847
+ * level consistency
848
+ * strategic relevance
849
+
850
+ Typical anti-patterns:
851
+
852
+ * organization chart masquerading as capability map
853
+ * mixed abstraction levels
854
+ * solution names in place of capabilities
855
+ * duplicated capabilities
856
+
857
+ ### 19.2 Solution Architecture Profile
858
+
859
+ Likely extra criteria:
860
+
861
+ * problem statement clarity
862
+ * option analysis
863
+ * quality attribute treatment
864
+ * integration and dependency treatment
865
+ * operational model coverage
866
+ * deployment and runtime assumptions
867
+ * non-functional requirement traceability
868
+
869
+ Typical anti-patterns:
870
+
871
+ * only one option shown
872
+ * deployment view missing
873
+ * target design contradicts constraints
874
+ * security delegated to “later”
875
+
876
+ ### 19.3 ADR Profile
877
+
878
+ Likely extra criteria:
879
+
880
+ * decision statement clarity
881
+ * options considered
882
+ * consequences
883
+ * tradeoff visibility
884
+ * reversibility
885
+ * trigger conditions for revisit
886
+ * traceability to broader architecture
887
+
888
+ Typical anti-patterns:
889
+
890
+ * decision already implemented before record exists
891
+ * only chosen option documented
892
+ * no consequences listed
893
+ * no context for future readers
894
+
895
+ ### 19.4 Roadmap Profile
896
+
897
+ Likely extra criteria:
898
+
899
+ * dependency realism
900
+ * sequencing logic
901
+ * transition-state clarity
902
+ * owner and funding linkage
903
+ * risk and contingency treatment
904
+ * measurable milestones
905
+
906
+ Typical anti-patterns:
907
+
908
+ * date list without dependencies
909
+ * no transition architecture
910
+ * no ownership
911
+ * roadmap not connected to target state
912
+
913
+ ## 20. Governance model
914
+
915
+ ### 20.1 Roles
916
+
917
+ #### Rubric owner
918
+
919
+ Owns content, lifecycle, versioning, and quality of a rubric or profile.
920
+
921
+ #### Review authority
922
+
923
+ Approves major rubric changes and profile publication.
924
+
925
+ #### Evaluator
926
+
927
+ Applies the rubric.
928
+
929
+ #### Challenger
930
+
931
+ Reviews evaluation quality and disputes weak reasoning.
932
+
933
+ #### Calibration lead
934
+
935
+ Owns benchmark examples and agreement monitoring.
936
+
937
+ #### Agent steward
938
+
939
+ Owns prompts, agent workflows, schemas, and automation controls.
940
+
941
+ ### 20.2 Change classes
942
+
943
+ * **Patch change** — wording clarification, typo, example addition
944
+ * **Minor change** — added criterion guidance, anti-pattern updates, non-breaking applicability updates
945
+ * **Major change** — new criterion, changed scale, changed gates, changed weights, changed status rules
946
+
947
+ Major changes require recalibration.
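The change classes map naturally onto semantic-version bumps. A minimal sketch of that mapping; the policy table restates the list above, while the MAJOR.MINOR.PATCH version format is an assumption.

```python
# Change-class policy: which version component to bump, and whether the
# change forces recalibration (major changes always do).
CHANGE_POLICY = {
    "patch": {"bump": "patch", "recalibrate": False},
    "minor": {"bump": "minor", "recalibrate": False},
    "major": {"bump": "major", "recalibrate": True},
}

def next_version(version: str, change_class: str) -> str:
    major, minor, patch = (int(p) for p in version.split("."))
    bump = CHANGE_POLICY[change_class]["bump"]
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```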
948
+
949
+ ## 21. Metrics for rubric performance
950
+
951
+ Do not only measure artifact scores. Measure rubric performance itself.
952
+
953
+ Recommended operational metrics:
954
+
955
+ * rubric usage count
956
+ * profile usage count
957
+ * pass / conditional / rework / reject distribution
958
+ * critical gate failure frequency
959
+ * average evidence sufficiency
960
+ * reviewer agreement rate
961
+ * agent-human agreement rate
962
+ * waiver rate
963
+ * time to evaluate
964
+ * action closure rate
965
+ * defect escape correlation if available
966
+ * profile drift incidents
967
+ * stale rubric count
968
+
969
+ These metrics help determine whether the rubric system is actually improving architecture governance.
970
+
971
+ ## 22. Recommended repository structure
972
+
973
    architecture-rubrics/
      standard/
        EAROS.md
        rubric.schema.json
        evaluation.schema.json
        profile.schema.json
        overlay.schema.json
      core/
        core-meta-rubric.yaml
        scoring-guidance.md
      profiles/
        capability-map.yaml
        solution-architecture.yaml
        adr.yaml
        roadmap.yaml
      overlays/
        security.yaml
        regulatory.yaml
        data-governance.yaml
        cloud.yaml
      calibration/
        capability-map/
        solution-architecture/
        adr/
      evaluations/
        2026/
          ART-001/
            artifact/
            result.json
            report.md
            decision.md
1004
+
1005
+ ## 23. Machine-readable templates
1006
+
1007
+ ### 23.1 Rubric template
1008
+
1009
    rubric_id: EAROS-SOL-001
    title: Solution Architecture Profile
    version: 1.0.0
    status: approved
    owner: enterprise-architecture
    artifact_type: solution_architecture
    review_purpose: decision_review
    intended_decision_type:
      - architecture_board_review
      - funding_gate
    applicable_stakeholders:
      - architecture_board
      - engineering
      - security
      - operations
    scale_definition:
      type: ordinal_0_4_plus_na
    status_rules:
      pass:
        min_weighted_score: 3.2
        no_critical_gate_failures: true
    dimensions:
      - id: stakeholder_fit
        name: Stakeholder and purpose fit
        criteria:
          - criterion_id: STK-01
            name: Stakeholders, concerns, viewpoints
            question: Does the artifact explicitly identify stakeholders, concerns, and viewpoints?
            description: Checks whether the artifact is anchored in named stakeholders and their concerns.
            metric_type: ordinal
            score_scale: [0, 1, 2, 3, 4, "N/A"]
            weight: 1.0
            gate_type: major
            required_evidence:
              - stakeholder_list
              - concern_list
              - viewpoint_mapping
            scoring_guidance:
              "0": absent or contradicted
              "1": implied only
              "2": explicit but incomplete
              "3": explicit and mostly complete
              "4": explicit, complete, and consistently used
            anti_patterns:
              - generic audience only
              - no concern-to-view mapping
            dependencies: []
            applicability_rule: always
            rationale: Architecture review should be concern-driven.
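The status rules and gate types in the template above can be operationalised with a small scoring function. A minimal sketch under stated assumptions: treating a critical-gated criterion scored 1 or below as a gate failure, and using 2.5 as an illustrative conditional-pass boundary; only the 3.2 pass threshold comes from the template.

```python
def determine_status(criterion_results, min_weighted_score=3.2):
    """Apply gate logic and threshold rules to criterion results.

    Each result is a dict with score (0-4, or None for N/A), weight, and
    gate_type ("critical", "major", or "none"). Status names are illustrative.
    """
    scored = [r for r in criterion_results if r["score"] is not None]  # drop N/A
    total_weight = sum(r["weight"] for r in scored)
    weighted = sum(r["score"] * r["weight"] for r in scored) / total_weight
    # A critical gate failure overrides the weighted score entirely
    # (assumption: score <= 1 on a critical-gated criterion fails the gate).
    if any(r["gate_type"] == "critical" and r["score"] <= 1 for r in scored):
        return "reject", weighted
    if weighted >= min_weighted_score:
        return "pass", weighted
    # 2.5 is an invented illustrative boundary between rework and conditional pass.
    return ("conditional_pass", weighted) if weighted >= 2.5 else ("rework", weighted)
```

Keeping the gate check before the threshold check guarantees that hidden gates can never be averaged away, which is one of the anti-patterns this standard warns against.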
1064
+
1065
+ ### 23.2 Evaluation template
1066
+
1067
    artifact_id: ART-2026-0042
    artifact_type: solution_architecture
    rubric_id: EAROS-SOL-001
    rubric_version: 1.0.0
    evaluated_by:
      - role: evaluator
        actor: agent
      - role: challenger
        actor: human
    evaluation_mode: hybrid
    evaluation_date: 2026-03-16
    criterion_results:
      - criterion_id: STK-01
        score: 2
        confidence: medium
        evidence_sufficiency: partial
        evidence_refs:
          - section: Audience
            page: 3
        evidence_class: observed
        rationale: Stakeholders are listed but concerns are not systematically mapped to views.
        evidence_gaps:
          - No explicit viewpoint model
        recommended_actions:
          - Add stakeholder-concern-view matrix
    dimension_results:
      - dimension_id: stakeholder_fit
        weighted_score: 2.0
        gate_failures: []
    overall_status: conditional_pass
    overall_score: 2.8
    confidence: medium
    recommended_actions:
      - Add viewpoint mapping
      - Clarify decision scope
    decision_summary: Usable with rework before final board review.
1110
+
1111
+ ## 24. Suggested minimum documentation for every new profile
1112
+
1113
+ Before a new profile is considered operational, it should have:
1114
+
1115
+ 1. the profile YAML/JSON definition
1116
+ 2. a profile guide in prose
1117
+ 3. at least 3 worked examples
1118
+ 4. at least 1 anti-example
1119
+ 5. a calibration record
1120
+ 6. an owner
1121
+ 7. a next review date
1122
+
1123
+ If one of those is missing, the profile is not yet mature enough for enterprise use.
1124
+
1125
+ ## 25. Recommended initial rollout
1126
+
1127
+ Do not start with ten profiles. Start with a controlled seed set.
1128
+
1129
+ Recommended first set:
1130
+
1131
+ * Core Meta-Rubric
1132
+ * Capability Map Profile
1133
+ * Solution Architecture Profile
1134
+ * ADR Profile
1135
+ * Security Overlay
1136
+ * Regulatory Overlay
1137
+
1138
+ Run these on real artifacts for 6–8 weeks, then tune before wider rollout.
1139
+
1140
+ ## 26. Anti-patterns to avoid
1141
+
1142
+ * one mega-rubric for every artifact
1143
+ * criteria with no evidence anchors
1144
+ * too many weighted criteria
1145
+ * hidden gates inside averages
1146
+ * policy and artifact-type concerns mixed without structure
1147
+ * no distinction between missing evidence and low quality
1148
+ * no calibration
1149
+ * profile sprawl
1150
+ * agent-only scoring with no challenger
1151
+ * no versioning of rubrics
1152
+ * reviewing artifacts without declaring purpose and decision context
1153
+
1154
+ ## 27. Practical review workflow
1155
+
1156
+ A practical enterprise workflow can be:
1157
+
1158
+ 1. select artifact type
1159
+ 2. resolve applicable profile
1160
+ 3. resolve applicable overlays
1161
+ 4. assemble composed rubric
1162
+ 5. extract evidence
1163
+ 6. apply scoring
1164
+ 7. challenge scoring
1165
+ 8. finalize status and actions
1166
+ 9. store machine-readable evaluation
1167
+ 10. store human-readable report
1168
+ 11. capture decision and waivers
1169
+ 12. feed results back into calibration
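Steps 1–4 of this workflow amount to resolving a composed rubric from registries of profiles and overlays. A minimal sketch with invented registry contents and criterion IDs:

```python
# Illustrative registries: core criteria plus per-profile and per-overlay additions.
CORE = ["CTX-01", "EVD-01"]
PROFILES = {"adr": ["DEC-01", "OPT-01"],
            "solution_architecture": ["STK-01", "QA-01"]}
OVERLAYS = {"security": ["SEC-01"], "regulatory": ["REG-01"]}

def compose_rubric(artifact_type, overlay_names):
    """Steps 1-4: resolve profile and overlays, assemble the composed rubric."""
    if artifact_type not in PROFILES:
        raise KeyError(f"no profile registered for artifact type: {artifact_type}")
    criteria = list(CORE) + PROFILES[artifact_type]
    for name in overlay_names:
        criteria += OVERLAYS[name]
    # De-duplicate while preserving order, in case an overlay repeats a criterion.
    seen, composed = set(), []
    for c in criteria:
        if c not in seen:
            seen.add(c)
            composed.append(c)
    return composed
```

Keeping overlays out of the profile definitions and composing them here at review time is what prevents the duplication-and-drift failure mode described in section 18.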
1170
+
1171
+ ## 28. Decision rules for choosing whether to add a new profile
1172
+
1173
+ Before approving a new profile, ask:
1174
+
1175
+ * Does this artifact type recur enough to justify standardization?
1176
+ * Is the core meta-rubric insufficient by itself?
1177
+ * Are the proposed additions stable across teams and time?
1178
+ * Would an overlay solve the problem better?
1179
+ * Can the profile be calibrated with real examples?
1180
+ * Does the profile improve decision quality enough to justify the governance overhead?
1181
+
1182
+ If the answer to several of these is “no”, do not add the profile.
1183
+
1184
+ ## 29. Summary standard statement
1185
+
1186
+ The enterprise standard should be:
1187
+
1188
+ All architecture rubrics shall be managed as governed, versioned, evidence-based evaluation assets composed from a common core meta-rubric, artifact-specific profiles, and context overlays. All scoring shall be explainable, evidence-linked, calibrated, and suitable for both human and agent-assisted application.
1189
+
1190
+ ## 30. Recommended next deliverables (updated for v2.0)
1191
+
1192
+ To operationalize this standard, create these deliverables next:
1193
+
1194
+ 1. core-meta-rubric.yaml
1195
+ 2. solution-architecture.yaml
1196
+ 3. adr.yaml
1197
+ 4. capability-map.yaml
1198
+ 5. security.yaml
1199
+ 6. evaluation.schema.json
1200
+ 7. calibration-pack.md
1201
+ 8. review-report-template.md
1202
+
1203
+ ## 31. Artifact format requirements for AI assessability
1204
+
1205
+ This section defines requirements for architecture artifact structure and format that enable reliable automated assessment.
1206
+
1207
+ ### 31.1 Machine-readable metadata
1208
+
1209
+ Every architecture artifact should include a structured metadata block. The recommended format is YAML frontmatter at the beginning of the document:
1210
+
1211
+ * document\_type: the artifact type (e.g., solution\_architecture, capability\_map, adr)
1212
+ * version: semantic version of the artifact
1213
+ * status: draft, review, approved, deprecated
1214
+ * author: responsible person
1215
+ * last\_modified: ISO 8601 date
1216
+ * classification: information classification level
1217
+ * domain: business or technical domain
1218
+ * systems: list of systems in scope
1219
+ * standards\_compliance: list of applicable standards
1220
+ * quality\_attributes: key quality targets with measurable values
1221
+ * related\_artifacts: typed references to connected artifacts
1222
+ * assessment\_rubric: rubric identifier to apply
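A minimal sketch of checking such a metadata block. To stay dependency-free it parses only flat `key: value` lines rather than full YAML, and the required-key set shown is a subset of the fields listed above:

```python
import re

# Illustrative minimum: a real policy would require more of the fields above.
REQUIRED_KEYS = {"document_type", "version", "status", "author", "last_modified"}

def parse_frontmatter(text):
    """Return (metadata, missing_keys); metadata is None if no frontmatter block."""
    m = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not m:
        return None, REQUIRED_KEYS
    meta = {}
    for line in m.group(1).splitlines():
        # Flat "key: value" pairs only; list items are skipped in this sketch.
        if ":" in line and not line.lstrip().startswith("-"):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, REQUIRED_KEYS - meta.keys()

doc = """---
document_type: adr
version: 1.2.0
status: review
author: jane.doe
last_modified: 2026-03-16
---
# Decision
"""
```

Running this as a pre-check corresponds to the structural-validation node of the DAG flow: an artifact with missing metadata fails before any criterion is scored.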
1223
+
1224
+ ### 31.2 Structured templates
1225
+
1226
+ Architecture documents should follow a consistent template structure to enable section-level extraction by AI agents. Recommended approaches include:
1227
+
1228
+ * **arc42** for software and solution architecture (12 standardised sections)
1229
+ * **MADR** (Markdown Architectural Decision Records) for decisions
1230
+ * **Decision Reasoning Format (DRF)** for machine-readable decision records in YAML/JSON
1231
+
1232
+ Custom templates are acceptable provided they include mandatory section markers and a documented section catalog.
1233
+
1234
+ ### 31.3 Diagram formats
1235
+
1236
+ Architecture diagrams should be stored in machine-readable formats wherever possible:
1237
+
1238
+ * ArchiMate Model Exchange File Format (XML/XSD) for enterprise architecture models
1239
+ * Structurizr DSL or PlantUML for C4 model diagrams
1240
+ * Mermaid or D2 for lightweight diagram-as-code
1241
+ * Dual representation: machine-readable model as authoritative source, rendered diagram as visual aid
1242
+
1243
+ Image-only diagrams should be accompanied by a structured element catalog listing components, relationships, and technology annotations.
1244
+
1245
+ ### 31.4 Section markers and tagging
1246
+
1247
+ Use semantic section markers to enable precise extraction by AI agents:
1248
+
1249
+ * Consistent heading hierarchy matching the template specification
1250
+ * Structured tables for traceability matrices and quality attribute scenarios
1251
+ * Explicit assumption, constraint, and risk markers
1252
+ * Embedded quality attribute scenarios in parseable format
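Section-level extraction against a consistent heading hierarchy can be sketched in a few lines. This illustrative version keys sections by heading text and ignores any preamble before the first heading:

```python
import re

def extract_sections(markdown):
    """Split a markdown artifact into {heading_text: body} for section-level scoring."""
    sections, current, buf = {}, None, []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            if current is not None:
                sections[current] = "\n".join(buf).strip()
            current, buf = m.group(2).strip(), []
        elif current is not None:
            buf.append(line)
    if current is not None:
        sections[current] = "\n".join(buf).strip()
    return sections
```

With sections keyed this way, a criterion scorer can be pointed at exactly the section its evidence anchor names rather than the whole document.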
1253
+
1254
+ ### 31.5 ArchiMate exchange format
1255
+
1256
+ For organisations using ArchiMate, the Open Group Model Exchange File Format provides a validated standard for machine-readable architecture models. The format supports three schema types (model exchange, view exchange, diagram exchange) and includes Dublin Core metadata. This format has been mandatory for certified ArchiMate tools since June 2018 and is directly amenable to automated validation and assessment.
1257
+
1258
+ ## 32. Rubric design principles for AI agent application
1259
+
1260
+ This section codifies empirically grounded principles for designing rubrics that AI agents can reliably apply.
1261
+
1262
+ ### 32.1 Decompose into atomic criteria
1263
+
1264
+ Break holistic quality assessments into the smallest possible independent dimensions. Each criterion should evaluate exactly one aspect. Research shows LLM alignment with human judges improved from 37.3% to 93.95% when rubrics were provided, with further improvements from atomic decomposition.
1265
+
1266
+ ### 32.2 Use analytic rather than holistic rubrics
1267
+
1268
+ Analytic rubrics (separate scores per dimension) consistently outperform holistic rubrics (single overall score) for AI-based evaluation. This is consistent with the EAROS core principle of separating observation from inference.
1269
+
1270
+ ### 32.3 Define precise level descriptors
1271
+
1272
+ Every point on every scale needs a written definition with concrete description of what that level looks like for the specific criterion. Without this, LLMs exhibit significant scoring drift across evaluation runs.
1273
+
1274
+ ### 32.4 Include positive and negative examples
1275
+
1276
+ Anchor criteria with exemplars of good and poor performance. Few-shot examples reduce ambiguity and improve consistency, particularly for subjective criteria. The EAROS anti-pattern fields in criterion definitions serve this purpose and should be populated for all criteria.
1277
+
1278
+ ### 32.5 Use decision trees for disambiguation
1279
+
1280
+ Structure ambiguous criteria as if/then decision trees rather than open-ended questions. This reduces the interpretive load on the agent and produces more consistent scoring.
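As an illustration, a hypothetical "architecture views coverage" criterion expressed as a decision tree. The branch thresholds follow the example pattern in the glossary; the fallback score for four or more views without a data-flow narrative is an assumption:

```python
def score_views_coverage(view_count: int, has_data_flow_narrative: bool) -> int:
    # Hypothetical criterion: count observable features, then branch.
    if view_count < 2:
        return 1 if view_count == 1 else 0  # fewer than 2 views: score 0-1
    if view_count <= 3:
        return 2                            # 2-3 views: score 2
    # 4+ views: score 3 only with a data flow narrative (fallback 2 is assumed)
    return 3 if has_data_flow_narrative else 2
```

Because every branch tests an observable feature count rather than asking for a judgment, two agents (or an agent and a human) given the same inputs must produce the same score.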
1281
+
1282
+ ### 32.6 Separate assessment from aggregation
1283
+
1284
+ Score each dimension independently using dedicated prompts, then combine scores using a defined aggregation function. The multi-criteria decomposition approach produces more reliable and explainable results than end-to-end holistic scoring.
1285
+
1286
+ ### 32.7 Design for calibration
1287
+
1288
+ Every rubric should be designed with calibration in mind from the start. This means including boundary-case examples, documenting expected score distributions, and defining the inter-rater reliability targets that the rubric must meet before operational deployment.
1289
+
1290
+
1291
+ ## 34. Glossary
1292
+
1293
+ This glossary defines terms used throughout the EAROS standard. It is organized into three categories: statistical and calibration terms, EAROS-specific concepts, and architecture terms as used in this standard.
1294
+
1295
+ ---
1296
+
1297
+ ### 34.1 Statistical and Calibration Terms
1298
+
1299
+ **Calibration**
1300
+ The process of adjusting rubric criteria, level descriptors, and scoring guidance until different raters (human or AI) produce consistent results on the same artifacts. Calibration is considered complete when inter-rater reliability reaches the target threshold (κ > 0.70 for well-defined criteria). Calibration is a governance requirement in EAROS — no rubric may be used in a live review process without prior calibration.
1301
+
1302
+ **Cohen's kappa (κ)**
1303
+ A statistical measure of inter-rater reliability that corrects for chance agreement. Values range from −1 to 1: 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. EAROS targets: κ > 0.70 for well-defined criteria, κ > 0.50 for criteria requiring subjective judgment. Named after Jacob Cohen (1960). See also: *weighted kappa*.
1304
+
1305
+ **Intraclass Correlation Coefficient (ICC)**
1306
+ A reliability measure for ordinal or continuous ratings when more than two raters assess the same artifact. Unlike Cohen's kappa (pairwise), ICC captures consistency across the full rater panel. Preferred when a calibration exercise involves three or more reviewers.
1307
+
1308
+ **Inter-rater reliability**
1309
+ The degree to which different raters — human reviewers or AI agents — produce consistent scores when independently evaluating the same artifact against the same rubric criteria. High inter-rater reliability indicates the rubric is unambiguous and the scoring guidance is effective. Measured using Cohen's κ, weighted κ, or ICC depending on the evaluation context.
1310
+
1311
+ **Spearman's rho (ρ)**
1312
+ A rank correlation coefficient measuring how well two sets of rankings agree. Unlike Pearson correlation, it captures monotonic rather than strictly linear association and is robust to outliers. EAROS target: ρ > 0.80 for overall assessment correlation between human and agent evaluators.
1313
+
1314
+ **Wasserstein distance**
1315
+ A metric that measures the difference between two probability distributions — specifically, the minimum "work" needed to transform one distribution into the other (also called the Earth Mover's Distance). Used in the RULERS calibration step to align AI agent score distributions with human reviewer distributions, correcting for systematic over- or under-scoring biases across the 0–4 scale.
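For two score histograms on an ordered discrete scale, the Wasserstein-1 distance reduces to the area between the two cumulative distributions. A minimal dependency-free sketch for the 0–4 scale (unit spacing between levels is assumed):

```python
def wasserstein_1d(p, q):
    """Wasserstein-1 (Earth Mover's) distance between two score histograms
    over the same ordered levels, e.g. counts of scores 0..4."""
    assert len(p) == len(q)
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]  # normalise counts to probabilities
    q = [x / sq for x in q]
    dist = cdf_p = cdf_q = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        dist += abs(cdf_p - cdf_q)  # level spacing is 1
    return dist
```

Shifting all probability mass by one score level yields a distance of exactly 1, which is why the metric is well suited to detecting a rater's systematic over- or under-scoring bias.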
1316
+
1317
+ **Weighted kappa**
1318
+ A variant of Cohen's kappa that treats disagreements between adjacent score levels (e.g., 2 vs. 3) as less severe than disagreements between distant levels (e.g., 1 vs. 4). More appropriate than unweighted kappa for ordinal scales where partial agreement is meaningful. Quadratic-weighted kappa (QWK) penalizes larger disagreements proportionally to the square of the distance. EAROS uses weighted kappa as its primary inter-rater reliability metric.

---

### 34.2 EAROS-Specific Terms

**Challenge pass**
Step 6 of the DAG evaluation flow. A deliberate self-review in which the evaluator (human or agent) challenges their own highest and lowest scores, checking for weak evidence, over-scoring relative to the level descriptors, under-cited evidence, or evidence class errors. The challenge pass must be recorded in the evaluation record. It is not optional — skipping it invalidates the evaluation.

**Core meta-rubric**
The universal foundation rubric (`core/core-meta-rubric.yaml`, rubric ID `EAROS-CORE-002`) that defines nine dimensions and ten criteria applying to every architecture artifact regardless of type. All profiles inherit from it. It establishes the scoring model, gate model, and output requirements that govern all EAROS evaluations.

**DAG evaluation flow**
The eight-step Directed Acyclic Graph that structures agentic (and recommended human) evaluation: `structural_validation → content_extraction → criterion_scoring → cross_reference_validation → dimension_aggregation → challenge_pass → calibration → status_determination`. The DAG is sequential and must not be reordered. Each step produces outputs consumed by the next.
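
A minimal sketch of how the fixed ordering might be enforced in code. The step names come from the flow above; the handler mechanics and context dictionary are purely illustrative, not the EAROS runtime:

```python
# Step names are taken from the DAG evaluation flow; handlers are
# hypothetical callables supplied by the evaluator implementation.
DAG_STEPS = [
    "structural_validation",
    "content_extraction",
    "criterion_scoring",
    "cross_reference_validation",
    "dimension_aggregation",
    "challenge_pass",
    "calibration",
    "status_determination",
]

def run_evaluation(artifact, handlers):
    """Run the handlers strictly in DAG order; each step reads the
    accumulated context, so later steps consume earlier outputs."""
    context = {"artifact": artifact}
    for step in DAG_STEPS:
        context[step] = handlers[step](context)
    return context

# Demo: no-op handlers that just record the order they were invoked in.
order = []
handlers = {s: (lambda ctx, s=s: order.append(s) or s) for s in DAG_STEPS}
result = run_evaluation("adr-042.md", handlers)
print(order == DAG_STEPS)  # → True
```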

**Decision tree**
An IF/THEN disambiguation logic block attached to each rubric criterion to help evaluators (especially AI agents) resolve ambiguous cases consistently. Decision trees reduce interpretive load by converting open-ended judgment into observable feature counts and branches. Example pattern: *"Count distinct views: IF < 2 THEN score 0-1. IF 2-3 views THEN score 2. IF 4+ views AND data flow narrative exists THEN score 3."*
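
The quoted pattern could be mechanized roughly as follows. This is a hypothetical sketch; in particular, capping the 4+-views-without-narrative case at 2 is an assumption, since the quoted rule does not specify that branch:

```python
def score_viewpoint_coverage(view_count, has_data_flow_narrative):
    """Hypothetical mechanization of the example decision tree:
    observable feature counts branch into a 0-4 criterion score."""
    if view_count < 2:
        return 1 if view_count == 1 else 0      # "IF < 2 THEN score 0-1"
    if view_count <= 3:
        return 2                                # "IF 2-3 views THEN score 2"
    # "IF 4+ views AND data flow narrative exists THEN score 3";
    # without the narrative we cap at 2 (an assumption).
    return 3 if has_data_flow_narrative else 2

print(score_viewpoint_coverage(4, True))  # → 3
```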

**Evidence anchor**
A specific, traceable reference to content in the artifact that supports a score assignment — for example, a section heading, page number, paragraph ID, or diagram label (e.g., "Section 3.2, deployment diagram on p. 8"). Required by the RULERS protocol for every score. Evidence anchors enable human reviewers to verify, challenge, or override any agent judgment.

**Evidence class**
A mandatory classification of the type of evidence supporting a score:
- `observed` — directly stated or shown in the artifact (highest credibility)
- `inferred` — a reasonable interpretation not explicitly stated
- `external` — judgment based on a standard, policy, or source outside the artifact (lowest credibility, must be declared)

Evidence class must be recorded for every criterion score in an evaluation record.
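
A hedged sketch of how a criterion score record might enforce both requirements, a traceable anchor and a declared class. Field names and validation logic are illustrative, not the EAROS schema:

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical record shape; field names are illustrative only.
EvidenceClass = Literal["observed", "inferred", "external"]

@dataclass(frozen=True)
class CriterionScore:
    criterion_id: str
    score: int                    # 0-4 scale
    evidence_anchor: str          # e.g. "Section 3.2, deployment diagram on p. 8"
    evidence_class: EvidenceClass

    def __post_init__(self):
        if not 0 <= self.score <= 4:
            raise ValueError("score must be on the 0-4 scale")
        if not self.evidence_anchor.strip():
            raise ValueError("every score needs a traceable evidence anchor")
        if self.evidence_class not in ("observed", "inferred", "external"):
            raise ValueError("evidence class must be one of the three classes")

s = CriterionScore("CORE-03", 3, "Section 3.2, deployment diagram on p. 8", "observed")
```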

**Gate**
A criterion-level control that can block a passing status regardless of the overall weighted average. Gates prevent weak scores on critical criteria from being diluted by strong scores elsewhere. Four gate types exist: `none` (no gate), `advisory` (recommendation triggered), `major` (status capped at `conditional_pass`), and `critical` (failure forces `reject` status). Gates are checked before averages are computed.
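
The gate-before-average ordering might look like this in code. This is a hypothetical sketch: the failure threshold and pass mark are assumed values, not EAROS constants.

```python
# A gated criterion "fails" below this score (assumption, not an EAROS constant).
FAIL_THRESHOLD = 2

def determine_status(criterion_scores, weighted_average, pass_mark=2.5):
    """criterion_scores: iterable of (score, gate_type) pairs, where
    gate_type is one of 'none', 'advisory', 'major', 'critical'.
    Gates are checked before the weighted average is consulted."""
    capped = False
    for score, gate in criterion_scores:
        if gate == "none" or score >= FAIL_THRESHOLD:
            continue
        if gate == "critical":
            return "reject"        # critical gate failure forces reject
        if gate == "major":
            capped = True          # status capped at conditional_pass
        # 'advisory' only triggers a recommendation; it never blocks.
    if weighted_average < pass_mark:
        return "reject"
    return "conditional_pass" if capped else "pass"

# A single critical-gate failure overrides a strong overall average:
print(determine_status([(1, "critical"), (4, "none")], 3.8))  # → reject
```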

**Overlay**
A cross-cutting concern extension that appends additional criteria on top of any core+profile combination. Overlays use `scoring.method: append_to_base_rubric` and `artifact_type: any` — they are not tied to a specific artifact type and are applied based on context (e.g., apply the security overlay whenever the artifact touches authentication or personal data). Overlays are additive: they cannot weaken or remove gates from the base rubric.

**Profile**
An artifact-type-specific extension of the core meta-rubric. Profiles add 5–12 criteria for a particular artifact type (e.g., reference architecture adds 9 criteria across 6 dimensions). Every profile declares `inherits: [EAROS-CORE-002]` and is evaluated together with the core. The reference architecture profile (`EAROS-REFARCH-001`) is the reference implementation for how profiles should be constructed.

**RULERS protocol**
**R**ubric **U**nification, **L**ocking, and **E**vidence-anchored **R**obust **S**coring. A framework for preventing LLM scoring drift through three mechanisms: (1) locked rubrics compiled into immutable specifications before evaluation, (2) mandatory evidence citation (evidence anchor per criterion), and (3) Wasserstein-based calibration to align agent score distributions with human reviewer distributions. Defined in arXiv 2601.08654 (January 2026) [R8]. All EAROS agent evaluations implement the RULERS protocol.

**Rubric locking**
The practice of compiling rubrics into fixed, versioned specifications before evaluation begins and preventing any modification during an evaluation run (`rubric_locked: true` in agent metadata). Rubric locking prevents interpretation drift — the tendency for LLM judges to subtly reinterpret criteria across evaluation runs, undermining reproducibility. Changes to a locked rubric require a version bump and governance approval.

---

### 34.3 Architecture Terms (as used in EAROS)

**Architecture artifact**
Any document, model, diagram, or structured record that describes, decides, or governs aspects of an enterprise architecture. Examples: solution architecture document, Architecture Decision Record, capability map, reference architecture, technology roadmap. Each artifact type has a matching EAROS profile that extends the core rubric for that artifact's specific evidence requirements.

**Architecture Decision Record (ADR)**
A document that captures a single architectural decision along with its context, the options considered, the decision made, the rationale, the consequences, and triggers for revisiting the decision. ADRs are a first-class artifact type in EAROS with their own profile (`profiles/adr.yaml`).

**Concern**
A specific interest, requirement, or question that a stakeholder has about the architecture — for example, "Can the system handle 10× traffic growth?" or "How does data leave the trust boundary?" Concerns are the fundamental driver of architecture evaluation in EAROS (Principle 1: concern-driven, not document-driven). A rubric criterion is essentially a concern formalized into a scorable question.

**Fitness function**
An automated test, check, or metric that validates whether an architecture continues to meet a specific quality attribute target. Used in CI/CD pipelines for continuous architecture validation — for example, a latency test that fails if P99 response time exceeds 200ms. Fitness functions make quality attributes measurable and enforceable rather than aspirational.
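
The latency example above can be expressed as a small executable check. This is an illustrative sketch using a nearest-rank P99, not a prescribed EAROS fitness function; the sample values are invented.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of observed latencies."""
    ordered = sorted(samples_ms)
    rank = max(0, round(0.99 * len(ordered)) - 1)
    return ordered[rank]

def latency_fitness(samples_ms, budget_ms=200):
    """The fitness function: fail if P99 response time exceeds the budget."""
    return p99(samples_ms) <= budget_ms

# Illustrative measurements; in CI these would come from a load-test run,
# and a False result would fail the pipeline stage.
samples = [120, 135, 140, 150, 155, 160, 170, 180, 190, 195]
print(latency_fitness(samples))  # → True
```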

**Golden path**
An opinionated, fully supported implementation path for a reference architecture pattern. A golden path reduces setup time, ensures teams follow proven patterns, and provides pre-built integrations for the most common use cases. The EAROS reference architecture profile (`EAROS-REFARCH-001`) includes a dedicated criterion for golden path provision.

**Quality attribute**
A measurable characteristic of a system or architecture that is distinct from its functional behaviour — for example, availability (target: 99.95%), latency (P99 < 200ms), or throughput (10,000 req/s). In EAROS, quality attributes must be quantified and scoped, not expressed as adjectives. "The system is highly available" does not constitute a quality attribute statement; "the system achieves 99.95% availability measured monthly" does.

**Quality attribute scenario**
A structured description of a quality requirement using a six-part format: source (who or what generates the stimulus), stimulus (the event), artifact (the system element affected), environment (operating conditions), response (system behaviour), and response measure (quantified target). Derived from the TOGAF and SEI ATAM frameworks. Quality attribute scenarios are the required format for quality attribute claims in EAROS-evaluated artifacts.
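
The six-part format might be captured as a simple record. This is an illustrative sketch: the field names mirror the parts listed above, and the values are an invented example scenario, not taken from any EAROS artifact.

```python
from dataclasses import dataclass

@dataclass
class QualityAttributeScenario:
    source: str            # who or what generates the stimulus
    stimulus: str          # the event
    artifact: str          # the system element affected
    environment: str       # operating conditions
    response: str          # system behaviour
    response_measure: str  # quantified target

# Invented example values for a throughput/latency scenario.
scenario = QualityAttributeScenario(
    source="external client",
    stimulus="burst of 10,000 req/s",
    artifact="order API gateway",
    environment="normal operation",
    response="requests served without shedding",
    response_measure="P99 latency < 200ms",
)
```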

**Viewpoint**
A perspective from which an architecture is examined, defined by the concerns of a stakeholder group. Common viewpoints include: context (system boundary and external actors), functional (component decomposition), deployment (infrastructure and topology), data flow (information movement), and security (trust boundaries and controls). The EAROS reference architecture profile requires at least four viewpoints for a passing score.

---

## References

* **[R1] ISO/IEC/IEEE 42010 architecture description overview** — ISO Online Browsing Platform. Describes architecture description in terms of stakeholders, concerns, viewpoints, and model kinds.
  https://www.iso.org/obp/ui/en/
* **[R2] Architecture compliance review overview** — The Open Group, “IT Architecture Compliance.” Describes architecture compliance review as scrutiny against established architectural criteria, spirit, and business objectives.
  https://www.opengroup.org/architecture/togaf7-doc/arch/p4/comp/comp.htm
* **[R3] Architecture Tradeoff Analysis Method (ATAM)** — Carnegie Mellon SEI. Describes ATAM as a method for evaluating software architectures relative to quality attribute goals and exposing architectural risks and tradeoffs.
  https://www.sei.cmu.edu/library/architecture-tradeoff-analysis-method-collection/
* **[R4] NIST AI RMF Playbook** — NIST AI Resource Center. Describes iterative suggested actions aligned to the AI RMF functions Govern, Map, Measure, and Manage and publishes structured formats such as JSON and CSV.
  https://airc.nist.gov/airmf-resources/playbook/
* **[R5] ISO/IEC 25010:2023 product quality model** — ISO. Defines a product quality model whose characteristics and subcharacteristics provide a reference model for quality to be specified, measured, and evaluated.
  https://www.iso.org/standard/78176.html
* **[R6] Interrater reliability and kappa** — McHugh, *Biochemia Medica* / PMC. Useful background on kappa as an inter-rater reliability statistic; weighted variants are appropriate when disagreement magnitude matters.
  https://pmc.ncbi.nlm.nih.gov/articles/PMC3900052/
* **[R7] LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts** — Microsoft Research, ACL 2024. Demonstrates that decomposed, dimension-specific scoring with learned aggregation achieves 2x improvement over uncalibrated LLM judging.
  https://aclanthology.org/2024.acl-long.745/
* **[R8] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation** — arXiv 2601.08654, January 2026. Addresses prompt instability, unverifiable reasoning, and distributional misalignment through rubric locking and Wasserstein calibration.
  https://arxiv.org/abs/2601.08654
* **[R9] The Science of Rubric Design** — Snorkel AI, 2025. Treats rubrics as models for goal alignment and inter-rater agreement, measured using Cohen’s kappa, QWK, and ICC.
  https://snorkel.ai/blog/the-science-of-rubric-design/
* **[R10] Confusion-Aware Rubric Optimization (CARO)** — arXiv, March 2026. Systematically identifies and resolves rubric ambiguity through targeted rule patches.
  https://arxiv.org/html/2603.00451
* **[R11] Accelerate AWS Well-Architected Reviews with Generative AI** — AWS Machine Learning Blog. Production-validated pattern for automated architecture assessment using Amazon Bedrock.
  https://aws.amazon.com/blogs/machine-learning/accelerate-aws-well-architected-reviews-with-generative-ai/
* **[R12] A Systematic Literature Review of EA Evaluation Methods** — ACM Computing Surveys, 2024. Analyses 109 articles, concluding that automation is the critical adoption factor.
  https://dl.acm.org/doi/full/10.1145/3706582
* **[R13] AutoRubric: A Unified Framework for Rubric-Based LLM Evaluation** — arXiv, 2026.
  https://arxiv.org/html/2603.00077v1
* **[R14] ArchiMate Model Exchange File Format** — The Open Group. XML-based standard with validating XSD schemas for architecture model exchange.
  https://www.opengroup.org/open-group-archimate-model-exchange-file-format
* **[R15] Demystifying Evals for AI Agents** — Anthropic Engineering Blog. Practical guidance on evaluation methodology for AI agent systems.
  https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
* **[R16] AI for Software Architecture: Literature Review and Road Ahead** — arXiv, April 2025. Confirms that AI-assisted architecture evaluation remains largely manual with promising emerging capabilities.
  https://arxiv.org/html/2504.04334v1