@trohde/earos 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (135)
  1. package/README.md +156 -0
  2. package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
  3. package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
  4. package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
  5. package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
  6. package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
  7. package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
  8. package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
  9. package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
  10. package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
  11. package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
  12. package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
  13. package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
  14. package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
  15. package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
  16. package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
  17. package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
  18. package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
  19. package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
  20. package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
  21. package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
  22. package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
  23. package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
  24. package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
  25. package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
  26. package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
  27. package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
  28. package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
  29. package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
  30. package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
  31. package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
  32. package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
  33. package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
  34. package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
  35. package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
  36. package/assets/init/.claude/CLAUDE.md +4 -0
  37. package/assets/init/AGENTS.md +293 -0
  38. package/assets/init/CLAUDE.md +635 -0
  39. package/assets/init/README.md +507 -0
  40. package/assets/init/calibration/gold-set/.gitkeep +0 -0
  41. package/assets/init/calibration/results/.gitkeep +0 -0
  42. package/assets/init/core/core-meta-rubric.yaml +643 -0
  43. package/assets/init/docs/consistency-report.md +325 -0
  44. package/assets/init/docs/getting-started.md +194 -0
  45. package/assets/init/docs/profile-authoring-guide.md +51 -0
  46. package/assets/init/docs/terminology.md +126 -0
  47. package/assets/init/earos.manifest.yaml +104 -0
  48. package/assets/init/evaluations/.gitkeep +0 -0
  49. package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
  50. package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
  51. package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
  52. package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
  53. package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
  54. package/assets/init/overlays/data-governance.yaml +94 -0
  55. package/assets/init/overlays/regulatory.yaml +154 -0
  56. package/assets/init/overlays/security.yaml +92 -0
  57. package/assets/init/profiles/adr.yaml +225 -0
  58. package/assets/init/profiles/capability-map.yaml +223 -0
  59. package/assets/init/profiles/reference-architecture.yaml +426 -0
  60. package/assets/init/profiles/roadmap.yaml +205 -0
  61. package/assets/init/profiles/solution-architecture.yaml +227 -0
  62. package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
  63. package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
  64. package/assets/init/research/reference-architecture-research.md +751 -0
  65. package/assets/init/standard/EAROS.md +1426 -0
  66. package/assets/init/standard/schemas/artifact.schema.json +1295 -0
  67. package/assets/init/standard/schemas/artifact.uischema.json +65 -0
  68. package/assets/init/standard/schemas/evaluation.schema.json +284 -0
  69. package/assets/init/standard/schemas/rubric.schema.json +383 -0
  70. package/assets/init/templates/evaluation-record.template.yaml +58 -0
  71. package/assets/init/templates/new-profile.template.yaml +65 -0
  72. package/bin.js +188 -0
  73. package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
  74. package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
  75. package/dist/assets/arc-CyDBhtDM.js +1 -0
  76. package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
  77. package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
  78. package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
  79. package/dist/assets/channel-CiySTNoJ.js +1 -0
  80. package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
  81. package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
  82. package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
  83. package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
  84. package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
  85. package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
  86. package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
  87. package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
  88. package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
  89. package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
  90. package/dist/assets/clone-ylgRbd3D.js +1 -0
  91. package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
  92. package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
  93. package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
  94. package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
  95. package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
  96. package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
  97. package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
  98. package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
  99. package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
  100. package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
  101. package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
  102. package/dist/assets/graph-DLQn37b-.js +1 -0
  103. package/dist/assets/index-BFFITMT8.js +650 -0
  104. package/dist/assets/index-H7f6VTz1.css +1 -0
  105. package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
  106. package/dist/assets/init-Gi6I4Gst.js +1 -0
  107. package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
  108. package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
  109. package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
  110. package/dist/assets/katex-B1X10hvy.js +261 -0
  111. package/dist/assets/layout-C0dvb42R.js +1 -0
  112. package/dist/assets/linear-j4a8mGj7.js +1 -0
  113. package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
  114. package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
  115. package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
  116. package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
  117. package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
  118. package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
  119. package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
  120. package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
  121. package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
  122. package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
  123. package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
  124. package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
  125. package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
  126. package/dist/index.html +23 -0
  127. package/export-docx.js +1583 -0
  128. package/init.js +353 -0
  129. package/manifest-cli.mjs +207 -0
  130. package/package.json +83 -0
  131. package/schemas/artifact.schema.json +1295 -0
  132. package/schemas/artifact.uischema.json +65 -0
  133. package/schemas/evaluation.schema.json +284 -0
  134. package/schemas/rubric.schema.json +383 -0
  135. package/serve.js +238 -0
package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md
@@ -0,0 +1,160 @@
# Calibration Benchmarks — EAROS Assessment

This file supports Step 7 (calibration check). Use it to sanity-check your score distribution before finalizing. The question it answers: "Does my score distribution feel right given what I read?"

---

## What Score Distributions Look Like in Practice

Architecture artifacts cluster into recognizable patterns. Matching your scores to these patterns is a fast calibration check.

### Strong Artifact (Overall ~3.2–3.8)

A strong artifact is rare. It earns high scores across most dimensions through specific, evidence-rich content — not fluent writing.

Characteristics:
- Stakeholders named with specific concerns mapped to sections
- Scope defined with an explicit in/out list and stated assumptions
- Business drivers traceable to specific design decisions (not just listed in an intro)
- Key architectural decisions have context, options, rationale, and revisit triggers
- Risks are specific, owned, mitigated, or explicitly accepted with rationale
- Compliance treatment is concrete: named controls, named exceptions, named owners
- The artifact could be handed to an independent delivery team and acted on

**Score distribution pattern:**
- Most criteria: 3 or 4
- Some criteria: 2 (minor gaps acceptable)
- Zero criteria below 2 (a single score < 2 on a critical dimension drops status to Conditional Pass)
- Gate criteria: all above threshold

**Warning sign:** If you are scoring most criteria 4, stop and challenge. Very few real-world artifacts achieve score 4 across the board. Score 4 requires explicit, consistent, and complete treatment — not just "the section exists."

---

### Adequate Artifact (Overall ~2.4–3.1)

The most common type in practice. Directionally sound but with material gaps. Governance boards can conditionally approve these.

Characteristics:
- Stakeholders listed but concerns not mapped
- Scope exists but assumptions or exclusions are incomplete
- Drivers mentioned in the introduction but not connected to specific decisions
- Key decisions have rationale but alternatives are not discussed
- Risks listed without mitigations or owners
- Compliance mentioned without specific control mappings

**Score distribution pattern:**
- Some criteria: 3 (the sections that are done well)
- Most criteria: 2 (sections that exist but are incomplete)
- A few criteria: 1 or 0 (notably absent areas)
- Status: Conditional Pass or Rework Required depending on which criteria are low

---

### Weak Artifact (Overall ~1.2–2.3)

Significant work is needed before governance review. These should return to the author.

Characteristics:
- Audience or purpose unclear or generic
- Scope not defined, or so broad as to be useless
- Drivers listed in bullets with no connection to architecture choices
- Decisions presented as facts with no rationale
- Risks entirely absent or "TBD"
- Compliance by assertion ("the solution will comply with all applicable standards")
- Many diagrams with no explanatory narrative

**Score distribution pattern:**
- Most criteria: 1 or 2
- Several criteria: 0
- Likely gate failures on major criteria
- Status: Rework Required

---

### Not Reviewable Artifact

Some artifacts are so incomplete that meaningful scoring is impossible.

Characteristics:
- No scope or purpose
- A single diagram with no narrative
- An author but no owner, no version, no date
- Gate criteria (especially SCP-01 with `severity: critical`) have `evidence_class: none`

When this applies: stop at Step 1. Flag Not Reviewable. Explain what must be added before assessment can proceed.

---

## Self-Calibration Questions

Ask these before finalizing your scores:

### 1. Am I rewarding presence or quality?

Many artifacts have sections that technically exist but are content-free. A "Risks" section that says "Risks: TBD" scores 0, not 2. A "Stakeholders" section that says "Audience: technical stakeholders" scores 1, not 3.

Ask: "Is this section actually useful to a reviewer, or does it just tick a box?"

### 2. Am I scoring fluency as architecture quality?

Well-written artifacts can be architecturally weak. Dense, technical artifacts can be architecturally strong. The score should reflect the content, not the prose quality.

Ask: "If I stripped the adjectives and presented only the facts, what would the score be?"

### 3. Am I applying external knowledge to the score?

It is legitimate to apply external knowledge as `evidence_class: external`, but you must classify it as such. If you know that the architectural pattern chosen is technically sound, that knowledge is external to the artifact. The artifact still needs to justify the choice.

Ask: "Am I filling in gaps with my own knowledge and crediting the artifact for it?"

### 4. Does my distribution make sense?

Run a quick gut check:
- Average above 3.5: Is this truly an exceptional, decision-ready artifact? Double-check the 3 highest scores.
- Average below 1.5: Is this truly near-unusable? Check whether you've over-applied gate failures.
- All scores the same: Almost certainly wrong — artifacts have uneven coverage. Look for criteria you may have scored from impression.
- Many N/A scores: Each N/A requires justification. If you have more than 3 N/A scores, verify they are genuinely inapplicable, not just hard to evidence.
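This gut check can be mechanized. A minimal sketch in Python; the 3.5/1.5 averages and the N/A-count threshold come from the bullets above, while the flag wording and score-list shape are illustrative:

```python
# Quick distribution sanity check for a finished score set.
# Scores are 0-4 values or the string "N/A"; thresholds mirror the gut check above.

def distribution_flags(scores):
    """Return warnings about a suspicious score distribution."""
    numeric = [s for s in scores if s != "N/A"]
    na_count = len(scores) - len(numeric)
    flags = []
    if not numeric:
        return ["no numeric scores - artifact may be Not Reviewable"]
    avg = sum(numeric) / len(numeric)
    if avg > 3.5:
        flags.append("average > 3.5 - double-check the highest scores")
    if avg < 1.5:
        flags.append("average < 1.5 - check for over-applied gate failures")
    if len(set(numeric)) == 1:
        flags.append("all scores identical - look for impression-based scoring")
    if na_count > 3:
        flags.append("more than 3 N/A scores - verify each is justified")
    return flags

print(distribution_flags([3, 3, 3, 3, 3]))
```

An empty result means the distribution passes the gut check, not that the scores are correct.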

### 5. Are my confidence levels honest?

- `high`: You found a direct quote and the score level mapping is unambiguous
- `medium`: You found evidence but interpretation required some judgment
- `low`: Evidence is thin, ambiguous, or heavily inferred

If most of your criteria are `confidence: high`, that may indicate you are not applying enough scrutiny. Real artifacts produce genuine ambiguity on several criteria.

---

## When to Flag for Human Review

Flag these situations explicitly in the evaluation output:

**Flag 1: Conflicting evidence**
You found evidence that suggests two different scores. Example: Section 2 claims PCI compliance, but Section 6 shows a design element that violates PCI DSS requirements. Note both pieces of evidence and score conservatively (the lower score). A human reviewer should adjudicate.

**Flag 2: Heavily inferred or external evidence**
If your score depends substantially on `inferred` or `external` evidence — especially on gate criteria — note this explicitly. The human reviewer should determine whether the inference is warranted.

**Flag 3: Low-confidence gate criteria**
If any gate criterion has `confidence: low`, the gate status (pass or fail) is uncertain. Flag this clearly. Gate failures with low-confidence evidence should not automatically trigger Reject without human confirmation.

**Flag 4: Score outliers that don't fit the pattern**
If one dimension scores much higher or lower than the rest, and you don't have a clear evidence-based explanation, flag it. This may indicate a mis-application of the rubric.

**Flag 5: Artifact type uncertainty**
If you are unsure whether to apply a profile (e.g., "is this a solution architecture or a reference architecture?"), flag the uncertainty and explain what the score would be under each interpretation.

---

## Inter-Rater Reliability Context

EAROS targets Cohen's κ > 0.70 (substantial agreement) for well-defined criteria. In practice:

- **Criteria with clear observable conditions** (like SCP-01 or MNT-01) typically achieve κ > 0.75 between trained reviewers
- **Criteria requiring judgment** (like RAT-01 trade-off quality, or ACT-01 actionability) typically achieve κ 0.55–0.70
- **Criteria evaluating depth** (like CMP-01 compliance treatment) are hardest — target κ > 0.50
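Cohen's κ for two raters over the same criteria can be computed directly from the two score lists. A minimal sketch; the sample ratings are hypothetical, not calibration data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items (nominal categories)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal category frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters constant and identical
    return (observed - expected) / (1 - expected)

a = [3, 2, 3, 4, 1, 2, 3, 2]  # hypothetical reviewer A scores
b = [3, 2, 2, 4, 1, 2, 3, 3]  # hypothetical reviewer B scores
print(round(cohens_kappa(a, b), 2))
```

Note that observed agreement alone overstates reliability: the two hypothetical raters agree on 6 of 8 items (0.75), but κ corrects for chance agreement and lands noticeably lower.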

What this means for agent evaluation: your scores on judgment-heavy criteria should be treated as informed starting points, not definitive judgments. Flag these explicitly for human review.

The goal of calibration benchmarks is not to constrain your scores to expected distributions — it is to help you notice when you may have drifted from evidence-based scoring. Use these as a self-check, not a formula.
package/assets/init/.agents/skills/earos-assess/references/output-templates.md
@@ -0,0 +1,311 @@
# Output Templates — EAROS Assessment

This file contains the full templates for EAROS evaluation outputs. Read this before producing output in Step 8.

The canonical schema is `standard/schemas/evaluation.schema.json`. The worked example is `examples/example-solution-architecture.evaluation.yaml`. When in doubt, mirror the worked example.

---

## Output 1 — YAML Evaluation Record

The YAML record is the machine-readable, archivable output. It is the authoritative record of the evaluation — the markdown report is derived from it.

### Full Template

```yaml
evaluation_id: EVAL-[TYPE]-[NNNN]
# Format: EVAL-SOL-0001 (solution), EVAL-REF-0001 (reference arch), EVAL-ADR-0001, etc.
# Use a sequential number within the artifact type.

rubric_id: [rubric IDs used, comma-separated]
# Example: EAROS-CORE-002, EAROS-REFARCH-001
# If overlays applied: EAROS-CORE-002, EAROS-SOL-001, EAROS-OVR-SEC-001

rubric_version: [version of the profile used]
# Example: 2.0.0

artifact_ref:
  id: [artifact identifier if one exists, or omit]
  title: [full title of the artifact as it appears in the document]
  artifact_type: [solution_architecture | reference_architecture | adr | capability_map | roadmap]
  owner: [team or individual named as owner in the artifact]
  uri: [repo path, URL, or file path — omit if not available]

evaluation_date: '[YYYY-MM-DD]'

evaluators:
  - name: EAROS evaluator
    role: rubric-evaluator
    mode: agent
  # If human also evaluated, add:
  # - name: [name]
  #   role: domain architect
  #   mode: human

status: [pass | conditional_pass | rework_required | reject | not_reviewable]

overall_score: [weighted average to 1 decimal place — e.g. 2.8]
# Compute: sum(dimension_score × weight) / sum(weights)
# Exclude N/A criteria from all calculations

gate_failures:
  - criterion_id: [ID]
    severity: [critical | major]
    effect: [what the gate failure causes]
  # If no gate failures: gate_failures: []

criterion_results:
  - criterion_id: [ID]
    # Use IDs from the rubric YAML, e.g. STK-01, SCP-01, TRC-01
    score: [0 | 1 | 2 | 3 | 4 | "N/A"]
    judgment_type: [observed | inferred | external | mixed | none]
    # 'mixed' when evidence combines observed and inferred
    confidence: [high | medium | low]
    evidence_sufficiency: [sufficient | partial | absent]
    # sufficient: evidence supports the score without reservation
    # partial: evidence exists but is incomplete or ambiguous
    # absent: no evidence found; score is 0 or N/A
    evidence_refs:
      - location: "[section heading, page number, or diagram label]"
        excerpt: "[direct quote or very close paraphrase from the artifact]"
      # Add more refs if multiple evidence sources support the score
    rationale: >
      [1-3 sentences explaining why the evidence maps to this score level.
      Cite the specific evidence. Explain why it is not one level higher
      if the score is below 4.]
    missing_information:
      - "[specific piece of information that would improve this score]"
    # Leave empty if score is 4 or N/A
    recommended_actions:
      - "[specific, actionable remediation step — verb-first, e.g. 'Add a stakeholder-concern table']"
    revised: false
    # Set to true if this score was revised during the challenge pass (Step 6)

  # Repeat for every criterion in core + profile + overlays

dimension_scores:
  - dimension_id: [D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | RA-D1 | etc.]
    dimension_name: [name from rubric]
    score: [weighted average of criteria in this dimension, 1 decimal place]
    weight: [weight from rubric YAML — default 1.0]
    summary: "[1 sentence summary of why this dimension scored this way]"

  # Repeat for every dimension in core + profile

narrative_summary: |
  [2-3 paragraphs for a governance reviewer. Cover:
  1. What the artifact is, who it is for, and the overall verdict
  2. The most significant strengths (what was well done)
  3. The most significant weaknesses (what holds it back)
  Do NOT restate the criterion scores — synthesize them into a judgment.]

summary:
  strengths:
    - "[Key strength — specific, not generic]"
    - "[Second strength]"
  weaknesses:
    - "[Key weakness — specific, not generic]"
    - "[Second weakness]"
  risks:
    - "[Risk that follows from a weakness — what could go wrong in delivery/governance]"
  next_actions:
    - "[Top-priority action]"
    - "[Second priority action]"
  decision_narrative: >
    [1-2 sentences on what happens next — should this go to governance board as-is,
    conditional on specific fixes, or returned for rework?]

recommended_actions:
  - priority: 1
    criterion_id: [ID of the criterion this addresses]
    action: "[Specific, actionable step — verb-first]"
    owner_suggestion: "[Who should own this — team role, not individual]"
  - priority: 2
    criterion_id: [ID]
    action: "[Action]"
    owner_suggestion: "[Role]"
  # Top 5 actions, ordered by impact on overall status
```

---

## Field-by-Field Guidance

### evaluation_id

Use a sequential ID within the artifact type. If you don't have a numbering system, use the date: `EVAL-SOL-20260319-001`. The ID must be unique within the organization's evaluation records.

### status

The status is determined by gates first, then thresholds (Step 8 in SKILL.md). Do not set status until all gate checks and aggregation are complete. Common error: setting `conditional_pass` when a critical gate has failed — critical gate failure always = `reject`.
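The gates-first rule can be sketched as a small function. The critical-gate behavior and the 2.4 `conditional_pass` cutoff are stated in this file; the 3.2 and 2.0 cutoffs and the major-gate handling are assumptions borrowed from the per-dimension traffic lights, so verify them against SKILL.md Step 8 before relying on them:

```python
def determine_status(overall_score, gate_failures):
    """Gates first, then thresholds. Critical gate failure always rejects."""
    if any(g["severity"] == "critical" for g in gate_failures):
        return "reject"
    if gate_failures:
        # Assumption: open major gate failures cap the outcome below full pass
        return "rework_required" if overall_score < 2.4 else "conditional_pass"
    if overall_score >= 3.2:   # assumed pass cutoff
        return "pass"
    if overall_score >= 2.4:   # conditional_pass threshold stated above
        return "conditional_pass"
    if overall_score >= 2.0:   # assumed rework band
        return "rework_required"
    return "reject"

print(determine_status(2.8, [{"severity": "critical"}]))
```

The point of the sketch is ordering: the gate check runs before any threshold comparison, so a high average never rescues a critical gate failure.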

### overall_score

This is the weighted average across all dimensions. The formula:
```
overall_score = sum(dimension_score × dimension_weight) / sum(dimension_weights)
```

Round to 1 decimal place. A score of 2.35 rounds to 2.4, which is the `conditional_pass` threshold — be precise.
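A minimal sketch of the aggregation, including N/A exclusion. It uses `Decimal` so that 2.35 rounds half-up to 2.4; with binary floats and Python's built-in banker's rounding, `round(2.35, 1)` would come out as 2.3. The dictionary shape is illustrative, not the schema:

```python
from decimal import Decimal, ROUND_HALF_UP

def overall_score(dimensions):
    """Weighted average of dimension scores, N/A excluded, half-up to 1 dp."""
    scored = [d for d in dimensions if d["score"] != "N/A"]
    total = sum(Decimal(str(d["score"])) * Decimal(str(d["weight"])) for d in scored)
    weights = sum(Decimal(str(d["weight"])) for d in scored)
    return float((total / weights).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))

dims = [
    {"score": 3.0, "weight": 1.0},
    {"score": 2.0, "weight": 1.0},
    {"score": "N/A", "weight": 1.0},  # excluded from both sums
    {"score": 2.0, "weight": 0.5},
]
print(overall_score(dims))
```

Note the N/A entry drops out of both the numerator and the denominator; leaving its weight in the denominator would silently deflate the score.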

### judgment_type

This is the evidence class for the criterion as a whole. If all evidence is `observed`, use `observed`. If you used a mix of observed and inferred evidence to reach the score, use `mixed`. `none` means no evidence was found (score must be 0 or 1).

### evidence_sufficiency

This is your assessment of whether the evidence you found is adequate to confidently assign the score:
- `sufficient` — the evidence clearly matches one level; you wouldn't expect a reviewer to disagree
- `partial` — evidence exists but is ambiguous; a different reviewer might score differently
- `absent` — no evidence was found; score is based on absence

### narrative_summary

This is the most important text in the record for human reviewers. Write it for a governance board member who will skim the criterion table but read the narrative carefully. The narrative should:
- Name what the artifact is and its governance context
- Identify the 2-3 things that most determine the outcome
- Give a clear recommendation (proceed, fix X first, rework)

Avoid repeating individual scores. Synthesize.

---

## Output 2 — Markdown Report

The markdown report is the human-readable deliverable. It should be self-contained — a governance reviewer should be able to read it without the YAML.

### Full Template

```markdown
# EAROS Assessment Report

**Artifact:** [full title from artifact]
**Artifact Type:** [type]
**Owner:** [owner from artifact]
**Evaluation Date:** [YYYY-MM-DD]
**Rubric:** [rubric IDs used]

---

## Overall Status

**Status:** [traffic-light emoji + label]
- 🟢 Pass
- 🟡 Conditional Pass
- 🟠 Rework Required
- 🔴 Reject
- ⚪ Not Reviewable

**Overall Score:** [X.X / 4.0]

[1 sentence verdict: what the status means for this artifact]

---

## Dimension Summary

| Dimension | Score | Status | Key Finding |
|-----------|-------|--------|-------------|
| [Dimension name] | [X.X] | [🟢/🟡/🟠/🔴] | [1-phrase summary] |
| [Dimension name] | [X.X] | [🟢/🟡/🟠/🔴] | [1-phrase summary] |

Traffic light per dimension:
- 🟢 ≥ 3.2
- 🟡 2.4–3.19
- 🟠 2.0–2.39
- 🔴 < 2.0

---

## Gate Failures

[If none:]
No gate failures. The artifact passes all gate checks.

[If failures:]
| Criterion | Severity | Effect |
|-----------|----------|--------|
| [criterion_id] | [CRITICAL / MAJOR] | [what it causes] |

---

## Key Findings

**Strengths:**
- [Specific strength with evidence reference]
- [Second strength]

**Weaknesses:**
- [Specific weakness with criterion reference]
- [Second weakness]

**Risks:**
- [Risk that follows from a weakness]

---

## Recommended Actions

| Priority | Criterion | Action | Suggested Owner |
|----------|-----------|--------|-----------------|
| 1 | [ID] | [Specific action] | [Role] |
| 2 | [ID] | [Specific action] | [Role] |
| 3 | [ID] | [Specific action] | [Role] |
| 4 | [ID] | [Specific action] | [Role] |
| 5 | [ID] | [Specific action] | [Role] |

---

## Criterion Detail

[For each criterion, in dimension order:]

### [Dimension Name] (Score: X.X)

#### [criterion_id]: [criterion question]

| Field | Value |
|-------|-------|
| Score | [0-4 or N/A] |
| Evidence Class | [observed / inferred / external / none] |
| Confidence | [high / medium / low] |
| Evidence Sufficiency | [sufficient / partial / absent] |

**Evidence:** [Section/location] — "[Direct quote or close paraphrase]"

**Rationale:** [1-3 sentences explaining the score]

**Missing:** [What would improve this score, or "Nothing — criterion fully satisfied"]

**Action:** [Specific remediation step, or "No action required"]

---

[Continue for all criteria]

---

## Narrative Summary

[2-3 paragraphs — synthesized judgment for a governance reviewer.
Copy from the YAML narrative_summary field.]

---

*Evaluated using EAROS [rubric IDs]. Schema: evaluation.schema.json.*
```
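The per-dimension traffic lights in the Dimension Summary above reduce to a threshold lookup. A minimal sketch using the cutoffs as listed:

```python
def dimension_light(score):
    """Map a dimension score to the report's traffic-light label."""
    if score >= 3.2:
        return "🟢"
    if score >= 2.4:
        return "🟡"
    if score >= 2.0:
        return "🟠"
    return "🔴"

# Boundary checks against the thresholds listed in the template
print([dimension_light(s) for s in (3.2, 3.19, 2.4, 2.0, 1.99)])
```

Scores arrive already rounded to one decimal place, so the 3.19 upper bound of the 🟡 band and the `>= 3.2` test for 🟢 cannot collide.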

---

## Validation

Before submitting the YAML evaluation record, check:

1. Every criterion in the loaded rubric files has a result entry
2. Every score has at least one `evidence_refs` entry (unless `evidence_class: none`)
3. `gate_failures` matches the gate criteria that failed (not just any low score)
4. `overall_score` is the weighted average, not a simple average
5. `status` was determined by gates first, then thresholds
6. The `narrative_summary` does not just list criterion scores — it synthesizes them
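Checks 1 and 2 are mechanical and can be automated once the record is parsed (for example with a YAML library). A sketch; the dictionary shapes follow the YAML template above, and check 2 reads the `judgment_type` field, which carries the evidence class:

```python
def validate_record(record, rubric_criterion_ids):
    """Run structural checks 1-2 against a parsed evaluation record."""
    problems = []
    results = {r["criterion_id"]: r for r in record.get("criterion_results", [])}
    # Check 1: every rubric criterion has a result entry
    for cid in rubric_criterion_ids:
        if cid not in results:
            problems.append(f"missing result for {cid}")
    # Check 2: every scored criterion carries evidence, unless its evidence class is none
    for cid, r in results.items():
        if r.get("judgment_type") != "none" and not r.get("evidence_refs"):
            problems.append(f"{cid}: score without evidence_refs")
    return problems

record = {"criterion_results": [
    {"criterion_id": "STK-01", "judgment_type": "observed",
     "evidence_refs": [{"location": "Section 2", "excerpt": "..."}]},
    {"criterion_id": "SCP-01", "judgment_type": "observed", "evidence_refs": []},
]}
print(validate_record(record, ["STK-01", "SCP-01", "TRC-01"]))
```

Checks 3 through 6 involve judgment (gate semantics, weighting, synthesis) and remain manual.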

The full JSON Schema for validation is at `standard/schemas/evaluation.schema.json`. If you have access to a YAML validator, validate the output before delivery.