@trohde/earos 1.1.2 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (70)
  1. package/assets/init/docs/getting-started.md +1 -1
  2. package/assets/init/docs/onboarding/agent-assisted.md +19 -19
  3. package/assets/init/docs/onboarding/first-assessment.md +18 -18
  4. package/assets/init/docs/onboarding/governed-review.md +10 -10
  5. package/assets/init/docs/onboarding/overview.md +15 -15
  6. package/assets/init/docs/onboarding/scaling-optimization.md +13 -13
  7. package/assets/init/docs/plans/2026-03-23-001-refactor-site-review-findings-plan.md +195 -0
  8. package/assets/init/docs/plans/2026-03-23-002-refactor-cli-review-findings-plan.md +736 -0
  9. package/assets/init/docs/profile-authoring-guide.md +5 -9
  10. package/assets/init/docs/terminology.md +1 -1
  11. package/bin.js +156 -36
  12. package/dist/assets/{_basePickBy-PmSUrUsK.js → _basePickBy-BlC_TeV6.js} +1 -1
  13. package/dist/assets/{_baseUniq-HuZouVIz.js → _baseUniq-CVy7rcC1.js} +1 -1
  14. package/dist/assets/{arc-CJFxtF3d.js → arc-Cd8wvd7z.js} +1 -1
  15. package/dist/assets/{architectureDiagram-2XIMDMQ5-XA-oU2UG.js → architectureDiagram-2XIMDMQ5-D_f4_aMp.js} +1 -1
  16. package/dist/assets/{blockDiagram-WCTKOSBZ-Oxp-wAST.js → blockDiagram-WCTKOSBZ-B-y6N5--.js} +1 -1
  17. package/dist/assets/{c4Diagram-IC4MRINW-D8m5hQH9.js → c4Diagram-IC4MRINW-C3-v3oNT.js} +1 -1
  18. package/dist/assets/channel-BSC0F15G.js +1 -0
  19. package/dist/assets/{chunk-4BX2VUAB-D2kBTn2O.js → chunk-4BX2VUAB-CMPwQN83.js} +1 -1
  20. package/dist/assets/{chunk-55IACEB6-Dxqrf5oZ.js → chunk-55IACEB6-Bdkfhvrr.js} +1 -1
  21. package/dist/assets/{chunk-FMBD7UC4-DoOEFFQC.js → chunk-FMBD7UC4-ptKQX5uF.js} +1 -1
  22. package/dist/assets/{chunk-JSJVCQXG-BerphV2K.js → chunk-JSJVCQXG-DO0UU_OX.js} +1 -1
  23. package/dist/assets/{chunk-KX2RTZJC-CxUAqT05.js → chunk-KX2RTZJC-DRj2OZnD.js} +1 -1
  24. package/dist/assets/{chunk-NQ4KR5QH-fCqZgFkU.js → chunk-NQ4KR5QH-C4Nsf7ww.js} +1 -1
  25. package/dist/assets/{chunk-QZHKN3VN-HlpHnJEy.js → chunk-QZHKN3VN-B1GO0Nwy.js} +1 -1
  26. package/dist/assets/{chunk-WL4C6EOR-D9yxAHyd.js → chunk-WL4C6EOR-lFR6fjR8.js} +1 -1
  27. package/dist/assets/classDiagram-VBA2DB6C-BHDWMOEz.js +1 -0
  28. package/dist/assets/classDiagram-v2-RAHNMMFH-BHDWMOEz.js +1 -0
  29. package/dist/assets/clone-BdN-3iAD.js +1 -0
  30. package/dist/assets/{cose-bilkent-S5V4N54A-F5xOBvqW.js → cose-bilkent-S5V4N54A-IpR9mVIO.js} +1 -1
  31. package/dist/assets/{dagre-KLK3FWXG-CD3BTpHv.js → dagre-KLK3FWXG-B4YA6T7N.js} +1 -1
  32. package/dist/assets/{diagram-E7M64L7V-C3D9MCay.js → diagram-E7M64L7V-Do5l6es_.js} +1 -1
  33. package/dist/assets/{diagram-IFDJBPK2-zJBVM-GK.js → diagram-IFDJBPK2-D5MxfKVv.js} +1 -1
  34. package/dist/assets/{diagram-P4PSJMXO-BrmFZOLB.js → diagram-P4PSJMXO-Djr28EgW.js} +1 -1
  35. package/dist/assets/{erDiagram-INFDFZHY-aSMhKiV2.js → erDiagram-INFDFZHY-BuM-rbCL.js} +1 -1
  36. package/dist/assets/{flowDiagram-PKNHOUZH-DwgX7l8F.js → flowDiagram-PKNHOUZH-By3WGI7Q.js} +1 -1
  37. package/dist/assets/{ganttDiagram-A5KZAMGK-C57Hz6QW.js → ganttDiagram-A5KZAMGK-GLmBfK72.js} +1 -1
  38. package/dist/assets/{gitGraphDiagram-K3NZZRJ6-CuchqqGh.js → gitGraphDiagram-K3NZZRJ6-BN0iXeIv.js} +1 -1
  39. package/dist/assets/{graph-CPFGBV5J.js → graph-CDzuMtjV.js} +1 -1
  40. package/dist/assets/{index-DMt1cpG6.js → index-DoeSN_Oe.js} +130 -130
  41. package/dist/assets/{infoDiagram-LFFYTUFH-Dd_5tfX7.js → infoDiagram-LFFYTUFH-C888gaFw.js} +1 -1
  42. package/dist/assets/{ishikawaDiagram-PHBUUO56-DwosSEvT.js → ishikawaDiagram-PHBUUO56-ChIO9DG-.js} +1 -1
  43. package/dist/assets/{journeyDiagram-4ABVD52K-BuCxcsX0.js → journeyDiagram-4ABVD52K-CufMUDcs.js} +1 -1
  44. package/dist/assets/{kanban-definition-K7BYSVSG-DF_1UCkW.js → kanban-definition-K7BYSVSG-BpsSVpX8.js} +1 -1
  45. package/dist/assets/{layout-DIcS6m1g.js → layout-B8RWVBSF.js} +1 -1
  46. package/dist/assets/{linear-BXkwBhoJ.js → linear-BJwxtq9r.js} +1 -1
  47. package/dist/assets/{mindmap-definition-YRQLILUH-DcDvYagd.js → mindmap-definition-YRQLILUH-C6WPimbf.js} +1 -1
  48. package/dist/assets/{pieDiagram-SKSYHLDU-BmeDeWDM.js → pieDiagram-SKSYHLDU-DeCGMWf8.js} +1 -1
  49. package/dist/assets/{quadrantDiagram-337W2JSQ-3zfjULUM.js → quadrantDiagram-337W2JSQ-D9TWaS83.js} +1 -1
  50. package/dist/assets/{requirementDiagram-Z7DCOOCP-B2wQMJpq.js → requirementDiagram-Z7DCOOCP-DTnuXlAq.js} +1 -1
  51. package/dist/assets/{sankeyDiagram-WA2Y5GQK-__kKlCTq.js → sankeyDiagram-WA2Y5GQK-B2dplCgD.js} +1 -1
  52. package/dist/assets/{sequenceDiagram-2WXFIKYE-B7O81Vih.js → sequenceDiagram-2WXFIKYE-cBvgSSju.js} +1 -1
  53. package/dist/assets/{stateDiagram-RAJIS63D-CcJaDrAK.js → stateDiagram-RAJIS63D-Cwr7VtSX.js} +1 -1
  54. package/dist/assets/stateDiagram-v2-FVOUBMTO-B59h7VTZ.js +1 -0
  55. package/dist/assets/{timeline-definition-YZTLITO2-DSaQQqIU.js → timeline-definition-YZTLITO2-Dkp163fK.js} +1 -1
  56. package/dist/assets/{treemap-KZPCXAKY-9Hcrd8XD.js → treemap-KZPCXAKY-BUWHa5xU.js} +1 -1
  57. package/dist/assets/{vennDiagram-LZ73GAT5-BqHNyca2.js → vennDiagram-LZ73GAT5-BihD66ma.js} +1 -1
  58. package/dist/assets/{xychartDiagram-JWTSCODW-BqeYf6Fk.js → xychartDiagram-JWTSCODW-Cw4lPbuZ.js} +1 -1
  59. package/dist/index.html +1 -1
  60. package/export-docx.js +12 -4
  61. package/init.js +19 -14
  62. package/manifest-cli.mjs +32 -3
  63. package/package.json +3 -2
  64. package/serve.js +44 -19
  65. package/utils/export-markdown.js +486 -0
  66. package/dist/assets/channel-SoktpVBQ.js +0 -1
  67. package/dist/assets/classDiagram-VBA2DB6C-BT2AdZTe.js +0 -1
  68. package/dist/assets/classDiagram-v2-RAHNMMFH-BT2AdZTe.js +0 -1
  69. package/dist/assets/clone-DOjIfi5r.js +0 -1
  70. package/dist/assets/stateDiagram-v2-FVOUBMTO-B2goOPt-.js +0 -1
@@ -24,7 +24,7 @@ EaROS has profiles for the most common enterprise architecture artifact types:
  | Architecture Decision Record (ADR) | `profiles/adr.yaml` | Approved |
  | Capability map | `profiles/capability-map.yaml` | Approved |
  | Architecture roadmap | `profiles/roadmap.yaml` | Draft |
- | Other / unknown | Core only: `core/core-meta-rubric.yaml` | --- |
+ | Other / unknown | Core only: `core/core-meta-rubric.yaml` | — |

  > **Status:** *Approved* profiles have completed calibration. *Draft* profiles are usable but have not yet been calibrated with inter-rater reliability measured. Check `earos.manifest.yaml` for the latest status of each rubric.

@@ -2,7 +2,7 @@

  > **Level 3 to 4: Governed to Hybrid**

- Your team produces governed, calibrated evaluations. Now you bring AI agents into the process --- not to replace human reviewers, but to provide an independent second perspective that strengthens every assessment.
+ Your team produces governed, calibrated evaluations. Now you bring AI agents into the process — not to replace human reviewers, but to provide an independent second perspective that strengthens every assessment.

  ## What Changes at This Level

@@ -14,23 +14,23 @@ Agent evaluations follow an 8-step directed acyclic graph (DAG). Each step must

  ### The 8-Step DAG Evaluation Flow

- **Step 1 --- Structural Validation.** The agent confirms the artifact conforms to its declared type. Does it have the expected sections? Is it machine-readable or does it require OCR? Can the agent identify the artifact's scope and purpose?
+ **Step 1 — Structural Validation.** The agent confirms the artifact conforms to its declared type. Does it have the expected sections? Is it machine-readable or does it require OCR? Can the agent identify the artifact's scope and purpose?

- **Step 2 --- Content Extraction.** The agent identifies sections, diagrams, traceability elements, and key content areas. This builds a map of the artifact's structure before scoring begins.
+ **Step 2 — Content Extraction.** The agent identifies sections, diagrams, traceability elements, and key content areas. This builds a map of the artifact's structure before scoring begins.

- **Step 3 --- Criterion Scoring.** The agent applies the RULERS protocol to each criterion: extract a direct quote or reference from the artifact as evidence, then match it against the `scoring_guide` level descriptors to assign a 0--4 score. If no evidence can be found, score N/A and explain why.
+ **Step 3 — Criterion Scoring.** The agent applies the RULERS protocol to each criterion: extract a direct quote or reference from the artifact as evidence, then match it against the `scoring_guide` level descriptors to assign a 0–4 score. If no evidence can be found, score N/A and explain why.

- **Step 4 --- Cross-Reference Validation.** The agent checks consistency across views: do component names match across diagrams? Do interface definitions agree between the API contract and the sequence diagram? Are there contradictions between sections?
+ **Step 4 — Cross-Reference Validation.** The agent checks consistency across views: do component names match across diagrams? Do interface definitions agree between the API contract and the sequence diagram? Are there contradictions between sections?

- **Step 5 --- Dimension Aggregation.** The agent computes weighted dimension averages using the dimension weights defined in the rubric.
+ **Step 5 — Dimension Aggregation.** The agent computes weighted dimension averages using the dimension weights defined in the rubric.

- **Step 6 --- Challenge Pass.** A second perspective (another agent instance or a human) challenges the evaluator's highest and lowest scores. Are the highest scores supported by strong observed evidence, or are they inflated? Are the lowest scores genuinely that weak, or did the evaluator miss relevant content?
+ **Step 6 — Challenge Pass.** A second perspective (another agent instance or a human) challenges the evaluator's highest and lowest scores. Are the highest scores supported by strong observed evidence, or are they inflated? Are the lowest scores genuinely that weak, or did the evaluator miss relevant content?

- **Step 7 --- Calibration.** The agent aligns its score distribution to reference human distributions using the Wasserstein-based method (`rulers_wasserstein`). This prevents systematic over-scoring or under-scoring relative to human reviewers.
+ **Step 7 — Calibration.** The agent aligns its score distribution to reference human distributions using the Wasserstein-based method (`rulers_wasserstein`). This prevents systematic over-scoring or under-scoring relative to human reviewers.

- **Step 8 --- Status Determination.** Gates are checked first (a critical gate failure blocks a passing status --- the specific outcome, `Reject` or `Not Reviewable`, is determined by the criterion's `failure_effect`), then the weighted average is computed and applied against the status thresholds.
+ **Step 8 — Status Determination.** Gates are checked first (a critical gate failure blocks a passing status — the specific outcome, `Reject` or `Not Reviewable`, is determined by the criterion's `failure_effect`), then the weighted average is computed and applied against the status thresholds.

- > **The DAG is not optional.** Skipping steps --- particularly the challenge pass (Step 6) --- undermines evaluation quality. An agent evaluation without a challenge pass is an unchecked evaluation.
+ > **The DAG is not optional.** Skipping steps — particularly the challenge pass (Step 6) — undermines evaluation quality. An agent evaluation without a challenge pass is an unchecked evaluation.
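Step 5 is the one purely mechanical step, so it is easy to show concretely. A minimal sketch of weighted dimension aggregation, assuming illustrative field names (the real weights and record shape come from the rubric YAML and `evaluation.schema.json`):

```typescript
// Sketch of Step 5 (dimension aggregation). Field names are illustrative
// assumptions, not the EAROS schema; weights come from the rubric.
interface CriterionScore {
  dimension: string;    // e.g. "D1"
  score: number | null; // 0-4, or null for N/A
  weight: number;       // criterion weight within its dimension
}

function dimensionAverages(scores: CriterionScore[]): Map<string, number> {
  const byDimension = new Map<string, CriterionScore[]>();
  for (const s of scores) {
    // Assumption: N/A criteria are excluded from the average, not counted as 0.
    if (s.score === null) continue;
    byDimension.set(s.dimension, [...(byDimension.get(s.dimension) ?? []), s]);
  }
  const averages = new Map<string, number>();
  for (const [dim, bucket] of byDimension) {
    const totalWeight = bucket.reduce((sum, s) => sum + s.weight, 0);
    const weightedSum = bucket.reduce((sum, s) => sum + s.weight * (s.score as number), 0);
    averages.set(dim, weightedSum / totalWeight);
  }
  return averages;
}
```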
 
  ## Setting Up Agent Evaluation

@@ -47,9 +47,9 @@ The workspace includes 10 EAROS skills. The three most relevant for agent evaluation

  | Skill | Purpose |
  |-------|---------|
- | `earos-assess` | Primary evaluation --- runs the full 8-step DAG on any artifact |
- | `earos-review` | Challenger --- audits an existing evaluation for over-scoring and unsupported claims |
- | `earos-template-fill` | Author guide --- coaches artifact authors through writing assessment-ready documents |
+ | `earos-assess` | Primary evaluation — runs the full 8-step DAG on any artifact |
+ | `earos-review` | Challenger — audits an existing evaluation for over-scoring and unsupported claims |
+ | `earos-template-fill` | Author guide — coaches artifact authors through writing assessment-ready documents |

  ### With Other AI Agents

@@ -65,7 +65,7 @@ To run an agent assessment, provide the artifact and invoke the `earos-assess` skill
  4. Execute the full 8-step DAG
  5. Produce an evaluation record conforming to `evaluation.schema.json`

- The output includes scores, evidence anchors, evidence classes (observed/inferred/external), confidence levels (high/medium/low), and a status determination. Every score is auditable --- you can trace each one back to the evidence that supports it.
+ The output includes scores, evidence anchors, evidence classes (observed/inferred/external), confidence levels (high/medium/low), and a status determination. Every score is auditable — you can trace each one back to the evidence that supports it.
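The fields named in that paragraph can be sketched as a TypeScript shape. This is a reading aid only; the authoritative definition is `evaluation.schema.json`, and the property names below are assumptions:

```typescript
// Assumed shape for illustration; validate real records against
// evaluation.schema.json rather than this sketch.
type EvidenceClass = "observed" | "inferred" | "external";
type Confidence = "high" | "medium" | "low";
type Status =
  | "Pass"
  | "Conditional Pass"
  | "Rework Required"
  | "Reject"
  | "Not Reviewable";

interface CriterionResult {
  criterionId: string;          // e.g. "SCP-01"
  score: number | null;         // 0-4, or null for N/A
  evidenceAnchor: string;       // direct quote, section reference, or diagram ID
  evidenceClass: EvidenceClass;
  confidence: Confidence;
}

interface EvaluationRecord {
  criteria: CriterionResult[];
  status: Status;
}
```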
 
  ## The Hybrid Model

@@ -75,7 +75,7 @@ The hybrid model is the defining practice of Level 4. Here is how it works:

  2. **Score comparison.** After both evaluations are complete, compare results criterion by criterion.

- 3. **Disagreement resolution.** Any disagreement of 2 or more points on the same criterion must be resolved. Do not split the difference --- go back to the `scoring_guide` level descriptors and determine which score more accurately reflects the evidence.
+ 3. **Disagreement resolution.** Any disagreement of 2 or more points on the same criterion must be resolved. Do not split the difference — go back to the `scoring_guide` level descriptors and determine which score more accurately reflects the evidence.

  4. **Reconciled record.** The final evaluation record captures both evaluators (mode: human and mode: agent) and notes where reconciliation occurred.
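The comparison rule in step 3 is easy to mechanize. A minimal sketch, assuming each evaluation exposes per-criterion scores keyed by criterion ID:

```typescript
// Flag every criterion where human and agent differ by 2 or more points.
// Flagged criteria must be resolved against the scoring_guide, not averaged.
function disagreements(
  human: Record<string, number>,
  agent: Record<string, number>,
): string[] {
  return Object.keys(human).filter(
    (id) => id in agent && Math.abs(human[id] - agent[id]) >= 2,
  );
}

// Example: SCP-01 differs by 2 points and is flagged; TRC-01 differs by 1 and is not.
console.log(disagreements({ "SCP-01": 3, "TRC-01": 2 }, { "SCP-01": 1, "TRC-01": 3 }));
```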
 
@@ -87,7 +87,7 @@ The hybrid model is the defining practice of Level 4. Here is how it works:

  ## The Challenge Pass

- Step 6 of the DAG --- the challenge pass --- deserves special attention because it is the most commonly skipped step and the most valuable.
+ Step 6 of the DAG — the challenge pass — deserves special attention because it is the most commonly skipped step and the most valuable.

  In the challenge pass, a second perspective reviews the evaluation and specifically targets:

@@ -104,11 +104,11 @@ At Level 4, you begin tracking quantitative metrics for evaluation quality:
  | Metric | Target | What It Tells You |
  |--------|--------|------------------|
  | **Cohen's kappa** (human-agent) | > 0.70 | Agreement between human and agent after calibration |
- | **Spearman's rho** (human-agent) | > 0.80 | Rank-order correlation --- do human and agent agree on which criteria are strong vs. weak? |
+ | **Spearman's rho** (human-agent) | > 0.80 | Rank-order correlation — do human and agent agree on which criteria are strong vs. weak? |
  | **Gate failure rate** | Track trend | How often critical or major gates fail, and for which criteria |
  | **Score distribution** | Compare over time | Are scores clustering (suggesting rubber-stamping) or well-distributed? |

- Track these metrics per rubric, per team, and over time. A declining kappa suggests calibration drift --- time to re-calibrate.
+ Track these metrics per rubric, per team, and over time. A declining kappa suggests calibration drift — time to re-calibrate.
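For teams computing these metrics themselves, a minimal unweighted Cohen's kappa over paired 0-4 scores looks like this. It is the textbook formula, not an EAROS API; ordinal scales are often better served by a weighted variant:

```typescript
// Unweighted Cohen's kappa for two raters scoring the same items on the 0-4 scale.
function cohensKappa(rater1: number[], rater2: number[]): number {
  const n = rater1.length;
  // Observed agreement: fraction of items where the raters match exactly.
  const observed = rater1.filter((score, i) => score === rater2[i]).length / n;
  // Expected chance agreement from each rater's marginal distribution.
  let expected = 0;
  for (const category of [0, 1, 2, 3, 4]) {
    const p1 = rater1.filter((s) => s === category).length / n;
    const p2 = rater2.filter((s) => s === category).length / n;
    expected += p1 * p2;
  }
  // Degenerate case: both raters always use the same single category.
  return expected === 1 ? 1 : (observed - expected) / (1 - expected);
}
```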
 
  ## Checkpoint: You Are at Level 4 When...

@@ -116,7 +116,7 @@ Track these metrics per rubric, per team, and over time. A declining kappa suggests calibration drift — time to re-calibrate.
  - [ ] Every agent evaluation includes a challenge pass (Step 6)
  - [ ] Human-agent disagreements of 2 or more points are routinely resolved against level descriptors
  - [ ] You track inter-rater reliability metrics (kappa and/or Spearman's rho)
- - [ ] Agent evaluations are auditable --- evidence anchors, evidence classes, and confidence are captured for every score
+ - [ ] Agent evaluations are auditable — evidence anchors, evidence classes, and confidence are captured for every score

  ## Next Steps

@@ -8,14 +8,14 @@ This guide walks you from zero to your first scored architecture evaluation. By

  - How to install the EAROS CLI and initialize a workspace
  - The 9 dimensions and 10 criteria of the core meta-rubric
- - How the 0--4 scoring scale works in practice
+ - How the 0–4 scoring scale works in practice
  - How to find evidence in an artifact, cite it, and assign a score
  - How gates work and when they override the average
  - How to interpret your evaluation result

  ## Prerequisites

- You need one thing: **an architecture artifact to assess**. This can be any document your organization produces --- a solution design, an ADR, a capability map, a reference architecture, a roadmap. It does not need to be perfect; in fact, a flawed artifact is more instructive for a first assessment.
+ You need one thing: **an architecture artifact to assess**. This can be any document your organization produces — a solution design, an ADR, a capability map, a reference architecture, a roadmap. It does not need to be perfect; in fact, a flawed artifact is more instructive for a first assessment.

  ## Installing the CLI

@@ -31,7 +31,7 @@ Then initialize a workspace in your project directory:
  earos init my-workspace
  ```

- This creates a complete EAROS workspace with rubric files, JSON schemas, agent skills, and an `AGENTS.md` file for AI-assisted evaluation. The workspace is self-contained --- everything you need is scaffolded into the directory.
+ This creates a complete EAROS workspace with rubric files, JSON schemas, agent skills, and an `AGENTS.md` file for AI-assisted evaluation. The workspace is self-contained — everything you need is scaffolded into the directory.

  ## Understanding the Workspace

@@ -39,11 +39,11 @@ This creates a complete EAROS workspace with rubric files, JSON schemas, agent skills

  The `earos init` command creates a structured directory containing:

- - **Rubric files** --- the core meta-rubric and all built-in profiles and overlays (YAML)
- - **JSON schemas** --- for validating rubrics, evaluation records, and artifact documents
- - **Agent skills** --- 10 pre-configured skills for AI-assisted evaluation (in `.agents/skills/`)
- - **AGENTS.md** --- agent-agnostic instructions for AI tools like Cursor, Copilot, and Windsurf
- - **Manifest** --- an inventory of all available rubrics
+ - **Rubric files** — the core meta-rubric and all built-in profiles and overlays (YAML)
+ - **JSON schemas** — for validating rubrics, evaluation records, and artifact documents
+ - **Agent skills** — 10 pre-configured skills for AI-assisted evaluation (in `.agents/skills/`)
+ - **AGENTS.md** — agent-agnostic instructions for AI tools like Cursor, Copilot, and Windsurf
+ - **Manifest** — an inventory of all available rubrics

  ## The Core Meta-Rubric

@@ -61,9 +61,9 @@ The core meta-rubric (`EAROS-CORE-002`) is the universal foundation. It applies
  | **D8: Actionability and implementation relevance** | Can a delivery team act on this artifact without significant guesswork? |
  | **D9: Artifact maintainability and stewardship** | Is the artifact versioned, owned, and structured so it can be maintained over time? |

- For your first assessment, you will score every criterion in every dimension. The core rubric is intentionally compact --- 10 criteria is manageable for a first pass.
+ For your first assessment, you will score every criterion in every dimension. The core rubric is intentionally compact — 10 criteria is manageable for a first pass.

- ## The 0--4 Scoring Scale
+ ## The 0–4 Scoring Scale

  Every criterion uses the same ordinal scale:

@@ -90,23 +90,23 @@ Follow these steps for each of the 10 criteria:

  3. **Record the evidence reference.** Write down where you found it: section number, page, diagram ID, or a direct quote. "Section 3 states: 'Primary stakeholders are the CTO and Head of Payments'" is valid evidence. "The artifact seems to address this" is not.

- 4. **Assign the score.** Match what you found against the level descriptors in the `scoring_guide`. Use the `decision_tree` if you are unsure --- it provides IF/THEN logic for resolving ambiguous cases.
+ 4. **Assign the score.** Match what you found against the level descriptors in the `scoring_guide`. Use the `decision_tree` if you are unsure — it provides IF/THEN logic for resolving ambiguous cases.

  5. **Move to the next criterion.** Repeat for all 10 criteria.

  ## Understanding Gates

- Not all criteria are equal. Some have **gates** --- threshold controls that can block a passing status regardless of how well you score on everything else.
+ Not all criteria are equal. Some have **gates** — threshold controls that can block a passing status regardless of how well you score on everything else.

  | Gate Severity | What Happens |
  |---------------|-------------|
- | **Critical** | If the score is below the gate threshold, the artifact is blocked from passing. The `failure_effect` determines the outcome --- typically **Reject**, or **Not Reviewable** when evidence is too incomplete to score. |
+ | **Critical** | If the score is below the gate threshold, the artifact is blocked from passing. The `failure_effect` determines the outcome — typically **Reject**, or **Not Reviewable** when evidence is too incomplete to score. |
  | **Major** | A low score caps the maximum achievable status (e.g., cannot pass above Conditional Pass). |
  | **Advisory** | A low score triggers a recommendation but does not block any status. |

  In the core rubric, two criteria have **critical** gates: **SCP-01** (Scope and boundary clarity) — if the scope is so unclear that the artifact cannot be reviewed (score < 2), the result is "Not Reviewable" regardless of all other scores; and **CMP-01** (Standards and policy compliance) — if mandatory control compliance cannot be determined, the result is "Reject". **STK-01** (Stakeholder and purpose fit) and **TRC-01** (Traceability) have **major** gates.

- > **Rule: Gates before averages.** Always check gate criteria first. If a critical gate fails, stop --- the result is determined by the gate's `failure_effect` (Reject or Not Reviewable). Only then compute the weighted average for the remaining status thresholds.
+ > **Rule: Gates before averages.** Always check gate criteria first. If a critical gate fails, stop — the result is determined by the gate's `failure_effect` (Reject or Not Reviewable). Only then compute the weighted average for the remaining status thresholds.

  ## Interpreting Your Results

@@ -115,25 +115,25 @@ After scoring all criteria and checking gates, compute the weighted average across
  | Status | Threshold |
  |--------|-----------|
  | **Pass** | No critical gate failure, overall average >= 3.2, and no dimension average < 2.0 |
- | **Conditional Pass** | No critical gate failure, overall average 2.4--3.19 (weaknesses are containable with named actions) |
+ | **Conditional Pass** | No critical gate failure, overall average 2.4–3.19 (weaknesses are containable with named actions) |
  | **Rework Required** | Overall average < 2.4, or repeated weak dimensions, or insufficient evidence |
  | **Reject** | Any critical gate failure, or mandatory control breach |
  | **Not Reviewable** | Evidence too incomplete to score responsibly |
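Read together with the gates-before-averages rule, the table reduces to a short decision procedure. A sketch with assumed inputs; major-gate capping and the evidence-sufficiency check behind Not Reviewable are omitted for brevity:

```typescript
// Sketch of status determination. Thresholds are taken from the table above;
// the input shapes are assumptions for illustration.
type Status = "Pass" | "Conditional Pass" | "Rework Required" | "Reject" | "Not Reviewable";

interface GateResult {
  severity: "critical" | "major" | "advisory";
  failed: boolean;
  failureEffect?: "Reject" | "Not Reviewable"; // meaningful for critical gates
}

function determineStatus(
  gates: GateResult[],
  overallAverage: number,
  dimensionAverages: number[],
): Status {
  // Gates before averages: a failed critical gate decides the outcome outright.
  const critical = gates.find((g) => g.severity === "critical" && g.failed);
  if (critical) return critical.failureEffect ?? "Reject";

  if (overallAverage >= 3.2 && dimensionAverages.every((d) => d >= 2.0)) return "Pass";
  if (overallAverage >= 2.4) return "Conditional Pass";
  return "Rework Required";
}
```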
 
  ![Scoring a criterion — the editor shows the question, scoring guide, evidence fields, and assigned score](/screenshots/editor-evaluation-result.png)

- A Conditional Pass is not a failure --- it means the artifact is close but needs specific, named improvements before it is decision-ready. Record those improvements as actions in the evaluation record.
+ A Conditional Pass is not a failure — it means the artifact is close but needs specific, named improvements before it is decision-ready. Record those improvements as actions in the evaluation record.

  ## Checkpoint: You Are at Level 2 When...

  - [ ] You have completed at least one assessment using the core meta-rubric
- - [ ] Every score has a cited evidence reference --- not "seems adequate" but a specific section, page, or quote
+ - [ ] Every score has a cited evidence reference — not "seems adequate" but a specific section, page, or quote
  - [ ] You can explain the difference between a score of 2 and a score of 3 for any criterion
  - [ ] You understand which gates would block a Pass status and why
  - [ ] Your evaluation result includes a status determination (Pass, Conditional Pass, Rework Required, Reject, or Not Reviewable)

  ## Next Steps

- You now have a reproducible, evidence-backed architecture evaluation. The next step is to scale this from an individual practice to a team-wide governed process --- with artifact-specific profiles, cross-cutting overlays, and calibrated scoring.
+ You now have a reproducible, evidence-backed architecture evaluation. The next step is to scale this from an individual practice to a team-wide governed process — with artifact-specific profiles, cross-cutting overlays, and calibrated scoring.

  Continue to [Governed Review](governed-review.md).
@@ -2,7 +2,7 @@

  > **Level 2 to 3: Rubric-Based to Governed**

- You can score an artifact against the core rubric. Now it is time to make architecture review a team-wide, governed practice --- with artifact-specific profiles, context-driven overlays, calibrated teams, and evidence-anchored scoring that is reproducible across your organization.
+ You can score an artifact against the core rubric. Now it is time to make architecture review a team-wide, governed practice — with artifact-specific profiles, context-driven overlays, calibrated teams, and evidence-anchored scoring that is reproducible across your organization.

  ## What Changes at This Level

@@ -16,7 +16,7 @@ At Level 2, you used the core rubric and produced evidence-backed scores. At Level 3

  ## Choosing a Profile

- The core meta-rubric's 10 criteria are universal --- they apply to every architecture artifact. But a reference architecture has different quality expectations than an ADR, and a capability map is evaluated differently than a roadmap. Profiles add artifact-specific criteria on top of the core --- typically 3 to 9, depending on the artifact type.
+ The core meta-rubric's 10 criteria are universal — they apply to every architecture artifact. But a reference architecture has different quality expectations than an ADR, and a capability map is evaluated differently than a roadmap. Profiles add artifact-specific criteria on top of the core — typically 3 to 9, depending on the artifact type.

  | Profile | Artifact Type | Status | What It Adds |
  |---------|--------------|--------|-------------|
@@ -28,7 +28,7 @@ The core meta-rubric's 10 criteria are universal --- they apply to every architecture artifact.

  > **Status key:** *Approved* profiles have been calibrated and are ready for governed use. *Draft* profiles are usable but have not completed the full calibration process (see [Calibrating with Your Team](#calibrating-with-your-team)).

- Every profile declares `inherits: [EAROS-CORE-002]`. This means when you evaluate a reference architecture, you score it against all 10 core criteria **plus** the profile's additional criteria --- 13--19 criteria total depending on the profile.
+ Every profile declares `inherits: [EAROS-CORE-002]`. This means when you evaluate a reference architecture, you score it against all 10 core criteria **plus** the profile's additional criteria — 13–19 criteria total depending on the profile.

  > **How to choose:** Match the profile to the artifact's declared type. If the artifact does not fit any built-in profile, use the core rubric alone. Creating custom profiles is covered in [Scaling and Optimization](scaling-optimization.md).

@@ -44,7 +44,7 @@ Overlays inject cross-cutting concerns that apply across artifact types. Unlike
  | **Data Governance** (`data-governance.yaml`) | Approved | The artifact describes data flows, data retention, data classification, or data lineage |
  | **Regulatory** (`regulatory.yaml`) | Draft | The artifact operates in a regulated domain: payments, healthcare, financial reporting, privacy |

- Overlays are additive --- they append criteria to the base rubric (core + profile). They cannot remove or weaken gates from the base. An overlay's critical gate adds to the gate model; it does not replace it.
+ Overlays are additive — they append criteria to the base rubric (core + profile). They cannot remove or weaken gates from the base. An overlay's critical gate adds to the gate model; it does not replace it.

  You can apply multiple overlays simultaneously. A payments solution architecture might use the solution-architecture profile with both the security and regulatory overlays.
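The additive composition rule can be sketched in a few lines. The types are illustrative; real rubrics are YAML files validated against `rubric.schema.json`:

```typescript
// Illustrative sketch of additive composition: core + profile + overlays.
// Criterion IDs are assumed unique across the composed parts.
interface Rubric {
  criteria: string[]; // criterion IDs
  gates: string[];    // IDs of gated criteria
}

function compose(core: Rubric, profile: Rubric, overlays: Rubric[]): Rubric {
  const parts = [core, profile, ...overlays];
  // Parts only append; nothing removes or weakens the base's criteria or gates.
  return {
    criteria: parts.flatMap((r) => r.criteria),
    gates: parts.flatMap((r) => r.gates),
  };
}
```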
 
@@ -58,7 +58,7 @@ For each criterion:

  1. Search the artifact for content that addresses the criterion
  2. If you find it, record the evidence anchor: a direct quote, section reference, or diagram ID
- 3. Then --- and only then --- match the evidence against the `scoring_guide` level descriptors
+ 3. Then — and only then — match the evidence against the `scoring_guide` level descriptors
  4. If you cannot find evidence, record N/A and explain why the criterion does not apply, or score 0 and note the absence

  Never score from impression. "The artifact seems to address security" is not evidence. "Section 7.2 states: 'All inter-service communication uses mTLS with certificates rotated every 90 days'" is evidence.
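The ordering the protocol enforces (evidence anchor first, numeric score second) can be made mechanical. A minimal sketch with an assumed record shape:

```typescript
// Refuse a numeric score that has no evidence anchor behind it.
// The shape is an assumption for illustration, not the EAROS schema.
interface ScoredCriterion {
  criterionId: string;
  evidenceAnchor: string | null; // direct quote, section reference, or diagram ID
  score: number | null;          // 0-4, or null for N/A
}

function recordScore(entry: ScoredCriterion): ScoredCriterion {
  if (entry.score !== null && !entry.evidenceAnchor) {
    throw new Error(`${entry.criterionId}: a numeric score requires an evidence anchor first`);
  }
  return entry;
}
```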
@@ -73,7 +73,7 @@ Every piece of evidence you cite must be classified:
  | **Inferred** | A reasonable interpretation of content that is not directly stated | Medium |
  | **External** | Judgment based on a standard, policy, or source outside the artifact | Lowest |

- Observed evidence is always preferred. If you find yourself relying heavily on inferred or external evidence, the artifact may have significant gaps --- which is itself a finding worth recording.
+ Observed evidence is always preferred. If you find yourself relying heavily on inferred or external evidence, the artifact may have significant gaps — which is itself a finding worth recording.

  ## The Three Evaluation Types

@@ -87,7 +87,7 @@ EAROS distinguishes three distinct judgment types that should not be collapsed into

  These are related but distinct. A beautifully written, complete document can describe an architecturally unsound system. A technically excellent architecture can be documented in an unmaintainable artifact. Collapsing these into one score hides critical information.

- In practice, EAROS criteria map to these three types through the dimension structure --- the narrative summary in the evaluation record should address all three perspectives. The rubric's criterion scores provide the evidence base; the narrative synthesizes them into these three distinct judgments.
+ In practice, EAROS criteria map to these three types through the dimension structure — the narrative summary in the evaluation record should address all three perspectives. The rubric's criterion scores provide the evidence base; the narrative synthesizes them into these three distinct judgments.

  ![The rubric editor with file sidebar showing core, profiles, and overlays — the building blocks of governed review](/screenshots/editor-rubric-criteria.png)

@@ -97,7 +97,7 @@ Calibration is what transforms individual scoring into a team capability. Without

  ### Step-by-step calibration exercise

- 1. **Select 3--5 representative artifacts.** Aim for diversity: one strong artifact, one weak, one ambiguous, and one incomplete. The gold-standard example at `examples/aws-event-driven-order-processing/` is an excellent starting point.
+ 1. **Select 3–5 representative artifacts.** Aim for diversity: one strong artifact, one weak, one ambiguous, and one incomplete. The gold-standard example at `examples/aws-event-driven-order-processing/` is an excellent starting point.

  2. **Have 2+ reviewers score independently.** Each reviewer scores the same artifact against the same rubric without discussing their scores.

@@ -125,7 +125,7 @@ For more on evaluation record structure, see the [Getting Started guide](../gett

  ## Checkpoint: You Are at Level 3 When...
  - [ ] Your team uses a matching profile (not just the core rubric) for every assessment
- - [ ] Every score uses the RULERS protocol --- evidence anchor first, then score
+ - [ ] Every score uses the RULERS protocol — evidence anchor first, then score
  - [ ] You have completed a calibration exercise with kappa > 0.70
  - [ ] Overlays are applied based on context (not arbitrarily or never)
  - [ ] Evaluation records are structured and conform to `evaluation.schema.json`
@@ -133,6 +133,6 @@ For more on evaluation record structure, see the [Getting Started guide](../gett

  ## Next Steps

- Your team now produces governed, calibrated, evidence-anchored architecture evaluations. The next step is to bring AI agents into the process --- not to replace human judgment, but to augment it with a second independent perspective.
+ Your team now produces governed, calibrated, evidence-anchored architecture evaluations. The next step is to bring AI agents into the process — not to replace human judgment, but to augment it with a second independent perspective.

  Continue to [Agent-Assisted Evaluation](agent-assisted.md).
@@ -2,12 +2,12 @@

  ## Why Staged Adoption Matters

- Organizations that attempt to leap from ad hoc architecture review to fully automated evaluation almost always fail. The gap is too wide: teams lack the shared vocabulary, calibrated judgment, and institutional habits that make structured review work. This is not a technology problem --- it is a capability maturity problem.
+ Organizations that attempt to leap from ad hoc architecture review to fully automated evaluation almost always fail. The gap is too wide: teams lack the shared vocabulary, calibrated judgment, and institutional habits that make structured review work. This is not a technology problem — it is a capability maturity problem.

  The EAROS Adoption Maturity Model draws on three decades of maturity research:

  - **CMMI** (Capability Maturity Model Integration) established the 5-level progression from initial/ad hoc to optimizing, demonstrating that process maturity is built incrementally.
- - **Gartner IT Score for Enterprise Architecture** identified that EA maturity depends on governance discipline, stakeholder engagement, and measurement --- not tooling alone.
+ - **Gartner IT Score for Enterprise Architecture** identified that EA maturity depends on governance discipline, stakeholder engagement, and measurement — not tooling alone.
  - **OMB EAAF** (Enterprise Architecture Assessment Framework) showed that federal agencies succeed when they build capability in stages aligned to organizational readiness.
  - **TOGAF ACMM** (Architecture Capability Maturity Model) provided the architecture-specific framing: maturity grows from informal practices through defined processes to measured and optimized operations.
@@ -15,25 +15,25 @@ EAROS applies these lessons to a specific domain: architecture artifact evaluation

  ## The Five Levels

- ### Level 1 --- Ad Hoc
+ ### Level 1 — Ad Hoc

  No formal review process. Evaluation quality depends entirely on who happens to review the artifact. Different reviewers apply different mental models, and feedback is inconsistent and unreproducible.

  - **Key practices:** Informal peer review, tribal knowledge
  - **EAROS capabilities:** None (this is the baseline state)
- - **You are here when:** You recognize the problem --- reviews are inconsistent and reviewer-dependent
+ - **You are here when:** You recognize the problem — reviews are inconsistent and reviewer-dependent

- ### Level 2 --- Rubric-Based
+ ### Level 2 — Rubric-Based

- The core rubric is adopted. Every assessment uses the same 9 dimensions and 10 criteria with the 0--4 scoring scale. Evidence is cited for every score. Results are reproducible across reviewers.
+ The core rubric is adopted. Every assessment uses the same 9 dimensions and 10 criteria with the 0–4 scoring scale. Evidence is cited for every score. Results are reproducible across reviewers.

  - **Key practices:** Manual scoring against core meta-rubric, evidence citation for every score, gate checking
- - **EAROS capabilities:** Core meta-rubric, scoring sheets, 0--4 scale
+ - **EAROS capabilities:** Core meta-rubric, scoring sheets, 0–4 scale
  - **You are here when:** You have completed at least one assessment using the core rubric with evidence for every score

  > **Guide:** [Your First Assessment](first-assessment.md) walks you through this transition.

- ### Level 3 --- Governed
+ ### Level 3 — Governed

  Artifact-specific profiles and context-driven overlays are in use. Teams are calibrated against reference examples. The RULERS protocol ensures evidence-anchored scoring. Evaluation records are structured and auditable.

@@ -43,7 +43,7 @@ Artifact-specific profiles and context-driven overlays are in use. Teams are calibrated against reference examples.

  > **Guide:** [Governed Review](governed-review.md) walks you through this transition.

- ### Level 4 --- Hybrid
+ ### Level 4 — Hybrid

  AI agents augment human reviewers. Both evaluate independently and reconcile disagreements against level descriptors. Metrics track inter-rater reliability between human and agent evaluators.

@@ -53,7 +53,7 @@ AI agents augment human reviewers. Both evaluate independently and reconcile disagreements against level descriptors.

  > **Guide:** [Agent-Assisted Evaluation](agent-assisted.md) walks you through this transition.

- ### Level 5 --- Optimized
+ ### Level 5 — Optimized

  Architecture evaluation is continuous and integrated into delivery workflows. Calibration happens automatically. Executive reporting provides portfolio-level quality visibility. Rubrics are governed assets with version control and change management.

@@ -67,13 +67,13 @@ Architecture evaluation is continuous and integrated into delivery workflows. Calibration happens automatically.

  The onboarding guide is organized as four transition guides, one for each level transition:

- 1. [Your First Assessment](first-assessment.md) --- Level 1 to 2: Ad Hoc to Rubric-Based
- 2. [Governed Review](governed-review.md) --- Level 2 to 3: Rubric-Based to Governed
- 3. [Agent-Assisted Evaluation](agent-assisted.md) --- Level 3 to 4: Governed to Hybrid
- 4. [Scaling and Optimization](scaling-optimization.md) --- Level 4 to 5: Hybrid to Optimized
+ 1. [Your First Assessment](first-assessment.md) — Level 1 to 2: Ad Hoc to Rubric-Based
+ 2. [Governed Review](governed-review.md) — Level 2 to 3: Rubric-Based to Governed
+ 3. [Agent-Assisted Evaluation](agent-assisted.md) — Level 3 to 4: Governed to Hybrid
+ 4. [Scaling and Optimization](scaling-optimization.md) — Level 4 to 5: Hybrid to Optimized

  **Sequential reading is recommended.** Each guide builds on concepts introduced in the previous one. However, if you already know your current level from the self-assessment above, you can jump directly to the guide for your next transition.

- > **Tip:** If you are new to EAROS entirely, start with [Your First Assessment](first-assessment.md). It walks you through installation, the core rubric, and your first scored evaluation --- everything you need to move from ad hoc to rubric-based review.
+ > **Tip:** If you are new to EAROS entirely, start with [Your First Assessment](first-assessment.md). It walks you through installation, the core rubric, and your first scored evaluation — everything you need to move from ad hoc to rubric-based review.

  For deeper reference material, see the [Getting Started guide](../getting-started.md), the [Terminology glossary](../terminology.md), and the full EAROS standard in `standard/EAROS.md`.
@@ -2,11 +2,11 @@

  > **Level 4 to 5: Hybrid to Optimized**

- Your team runs hybrid human-agent evaluations with tracked metrics. Now you make architecture review a continuous, automated, organization-wide capability --- integrated into delivery workflows, continuously calibrated, and visible to leadership.
+ Your team runs hybrid human-agent evaluations with tracked metrics. Now you make architecture review a continuous, automated, organization-wide capability — integrated into delivery workflows, continuously calibrated, and visible to leadership.

  ## What Changes at This Level

- At Level 4, evaluation is a deliberate activity: someone decides to review an artifact, assigns reviewers, and orchestrates the process. At Level 5, evaluation becomes embedded in how your organization delivers --- triggered automatically, calibrated continuously, and reported to stakeholders who never touch a rubric YAML.
+ At Level 4, evaluation is a deliberate activity: someone decides to review an artifact, assigns reviewers, and orchestrates the process. At Level 5, evaluation becomes embedded in how your organization delivers — triggered automatically, calibrated continuously, and reported to stakeholders who never touch a rubric YAML.

  ## CI/CD Integration

@@ -28,11 +28,11 @@ After merge, record evaluation results in a time-series store. This enables tren

  ### Architecture as code

- Fitness functions work best when architecture artifacts are machine-readable. EAROS is designed for this --- artifacts conforming to `artifact.schema.json` can be validated, scored, and tracked automatically. Encourage teams to adopt structured artifact formats (YAML with frontmatter, ArchiMate exchange, diagram-as-code) rather than unstructured documents.
+ Fitness functions work best when architecture artifacts are machine-readable. EAROS is designed for this — artifacts conforming to `artifact.schema.json` can be validated, scored, and tracked automatically. Encourage teams to adopt structured artifact formats (YAML with frontmatter, ArchiMate exchange, diagram-as-code) rather than unstructured documents.
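A hypothetical pipeline gate illustrates the idea: read an evaluation record produced earlier in the pipeline and fail the build on a blocking status. The file path and the location of the `status` field are assumptions, not a documented EAROS interface:

```typescript
// Hypothetical CI step: fail the build when the recorded status is blocking.
import { readFileSync } from "node:fs";

const record = JSON.parse(readFileSync("evaluations/latest.json", "utf8"));
const blocking = ["Reject", "Rework Required", "Not Reviewable"];

if (blocking.includes(record.status)) {
  console.error(`Architecture evaluation gate failed: ${record.status}`);
  process.exit(1);
}
```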
 
  ## Continuous Calibration

- At earlier levels, calibration is an event --- a scheduled exercise where reviewers score reference artifacts and compare results. At Level 5, calibration becomes continuous.
+ At earlier levels, calibration is an event — a scheduled exercise where reviewers score reference artifacts and compare results. At Level 5, calibration becomes continuous.

  ### Wasserstein-based alignment
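As background for this section: for one-dimensional distributions over the ordered 0-4 bins, the Wasserstein (earth mover's) distance is just the accumulated difference between the two cumulative distributions. A sketch of the general computation, not the package's `rulers_wasserstein` implementation:

```typescript
// 1-D Wasserstein distance between two score distributions over bins 0..4.
// p and q are probability masses per bin and must each sum to 1.
function wasserstein1d(p: number[], q: number[]): number {
  let cdfP = 0;
  let cdfQ = 0;
  let distance = 0;
  for (let i = 0; i < p.length; i++) {
    cdfP += p[i];
    cdfQ += q[i];
    distance += Math.abs(cdfP - cdfQ); // bin spacing is 1
  }
  return distance;
}

// Example: an agent that skews one level above the human reference distribution.
console.log(wasserstein1d([0.1, 0.2, 0.4, 0.2, 0.1], [0.0, 0.1, 0.2, 0.4, 0.3]));
```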
 
@@ -62,7 +62,7 @@ The five built-in profiles (solution-architecture, reference-architecture, adr,
  4. **Write up to 12 criteria.** Each criterion needs all required fields: `question`, `description`, `scoring_guide` (all 5 levels), `required_evidence`, `anti_patterns`, `examples.good`, `examples.bad`, `decision_tree`, and `remediation_hints`.


- 5. **Calibrate before production.** Score 3--5 representative artifacts with 2+ reviewers. Target kappa > 0.70.
+ 5. **Calibrate before production.** Score 3–5 representative artifacts with 2+ reviewers. Target kappa > 0.70.

  6. **Publish.** Validate against `rubric.schema.json`, add to the manifest, and document in the changelog.
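The required criterion fields from step 4, rendered as a TypeScript shape purely for readability. Real criteria are authored in YAML and validated against `rubric.schema.json`; the nesting below is an assumption:

```typescript
// Reading aid for the field list in step 4; the YAML schema is authoritative.
interface Criterion {
  question: string;
  description: string;
  scoring_guide: Record<0 | 1 | 2 | 3 | 4, string>; // all 5 level descriptors
  required_evidence: string[];
  anti_patterns: string[];
  examples: { good: string; bad: string };
  decision_tree: string; // IF/THEN logic for ambiguous cases
  remediation_hints: string[];
}
```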
 
@@ -74,9 +74,9 @@ For detailed authoring guidance, see the [Profile Authoring Guide](../profile-au

  Use the maturity model itself as a training roadmap. New team members start at Level 1 and progress through the guides:

- - **Week 1:** Complete [Your First Assessment](first-assessment.md) --- score a real artifact against the core rubric
- - **Week 2:** Complete [Governed Review](governed-review.md) --- join a calibration exercise, learn profiles and overlays
- - **Week 3:** Complete [Agent-Assisted Evaluation](agent-assisted.md) --- run a hybrid evaluation and reconcile disagreements
+ - **Week 1:** Complete [Your First Assessment](first-assessment.md) — score a real artifact against the core rubric
+ - **Week 2:** Complete [Governed Review](governed-review.md) — join a calibration exercise, learn profiles and overlays
+ - **Week 3:** Complete [Agent-Assisted Evaluation](agent-assisted.md) — run a hybrid evaluation and reconcile disagreements
  - **Ongoing:** Participate in review rotations and calibration exercises

  ### Governance
@@ -90,7 +90,7 @@ Rubrics are governed assets at Level 5. This means:

  ### Culture

- The most common failure mode for architecture review frameworks is perception. If teams see EAROS as a bureaucratic gate --- a hoop to jump through before deployment --- adoption will be grudging and superficial.
+ The most common failure mode for architecture review frameworks is perception. If teams see EAROS as a bureaucratic gate — a hoop to jump through before deployment — adoption will be grudging and superficial.

  Position EAROS as a quality tool, not a gatekeeping tool:

@@ -110,8 +110,8 @@ The `earos-report` skill generates portfolio-level views from evaluation records

  - **Traffic-light dashboards:** Red/amber/green status for each evaluated artifact, grouped by team, domain, or portfolio
  - **Dimension trends:** Which quality dimensions are improving or declining across the portfolio over time
- - **Gate failure hotspots:** Which criteria most frequently trigger gate failures --- these are systemic weaknesses worth investing in
- - **Remediation tracking:** Status of actions from Conditional Pass evaluations --- are they being completed?
+ - **Gate failure hotspots:** Which criteria most frequently trigger gate failures — these are systemic weaknesses worth investing in
+ - **Remediation tracking:** Status of actions from Conditional Pass evaluations — are they being completed?

  ### Aggregating across the portfolio

@@ -139,7 +139,7 @@ A rising first-pass Pass rate is the strongest signal that EAROS is working: art

  ## Checkpoint: You Are at Level 5 When...
  - [ ] Architecture evaluation is integrated into your CI/CD or delivery pipeline
- - [ ] Calibration happens continuously, not just at setup time --- drift is detected and triggers re-calibration
+ - [ ] Calibration happens continuously, not just at setup time — drift is detected and triggers re-calibration
  - [ ] You create and maintain custom profiles for your organization's artifact types
  - [ ] Executive reporting provides portfolio-level quality visibility on a regular cadence
  - [ ] Rubric updates follow a governed change process (version bumps, owner approval, re-calibration)
@@ -147,7 +147,7 @@ A rising first-pass Pass rate is the strongest signal that EAROS is working: art

  ## What Comes Next

- Level 5 is not a destination --- it is a steady state of continuous improvement. From here:
+ Level 5 is not a destination — it is a steady state of continuous improvement. From here:

  - **Contribute back.** EAROS is open source. If you create profiles for artifact types that others would benefit from, consider contributing them to the project.
  - **Share calibration data.** Cross-organizational calibration data strengthens the framework for everyone. Anonymized score distributions help improve the Wasserstein calibration baselines.