@trohde/earos 1.2.0 → 1.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/assets/init/docs/getting-started.md +1 -1
- package/assets/init/docs/onboarding/agent-assisted.md +19 -19
- package/assets/init/docs/onboarding/first-assessment.md +18 -18
- package/assets/init/docs/onboarding/governed-review.md +10 -10
- package/assets/init/docs/onboarding/overview.md +15 -15
- package/assets/init/docs/onboarding/scaling-optimization.md +13 -13
- package/assets/init/docs/plans/2026-03-23-001-refactor-site-review-findings-plan.md +195 -0
- package/assets/init/docs/plans/2026-03-23-002-refactor-cli-review-findings-plan.md +736 -0
- package/assets/init/docs/terminology.md +1 -1
- package/bin.js +156 -36
- package/dist/assets/{_basePickBy-PmSUrUsK.js → _basePickBy-BlC_TeV6.js} +1 -1
- package/dist/assets/{_baseUniq-HuZouVIz.js → _baseUniq-CVy7rcC1.js} +1 -1
- package/dist/assets/{arc-CJFxtF3d.js → arc-Cd8wvd7z.js} +1 -1
- package/dist/assets/{architectureDiagram-2XIMDMQ5-XA-oU2UG.js → architectureDiagram-2XIMDMQ5-D_f4_aMp.js} +1 -1
- package/dist/assets/{blockDiagram-WCTKOSBZ-Oxp-wAST.js → blockDiagram-WCTKOSBZ-B-y6N5--.js} +1 -1
- package/dist/assets/{c4Diagram-IC4MRINW-D8m5hQH9.js → c4Diagram-IC4MRINW-C3-v3oNT.js} +1 -1
- package/dist/assets/channel-BSC0F15G.js +1 -0
- package/dist/assets/{chunk-4BX2VUAB-D2kBTn2O.js → chunk-4BX2VUAB-CMPwQN83.js} +1 -1
- package/dist/assets/{chunk-55IACEB6-Dxqrf5oZ.js → chunk-55IACEB6-Bdkfhvrr.js} +1 -1
- package/dist/assets/{chunk-FMBD7UC4-DoOEFFQC.js → chunk-FMBD7UC4-ptKQX5uF.js} +1 -1
- package/dist/assets/{chunk-JSJVCQXG-BerphV2K.js → chunk-JSJVCQXG-DO0UU_OX.js} +1 -1
- package/dist/assets/{chunk-KX2RTZJC-CxUAqT05.js → chunk-KX2RTZJC-DRj2OZnD.js} +1 -1
- package/dist/assets/{chunk-NQ4KR5QH-fCqZgFkU.js → chunk-NQ4KR5QH-C4Nsf7ww.js} +1 -1
- package/dist/assets/{chunk-QZHKN3VN-HlpHnJEy.js → chunk-QZHKN3VN-B1GO0Nwy.js} +1 -1
- package/dist/assets/{chunk-WL4C6EOR-D9yxAHyd.js → chunk-WL4C6EOR-lFR6fjR8.js} +1 -1
- package/dist/assets/classDiagram-VBA2DB6C-BHDWMOEz.js +1 -0
- package/dist/assets/classDiagram-v2-RAHNMMFH-BHDWMOEz.js +1 -0
- package/dist/assets/clone-BdN-3iAD.js +1 -0
- package/dist/assets/{cose-bilkent-S5V4N54A-F5xOBvqW.js → cose-bilkent-S5V4N54A-IpR9mVIO.js} +1 -1
- package/dist/assets/{dagre-KLK3FWXG-CD3BTpHv.js → dagre-KLK3FWXG-B4YA6T7N.js} +1 -1
- package/dist/assets/{diagram-E7M64L7V-C3D9MCay.js → diagram-E7M64L7V-Do5l6es_.js} +1 -1
- package/dist/assets/{diagram-IFDJBPK2-zJBVM-GK.js → diagram-IFDJBPK2-D5MxfKVv.js} +1 -1
- package/dist/assets/{diagram-P4PSJMXO-BrmFZOLB.js → diagram-P4PSJMXO-Djr28EgW.js} +1 -1
- package/dist/assets/{erDiagram-INFDFZHY-aSMhKiV2.js → erDiagram-INFDFZHY-BuM-rbCL.js} +1 -1
- package/dist/assets/{flowDiagram-PKNHOUZH-DwgX7l8F.js → flowDiagram-PKNHOUZH-By3WGI7Q.js} +1 -1
- package/dist/assets/{ganttDiagram-A5KZAMGK-C57Hz6QW.js → ganttDiagram-A5KZAMGK-GLmBfK72.js} +1 -1
- package/dist/assets/{gitGraphDiagram-K3NZZRJ6-CuchqqGh.js → gitGraphDiagram-K3NZZRJ6-BN0iXeIv.js} +1 -1
- package/dist/assets/{graph-CPFGBV5J.js → graph-CDzuMtjV.js} +1 -1
- package/dist/assets/{index-DMt1cpG6.js → index-DoeSN_Oe.js} +130 -130
- package/dist/assets/{infoDiagram-LFFYTUFH-Dd_5tfX7.js → infoDiagram-LFFYTUFH-C888gaFw.js} +1 -1
- package/dist/assets/{ishikawaDiagram-PHBUUO56-DwosSEvT.js → ishikawaDiagram-PHBUUO56-ChIO9DG-.js} +1 -1
- package/dist/assets/{journeyDiagram-4ABVD52K-BuCxcsX0.js → journeyDiagram-4ABVD52K-CufMUDcs.js} +1 -1
- package/dist/assets/{kanban-definition-K7BYSVSG-DF_1UCkW.js → kanban-definition-K7BYSVSG-BpsSVpX8.js} +1 -1
- package/dist/assets/{layout-DIcS6m1g.js → layout-B8RWVBSF.js} +1 -1
- package/dist/assets/{linear-BXkwBhoJ.js → linear-BJwxtq9r.js} +1 -1
- package/dist/assets/{mindmap-definition-YRQLILUH-DcDvYagd.js → mindmap-definition-YRQLILUH-C6WPimbf.js} +1 -1
- package/dist/assets/{pieDiagram-SKSYHLDU-BmeDeWDM.js → pieDiagram-SKSYHLDU-DeCGMWf8.js} +1 -1
- package/dist/assets/{quadrantDiagram-337W2JSQ-3zfjULUM.js → quadrantDiagram-337W2JSQ-D9TWaS83.js} +1 -1
- package/dist/assets/{requirementDiagram-Z7DCOOCP-B2wQMJpq.js → requirementDiagram-Z7DCOOCP-DTnuXlAq.js} +1 -1
- package/dist/assets/{sankeyDiagram-WA2Y5GQK-__kKlCTq.js → sankeyDiagram-WA2Y5GQK-B2dplCgD.js} +1 -1
- package/dist/assets/{sequenceDiagram-2WXFIKYE-B7O81Vih.js → sequenceDiagram-2WXFIKYE-cBvgSSju.js} +1 -1
- package/dist/assets/{stateDiagram-RAJIS63D-CcJaDrAK.js → stateDiagram-RAJIS63D-Cwr7VtSX.js} +1 -1
- package/dist/assets/stateDiagram-v2-FVOUBMTO-B59h7VTZ.js +1 -0
- package/dist/assets/{timeline-definition-YZTLITO2-DSaQQqIU.js → timeline-definition-YZTLITO2-Dkp163fK.js} +1 -1
- package/dist/assets/{treemap-KZPCXAKY-9Hcrd8XD.js → treemap-KZPCXAKY-BUWHa5xU.js} +1 -1
- package/dist/assets/{vennDiagram-LZ73GAT5-BqHNyca2.js → vennDiagram-LZ73GAT5-BihD66ma.js} +1 -1
- package/dist/assets/{xychartDiagram-JWTSCODW-BqeYf6Fk.js → xychartDiagram-JWTSCODW-Cw4lPbuZ.js} +1 -1
- package/dist/index.html +1 -1
- package/export-docx.js +12 -4
- package/init.js +19 -14
- package/manifest-cli.mjs +32 -3
- package/package.json +3 -2
- package/serve.js +44 -19
- package/utils/export-markdown.js +486 -0
- package/dist/assets/channel-SoktpVBQ.js +0 -1
- package/dist/assets/classDiagram-VBA2DB6C-BT2AdZTe.js +0 -1
- package/dist/assets/classDiagram-v2-RAHNMMFH-BT2AdZTe.js +0 -1
- package/dist/assets/clone-DOjIfi5r.js +0 -1
- package/dist/assets/stateDiagram-v2-FVOUBMTO-B2goOPt-.js +0 -1
@@ -24,7 +24,7 @@ EaROS has profiles for the most common enterprise architecture artifact types:
 | Architecture Decision Record (ADR) | `profiles/adr.yaml` | Approved |
 | Capability map | `profiles/capability-map.yaml` | Approved |
 | Architecture roadmap | `profiles/roadmap.yaml` | Draft |
-| Other / unknown | Core only: `core/core-meta-rubric.yaml` |
+| Other / unknown | Core only: `core/core-meta-rubric.yaml` | — |

 > **Status:** *Approved* profiles have completed calibration. *Draft* profiles are usable but have not yet been calibrated with inter-rater reliability measured. Check `earos.manifest.yaml` for the latest status of each rubric.

@@ -2,7 +2,7 @@

 > **Level 3 to 4: Governed to Hybrid**

-Your team produces governed, calibrated evaluations. Now you bring AI agents into the process
+Your team produces governed, calibrated evaluations. Now you bring AI agents into the process — not to replace human reviewers, but to provide an independent second perspective that strengthens every assessment.

 ## What Changes at This Level

@@ -14,23 +14,23 @@ Agent evaluations follow an 8-step directed acyclic graph (DAG). Each step must

 ### The 8-Step DAG Evaluation Flow

-**Step 1
+**Step 1 — Structural Validation.** The agent confirms the artifact conforms to its declared type. Does it have the expected sections? Is it machine-readable or does it require OCR? Can the agent identify the artifact's scope and purpose?

-**Step 2
+**Step 2 — Content Extraction.** The agent identifies sections, diagrams, traceability elements, and key content areas. This builds a map of the artifact's structure before scoring begins.

-**Step 3
+**Step 3 — Criterion Scoring.** The agent applies the RULERS protocol to each criterion: extract a direct quote or reference from the artifact as evidence, then match it against the `scoring_guide` level descriptors to assign a 0–4 score. If no evidence can be found, score N/A and explain why.

-**Step 4
+**Step 4 — Cross-Reference Validation.** The agent checks consistency across views: do component names match across diagrams? Do interface definitions agree between the API contract and the sequence diagram? Are there contradictions between sections?

-**Step 5
+**Step 5 — Dimension Aggregation.** The agent computes weighted dimension averages using the dimension weights defined in the rubric.

-**Step 6
+**Step 6 — Challenge Pass.** A second perspective (another agent instance or a human) challenges the evaluator's highest and lowest scores. Are the highest scores supported by strong observed evidence, or are they inflated? Are the lowest scores genuinely that weak, or did the evaluator miss relevant content?

-**Step 7
+**Step 7 — Calibration.** The agent aligns its score distribution to reference human distributions using the Wasserstein-based method (`rulers_wasserstein`). This prevents systematic over-scoring or under-scoring relative to human reviewers.

-**Step 8
+**Step 8 — Status Determination.** Gates are checked first (a critical gate failure blocks a passing status — the specific outcome, `Reject` or `Not Reviewable`, is determined by the criterion's `failure_effect`), then the weighted average is computed and applied against the status thresholds.

-> **The DAG is not optional.** Skipping steps
+> **The DAG is not optional.** Skipping steps — particularly the challenge pass (Step 6) — undermines evaluation quality. An agent evaluation without a challenge pass is an unchecked evaluation.

 ## Setting Up Agent Evaluation

@@ -47,9 +47,9 @@ The workspace includes 10 EAROS skills. The three most relevant for agent evalua

 | Skill | Purpose |
 |-------|---------|
-| `earos-assess` | Primary evaluation
-| `earos-review` | Challenger
-| `earos-template-fill` | Author guide
+| `earos-assess` | Primary evaluation — runs the full 8-step DAG on any artifact |
+| `earos-review` | Challenger — audits an existing evaluation for over-scoring and unsupported claims |
+| `earos-template-fill` | Author guide — coaches artifact authors through writing assessment-ready documents |

 ### With Other AI Agents

@@ -65,7 +65,7 @@ To run an agent assessment, provide the artifact and invoke the `earos-assess` s
 4. Execute the full 8-step DAG
 5. Produce an evaluation record conforming to `evaluation.schema.json`

-The output includes scores, evidence anchors, evidence classes (observed/inferred/external), confidence levels (high/medium/low), and a status determination. Every score is auditable
+The output includes scores, evidence anchors, evidence classes (observed/inferred/external), confidence levels (high/medium/low), and a status determination. Every score is auditable — you can trace each one back to the evidence that supports it.

 ## The Hybrid Model

@@ -75,7 +75,7 @@ The hybrid model is the defining practice of Level 4. Here is how it works:

 2. **Score comparison.** After both evaluations are complete, compare results criterion by criterion.

-3. **Disagreement resolution.** Any disagreement of 2 or more points on the same criterion must be resolved. Do not split the difference
+3. **Disagreement resolution.** Any disagreement of 2 or more points on the same criterion must be resolved. Do not split the difference — go back to the `scoring_guide` level descriptors and determine which score more accurately reflects the evidence.

 4. **Reconciled record.** The final evaluation record captures both evaluators (mode: human and mode: agent) and notes where reconciliation occurred.

@@ -87,7 +87,7 @@ The hybrid model is the defining practice of Level 4. Here is how it works:

 ## The Challenge Pass

-Step 6 of the DAG
+Step 6 of the DAG — the challenge pass — deserves special attention because it is the most commonly skipped step and the most valuable.

 In the challenge pass, a second perspective reviews the evaluation and specifically targets:

@@ -104,11 +104,11 @@ At Level 4, you begin tracking quantitative metrics for evaluation quality:
 | Metric | Target | What It Tells You |
 |--------|--------|------------------|
 | **Cohen's kappa** (human-agent) | > 0.70 | Agreement between human and agent after calibration |
-| **Spearman's rho** (human-agent) | > 0.80 | Rank-order correlation
+| **Spearman's rho** (human-agent) | > 0.80 | Rank-order correlation — do human and agent agree on which criteria are strong vs. weak? |
 | **Gate failure rate** | Track trend | How often critical or major gates fail, and for which criteria |
 | **Score distribution** | Compare over time | Are scores clustering (suggesting rubber-stamping) or well-distributed? |

-Track these metrics per rubric, per team, and over time. A declining kappa suggests calibration drift
+Track these metrics per rubric, per team, and over time. A declining kappa suggests calibration drift — time to re-calibrate.

 ## Checkpoint: You Are at Level 4 When...

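The agreement metrics added in this hunk are standard statistics. As an illustrative sketch only (the library choice and the sample scores below are assumptions, not part of the package), a team could compute both from paired human and agent criterion scores like this:

```python
# Illustrative only: paired human/agent scores on the 0-4 ordinal scale,
# one entry per criterion. The data and thresholds are invented examples.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

human_scores = [3, 2, 4, 1, 3, 2, 3, 0, 2, 3]
agent_scores = [3, 2, 3, 1, 3, 2, 4, 1, 2, 3]

kappa = cohen_kappa_score(human_scores, agent_scores)   # documented target: > 0.70
rho, _p_value = spearmanr(human_scores, agent_scores)   # documented target: > 0.80

print(f"Cohen's kappa: {kappa:.2f}, Spearman's rho: {rho:.2f}")
```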
@@ -116,7 +116,7 @@ Track these metrics per rubric, per team, and over time. A declining kappa sugge
 - [ ] Every agent evaluation includes a challenge pass (Step 6)
 - [ ] Human-agent disagreements of 2 or more points are routinely resolved against level descriptors
 - [ ] You track inter-rater reliability metrics (kappa and/or Spearman's rho)
-- [ ] Agent evaluations are auditable
+- [ ] Agent evaluations are auditable — evidence anchors, evidence classes, and confidence are captured for every score

 ## Next Steps

@@ -8,14 +8,14 @@ This guide walks you from zero to your first scored architecture evaluation. By

 - How to install the EAROS CLI and initialize a workspace
 - The 9 dimensions and 10 criteria of the core meta-rubric
-- How the 0
+- How the 0–4 scoring scale works in practice
 - How to find evidence in an artifact, cite it, and assign a score
 - How gates work and when they override the average
 - How to interpret your evaluation result

 ## Prerequisites

-You need one thing: **an architecture artifact to assess**. This can be any document your organization produces
+You need one thing: **an architecture artifact to assess**. This can be any document your organization produces — a solution design, an ADR, a capability map, a reference architecture, a roadmap. It does not need to be perfect; in fact, a flawed artifact is more instructive for a first assessment.

 ## Installing the CLI

@@ -31,7 +31,7 @@ Then initialize a workspace in your project directory:
 earos init my-workspace
 ```

-This creates a complete EAROS workspace with rubric files, JSON schemas, agent skills, and an `AGENTS.md` file for AI-assisted evaluation. The workspace is self-contained
+This creates a complete EAROS workspace with rubric files, JSON schemas, agent skills, and an `AGENTS.md` file for AI-assisted evaluation. The workspace is self-contained — everything you need is scaffolded into the directory.

 ## Understanding the Workspace

@@ -39,11 +39,11 @@ This creates a complete EAROS workspace with rubric files, JSON schemas, agent s

 The `earos init` command creates a structured directory containing:

-- **Rubric files**
-- **JSON schemas**
-- **Agent skills**
-- **AGENTS.md**
-- **Manifest**
+- **Rubric files** — the core meta-rubric and all built-in profiles and overlays (YAML)
+- **JSON schemas** — for validating rubrics, evaluation records, and artifact documents
+- **Agent skills** — 10 pre-configured skills for AI-assisted evaluation (in `.agents/skills/`)
+- **AGENTS.md** — agent-agnostic instructions for AI tools like Cursor, Copilot, and Windsurf
+- **Manifest** — an inventory of all available rubrics

 ## The Core Meta-Rubric

@@ -61,9 +61,9 @@ The core meta-rubric (`EAROS-CORE-002`) is the universal foundation. It applies
 | **D8: Actionability and implementation relevance** | Can a delivery team act on this artifact without significant guesswork? |
 | **D9: Artifact maintainability and stewardship** | Is the artifact versioned, owned, and structured so it can be maintained over time? |

-For your first assessment, you will score every criterion in every dimension. The core rubric is intentionally compact
+For your first assessment, you will score every criterion in every dimension. The core rubric is intentionally compact — 10 criteria is manageable for a first pass.

-## The 0
+## The 0–4 Scoring Scale

 Every criterion uses the same ordinal scale:

@@ -90,23 +90,23 @@ Follow these steps for each of the 10 criteria:

 3. **Record the evidence reference.** Write down where you found it: section number, page, diagram ID, or a direct quote. "Section 3 states: 'Primary stakeholders are the CTO and Head of Payments'" is valid evidence. "The artifact seems to address this" is not.

-4. **Assign the score.** Match what you found against the level descriptors in the `scoring_guide`. Use the `decision_tree` if you are unsure
+4. **Assign the score.** Match what you found against the level descriptors in the `scoring_guide`. Use the `decision_tree` if you are unsure — it provides IF/THEN logic for resolving ambiguous cases.

 5. **Move to the next criterion.** Repeat for all 10 criteria.

 ## Understanding Gates

-Not all criteria are equal. Some have **gates**
+Not all criteria are equal. Some have **gates** — threshold controls that can block a passing status regardless of how well you score on everything else.

 | Gate Severity | What Happens |
 |---------------|-------------|
-| **Critical** | If the score is below the gate threshold, the artifact is blocked from passing. The `failure_effect` determines the outcome
+| **Critical** | If the score is below the gate threshold, the artifact is blocked from passing. The `failure_effect` determines the outcome — typically **Reject**, or **Not Reviewable** when evidence is too incomplete to score. |
 | **Major** | A low score caps the maximum achievable status (e.g., cannot pass above Conditional Pass). |
 | **Advisory** | A low score triggers a recommendation but does not block any status. |

 In the core rubric, two criteria have **critical** gates: **SCP-01** (Scope and boundary clarity) — if the scope is so unclear that the artifact cannot be reviewed (score < 2), the result is "Not Reviewable" regardless of all other scores; and **CMP-01** (Standards and policy compliance) — if mandatory control compliance cannot be determined, the result is "Reject". **STK-01** (Stakeholder and purpose fit) and **TRC-01** (Traceability) have **major** gates.

-> **Rule: Gates before averages.** Always check gate criteria first. If a critical gate fails, stop
+> **Rule: Gates before averages.** Always check gate criteria first. If a critical gate fails, stop — the result is determined by the gate's `failure_effect` (Reject or Not Reviewable). Only then compute the weighted average for the remaining status thresholds.

 ## Interpreting Your Results

@@ -115,25 +115,25 @@ After scoring all criteria and checking gates, compute the weighted average acro
 | Status | Threshold |
 |--------|-----------|
 | **Pass** | No critical gate failure, overall average >= 3.2, and no dimension average < 2.0 |
-| **Conditional Pass** | No critical gate failure, overall average 2.4
+| **Conditional Pass** | No critical gate failure, overall average 2.4–3.19 (weaknesses are containable with named actions) |
 | **Rework Required** | Overall average < 2.4, or repeated weak dimensions, or insufficient evidence |
 | **Reject** | Any critical gate failure, or mandatory control breach |
 | **Not Reviewable** | Evidence too incomplete to score responsibly |

 

-A Conditional Pass is not a failure
+A Conditional Pass is not a failure — it means the artifact is close but needs specific, named improvements before it is decision-ready. Record those improvements as actions in the evaluation record.

 ## Checkpoint: You Are at Level 2 When...

 - [ ] You have completed at least one assessment using the core meta-rubric
-- [ ] Every score has a cited evidence reference
+- [ ] Every score has a cited evidence reference — not "seems adequate" but a specific section, page, or quote
 - [ ] You can explain the difference between a score of 2 and a score of 3 for any criterion
 - [ ] You understand which gates would block a Pass status and why
 - [ ] Your evaluation result includes a status determination (Pass, Conditional Pass, Rework Required, Reject, or Not Reviewable)

 ## Next Steps

-You now have a reproducible, evidence-backed architecture evaluation. The next step is to scale this from an individual practice to a team-wide governed process
+You now have a reproducible, evidence-backed architecture evaluation. The next step is to scale this from an individual practice to a team-wide governed process — with artifact-specific profiles, cross-cutting overlays, and calibrated scoring.

 Continue to [Governed Review](governed-review.md).
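The gates-before-averages rule and the thresholds quoted in this hunk can be read as a few lines of logic. The sketch below is an illustrative interpretation only: the field names, weights, and data structure are assumptions rather than the package's implementation, and it omits major/advisory gates and the per-dimension minimum for brevity.

```python
# Illustrative sketch of "gates before averages" plus the documented thresholds.
# Criteria structure is invented; the real logic lives in the EAROS rubrics/skills.
# Major/advisory gates and the "no dimension average < 2.0" Pass condition are omitted.
def determine_status(criteria):
    # criteria: list of dicts such as
    # {"id": "SCP-01", "score": 1, "weight": 1.0,
    #  "gate": "critical", "threshold": 2, "failure_effect": "Not Reviewable"}
    for c in criteria:
        if c.get("gate") == "critical" and c["score"] < c["threshold"]:
            return c["failure_effect"]          # e.g. "Reject" or "Not Reviewable"

    total_weight = sum(c["weight"] for c in criteria)
    average = sum(c["score"] * c["weight"] for c in criteria) / total_weight

    if average >= 3.2:
        return "Pass"
    if average >= 2.4:
        return "Conditional Pass"
    return "Rework Required"
```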
@@ -2,7 +2,7 @@

 > **Level 2 to 3: Rubric-Based to Governed**

-You can score an artifact against the core rubric. Now it is time to make architecture review a team-wide, governed practice
+You can score an artifact against the core rubric. Now it is time to make architecture review a team-wide, governed practice — with artifact-specific profiles, context-driven overlays, calibrated teams, and evidence-anchored scoring that is reproducible across your organization.

 ## What Changes at This Level

@@ -16,7 +16,7 @@ At Level 2, you used the core rubric and produced evidence-backed scores. At Lev

 ## Choosing a Profile

-The core meta-rubric's 10 criteria are universal
+The core meta-rubric's 10 criteria are universal — they apply to every architecture artifact. But a reference architecture has different quality expectations than an ADR, and a capability map is evaluated differently than a roadmap. Profiles add artifact-specific criteria on top of the core — typically 3 to 9, depending on the artifact type.

 | Profile | Artifact Type | Status | What It Adds |
 |---------|--------------|--------|-------------|
@@ -28,7 +28,7 @@ The core meta-rubric's 10 criteria are universal --- they apply to every archite

 > **Status key:** *Approved* profiles have been calibrated and are ready for governed use. *Draft* profiles are usable but have not completed the full calibration process (see [Calibrating with Your Team](#calibrating-with-your-team)).

-Every profile declares `inherits: [EAROS-CORE-002]`. This means when you evaluate a reference architecture, you score it against all 10 core criteria **plus** the profile's additional criteria
+Every profile declares `inherits: [EAROS-CORE-002]`. This means when you evaluate a reference architecture, you score it against all 10 core criteria **plus** the profile's additional criteria — 13–19 criteria total depending on the profile.

 > **How to choose:** Match the profile to the artifact's declared type. If the artifact does not fit any built-in profile, use the core rubric alone. Creating custom profiles is covered in [Scaling and Optimization](scaling-optimization.md).

@@ -44,7 +44,7 @@ Overlays inject cross-cutting concerns that apply across artifact types. Unlike
 | **Data Governance** (`data-governance.yaml`) | Approved | The artifact describes data flows, data retention, data classification, or data lineage |
 | **Regulatory** (`regulatory.yaml`) | Draft | The artifact operates in a regulated domain: payments, healthcare, financial reporting, privacy |

-Overlays are additive
+Overlays are additive — they append criteria to the base rubric (core + profile). They cannot remove or weaken gates from the base. An overlay's critical gate adds to the gate model; it does not replace it.

 You can apply multiple overlays simultaneously. A payments solution architecture might use the solution-architecture profile with both the security and regulatory overlays.

@@ -58,7 +58,7 @@ For each criterion:

 1. Search the artifact for content that addresses the criterion
 2. If you find it, record the evidence anchor: a direct quote, section reference, or diagram ID
-3. Then
+3. Then — and only then — match the evidence against the `scoring_guide` level descriptors
 4. If you cannot find evidence, record N/A and explain why the criterion does not apply, or score 0 and note the absence

 Never score from impression. "The artifact seems to address security" is not evidence. "Section 7.2 states: 'All inter-service communication uses mTLS with certificates rotated every 90 days'" is evidence.
@@ -73,7 +73,7 @@ Every piece of evidence you cite must be classified:
 | **Inferred** | A reasonable interpretation of content that is not directly stated | Medium |
 | **External** | Judgment based on a standard, policy, or source outside the artifact | Lowest |

-Observed evidence is always preferred. If you find yourself relying heavily on inferred or external evidence, the artifact may have significant gaps
+Observed evidence is always preferred. If you find yourself relying heavily on inferred or external evidence, the artifact may have significant gaps — which is itself a finding worth recording.

 ## The Three Evaluation Types

@@ -87,7 +87,7 @@ EAROS distinguishes three distinct judgment types that should not be collapsed i

 These are related but distinct. A beautifully written, complete document can describe an architecturally unsound system. A technically excellent architecture can be documented in an unmaintainable artifact. Collapsing these into one score hides critical information.

-In practice, EAROS criteria map to these three types through the dimension structure
+In practice, EAROS criteria map to these three types through the dimension structure — the narrative summary in the evaluation record should address all three perspectives. The rubric's criterion scores provide the evidence base; the narrative synthesizes them into these three distinct judgments.

 

@@ -97,7 +97,7 @@ Calibration is what transforms individual scoring into a team capability. Withou

 ### Step-by-step calibration exercise

-1. **Select 3
+1. **Select 3–5 representative artifacts.** Aim for diversity: one strong artifact, one weak, one ambiguous, and one incomplete. The gold-standard example at `examples/aws-event-driven-order-processing/` is an excellent starting point.

 2. **Have 2+ reviewers score independently.** Each reviewer scores the same artifact against the same rubric without discussing their scores.

@@ -125,7 +125,7 @@ For more on evaluation record structure, see the [Getting Started guide](../gett
 ## Checkpoint: You Are at Level 3 When...

 - [ ] Your team uses a matching profile (not just the core rubric) for every assessment
-- [ ] Every score uses the RULERS protocol
+- [ ] Every score uses the RULERS protocol — evidence anchor first, then score
 - [ ] You have completed a calibration exercise with kappa > 0.70
 - [ ] Overlays are applied based on context (not arbitrarily or never)
 - [ ] Evaluation records are structured and conform to `evaluation.schema.json`
@@ -133,6 +133,6 @@ For more on evaluation record structure, see the [Getting Started guide](../gett

 ## Next Steps

-Your team now produces governed, calibrated, evidence-anchored architecture evaluations. The next step is to bring AI agents into the process
+Your team now produces governed, calibrated, evidence-anchored architecture evaluations. The next step is to bring AI agents into the process — not to replace human judgment, but to augment it with a second independent perspective.

 Continue to [Agent-Assisted Evaluation](agent-assisted.md).
@@ -2,12 +2,12 @@

 ## Why Staged Adoption Matters

-Organizations that attempt to leap from ad hoc architecture review to fully automated evaluation almost always fail. The gap is too wide: teams lack the shared vocabulary, calibrated judgment, and institutional habits that make structured review work. This is not a technology problem
+Organizations that attempt to leap from ad hoc architecture review to fully automated evaluation almost always fail. The gap is too wide: teams lack the shared vocabulary, calibrated judgment, and institutional habits that make structured review work. This is not a technology problem — it is a capability maturity problem.

 The EAROS Adoption Maturity Model draws on three decades of maturity research:

 - **CMMI** (Capability Maturity Model Integration) established the 5-level progression from initial/ad hoc to optimizing, demonstrating that process maturity is built incrementally.
-- **Gartner IT Score for Enterprise Architecture** identified that EA maturity depends on governance discipline, stakeholder engagement, and measurement
+- **Gartner IT Score for Enterprise Architecture** identified that EA maturity depends on governance discipline, stakeholder engagement, and measurement — not tooling alone.
 - **OMB EAAF** (Enterprise Architecture Assessment Framework) showed that federal agencies succeed when they build capability in stages aligned to organizational readiness.
 - **TOGAF ACMM** (Architecture Capability Maturity Model) provided the architecture-specific framing: maturity grows from informal practices through defined processes to measured and optimized operations.

@@ -15,25 +15,25 @@ EAROS applies these lessons to a specific domain: architecture artifact evaluati

 ## The Five Levels

-### Level 1
+### Level 1 — Ad Hoc

 No formal review process. Evaluation quality depends entirely on who happens to review the artifact. Different reviewers apply different mental models, and feedback is inconsistent and unreproducible.

 - **Key practices:** Informal peer review, tribal knowledge
 - **EAROS capabilities:** None (this is the baseline state)
-- **You are here when:** You recognize the problem
+- **You are here when:** You recognize the problem — reviews are inconsistent and reviewer-dependent

-### Level 2
+### Level 2 — Rubric-Based

-The core rubric is adopted. Every assessment uses the same 9 dimensions and 10 criteria with the 0
+The core rubric is adopted. Every assessment uses the same 9 dimensions and 10 criteria with the 0–4 scoring scale. Evidence is cited for every score. Results are reproducible across reviewers.

 - **Key practices:** Manual scoring against core meta-rubric, evidence citation for every score, gate checking
-- **EAROS capabilities:** Core meta-rubric, scoring sheets, 0
+- **EAROS capabilities:** Core meta-rubric, scoring sheets, 0–4 scale
 - **You are here when:** You have completed at least one assessment using the core rubric with evidence for every score

 > **Guide:** [Your First Assessment](first-assessment.md) walks you through this transition.

-### Level 3
+### Level 3 — Governed

 Artifact-specific profiles and context-driven overlays are in use. Teams are calibrated against reference examples. The RULERS protocol ensures evidence-anchored scoring. Evaluation records are structured and auditable.

@@ -43,7 +43,7 @@ Artifact-specific profiles and context-driven overlays are in use. Teams are cal

 > **Guide:** [Governed Review](governed-review.md) walks you through this transition.

-### Level 4
+### Level 4 — Hybrid

 AI agents augment human reviewers. Both evaluate independently and reconcile disagreements against level descriptors. Metrics track inter-rater reliability between human and agent evaluators.

@@ -53,7 +53,7 @@ AI agents augment human reviewers. Both evaluate independently and reconcile dis

 > **Guide:** [Agent-Assisted Evaluation](agent-assisted.md) walks you through this transition.

-### Level 5
+### Level 5 — Optimized

 Architecture evaluation is continuous and integrated into delivery workflows. Calibration happens automatically. Executive reporting provides portfolio-level quality visibility. Rubrics are governed assets with version control and change management.

@@ -67,13 +67,13 @@ Architecture evaluation is continuous and integrated into delivery workflows. Ca

 The onboarding guide is organized as four transition guides, one for each level transition:

-1. [Your First Assessment](first-assessment.md)
-2. [Governed Review](governed-review.md)
-3. [Agent-Assisted Evaluation](agent-assisted.md)
-4. [Scaling and Optimization](scaling-optimization.md)
+1. [Your First Assessment](first-assessment.md) — Level 1 to 2: Ad Hoc to Rubric-Based
+2. [Governed Review](governed-review.md) — Level 2 to 3: Rubric-Based to Governed
+3. [Agent-Assisted Evaluation](agent-assisted.md) — Level 3 to 4: Governed to Hybrid
+4. [Scaling and Optimization](scaling-optimization.md) — Level 4 to 5: Hybrid to Optimized

 **Sequential reading is recommended.** Each guide builds on concepts introduced in the previous one. However, if you already know your current level from the self-assessment above, you can jump directly to the guide for your next transition.

-> **Tip:** If you are new to EAROS entirely, start with [Your First Assessment](first-assessment.md). It walks you through installation, the core rubric, and your first scored evaluation
+> **Tip:** If you are new to EAROS entirely, start with [Your First Assessment](first-assessment.md). It walks you through installation, the core rubric, and your first scored evaluation — everything you need to move from ad hoc to rubric-based review.

 For deeper reference material, see the [Getting Started guide](../getting-started.md), the [Terminology glossary](../terminology.md), and the full EAROS standard in `standard/EAROS.md`.
@@ -2,11 +2,11 @@

 > **Level 4 to 5: Hybrid to Optimized**

-Your team runs hybrid human-agent evaluations with tracked metrics. Now you make architecture review a continuous, automated, organization-wide capability
+Your team runs hybrid human-agent evaluations with tracked metrics. Now you make architecture review a continuous, automated, organization-wide capability — integrated into delivery workflows, continuously calibrated, and visible to leadership.

 ## What Changes at This Level

-At Level 4, evaluation is a deliberate activity: someone decides to review an artifact, assigns reviewers, and orchestrates the process. At Level 5, evaluation becomes embedded in how your organization delivers
+At Level 4, evaluation is a deliberate activity: someone decides to review an artifact, assigns reviewers, and orchestrates the process. At Level 5, evaluation becomes embedded in how your organization delivers — triggered automatically, calibrated continuously, and reported to stakeholders who never touch a rubric YAML.

 ## CI/CD Integration

@@ -28,11 +28,11 @@ After merge, record evaluation results in a time-series store. This enables tren

 ### Architecture as code

-Fitness functions work best when architecture artifacts are machine-readable. EAROS is designed for this
+Fitness functions work best when architecture artifacts are machine-readable. EAROS is designed for this — artifacts conforming to `artifact.schema.json` can be validated, scored, and tracked automatically. Encourage teams to adopt structured artifact formats (YAML with frontmatter, ArchiMate exchange, diagram-as-code) rather than unstructured documents.

 ## Continuous Calibration

-At earlier levels, calibration is an event
+At earlier levels, calibration is an event — a scheduled exercise where reviewers score reference artifacts and compare results. At Level 5, calibration becomes continuous.

 ### Wasserstein-based alignment

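The Wasserstein-based alignment referenced here (and the `rulers_wasserstein` method mentioned in the agent-assisted guide) compares score distributions rather than individual scores. A minimal Python illustration of that idea, assuming `scipy` and invented sample data and tolerance (this is not the package's implementation):

```python
# Illustrative drift check only: compare a reference human score distribution
# with recent agent scores. The data and the 0.5 tolerance are invented.
from scipy.stats import wasserstein_distance

human_reference = [3, 3, 2, 4, 2, 3, 1, 3, 2, 3]   # calibrated human scores (0-4)
agent_recent = [4, 4, 3, 4, 3, 4, 2, 4, 3, 4]      # recent agent scores (0-4)

drift = wasserstein_distance(human_reference, agent_recent)
if drift > 0.5:
    print(f"Distribution drift {drift:.2f} exceeds tolerance; re-calibrate.")
```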
@@ -62,7 +62,7 @@ The five built-in profiles (solution-architecture, reference-architecture, adr,

 4. **Write up to 12 criteria.** Each criterion needs all required fields: `question`, `description`, `scoring_guide` (all 5 levels), `required_evidence`, `anti_patterns`, `examples.good`, `examples.bad`, `decision_tree`, and `remediation_hints`.

-5. **Calibrate before production.** Score 3
+5. **Calibrate before production.** Score 3–5 representative artifacts with 2+ reviewers. Target kappa > 0.70.

 6. **Publish.** Validate against `rubric.schema.json`, add to the manifest, and document in the changelog.

@@ -74,9 +74,9 @@ For detailed authoring guidance, see the [Profile Authoring Guide](../profile-au

 Use the maturity model itself as a training roadmap. New team members start at Level 1 and progress through the guides:

-- **Week 1:** Complete [Your First Assessment](first-assessment.md)
-- **Week 2:** Complete [Governed Review](governed-review.md)
-- **Week 3:** Complete [Agent-Assisted Evaluation](agent-assisted.md)
+- **Week 1:** Complete [Your First Assessment](first-assessment.md) — score a real artifact against the core rubric
+- **Week 2:** Complete [Governed Review](governed-review.md) — join a calibration exercise, learn profiles and overlays
+- **Week 3:** Complete [Agent-Assisted Evaluation](agent-assisted.md) — run a hybrid evaluation and reconcile disagreements
 - **Ongoing:** Participate in review rotations and calibration exercises

 ### Governance
@@ -90,7 +90,7 @@ Rubrics are governed assets at Level 5. This means:

 ### Culture

-The most common failure mode for architecture review frameworks is perception. If teams see EAROS as a bureaucratic gate
+The most common failure mode for architecture review frameworks is perception. If teams see EAROS as a bureaucratic gate — a hoop to jump through before deployment — adoption will be grudging and superficial.

 Position EAROS as a quality tool, not a gatekeeping tool:

@@ -110,8 +110,8 @@ The `earos-report` skill generates portfolio-level views from evaluation records

 - **Traffic-light dashboards:** Red/amber/green status for each evaluated artifact, grouped by team, domain, or portfolio
 - **Dimension trends:** Which quality dimensions are improving or declining across the portfolio over time
-- **Gate failure hotspots:** Which criteria most frequently trigger gate failures
-- **Remediation tracking:** Status of actions from Conditional Pass evaluations
+- **Gate failure hotspots:** Which criteria most frequently trigger gate failures — these are systemic weaknesses worth investing in
+- **Remediation tracking:** Status of actions from Conditional Pass evaluations — are they being completed?

 ### Aggregating across the portfolio

@@ -139,7 +139,7 @@ A rising first-pass Pass rate is the strongest signal that EAROS is working: art
 ## Checkpoint: You Are at Level 5 When...

 - [ ] Architecture evaluation is integrated into your CI/CD or delivery pipeline
-- [ ] Calibration happens continuously, not just at setup time
+- [ ] Calibration happens continuously, not just at setup time — drift is detected and triggers re-calibration
 - [ ] You create and maintain custom profiles for your organization's artifact types
 - [ ] Executive reporting provides portfolio-level quality visibility on a regular cadence
 - [ ] Rubric updates follow a governed change process (version bumps, owner approval, re-calibration)
@@ -147,7 +147,7 @@ A rising first-pass Pass rate is the strongest signal that EAROS is working: art

 ## What Comes Next

-Level 5 is not a destination
+Level 5 is not a destination — it is a steady state of continuous improvement. From here:

 - **Contribute back.** EAROS is open source. If you create profiles for artifact types that others would benefit from, consider contributing them to the project.
 - **Share calibration data.** Cross-organizational calibration data strengthens the framework for everyone. Anonymized score distributions help improve the Wasserstein calibration baselines.