@trohde/earos 1.0.2 → 1.1.1

@@ -0,0 +1,125 @@
1
+ # Agent-Assisted Evaluation
2
+
3
+ > **Level 3 to 4: Governed to Hybrid**
4
+
5
+ Your team produces governed, calibrated evaluations. Now you bring AI agents into the process --- not to replace human reviewers, but to provide an independent second perspective that strengthens every assessment.
6
+
7
+ ## What Changes at This Level
8
+
9
+ At Level 3, human reviewers follow the RULERS protocol with calibrated judgment. At Level 4, AI agents evaluate the same artifacts independently, and the two perspectives are reconciled into a single, stronger evaluation. This hybrid model consistently outperforms either approach alone: humans catch nuance that agents miss, while agents catch consistency issues and coverage gaps that humans overlook.
10
+
11
+ ## How AI Evaluation Works in EAROS
12
+
13
+ Agent evaluations follow an 8-step directed acyclic graph (DAG). Each step must complete before the next begins. No steps may be skipped.
14
+
15
+ ### The 8-Step DAG Evaluation Flow
16
+
17
+ **Step 1 --- Structural Validation.** The agent confirms the artifact conforms to its declared type. Does it have the expected sections? Is it machine-readable or does it require OCR? Can the agent identify the artifact's scope and purpose?
18
+
19
+ **Step 2 --- Content Extraction.** The agent identifies sections, diagrams, traceability elements, and key content areas. This builds a map of the artifact's structure before scoring begins.
20
+
21
+ **Step 3 --- Criterion Scoring.** The agent applies the RULERS protocol to each criterion: extract a direct quote or reference from the artifact as evidence, then match it against the `scoring_guide` level descriptors to assign a 0--4 score. If no evidence can be found, score N/A and explain why.
22
+
23
+ **Step 4 --- Cross-Reference Validation.** The agent checks consistency across views: do component names match across diagrams? Do interface definitions agree between the API contract and the sequence diagram? Are there contradictions between sections?
24
+
25
+ **Step 5 --- Dimension Aggregation.** The agent computes weighted dimension averages using the dimension weights defined in the rubric.
26
+
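As an illustration of Step 5, dimension aggregation can be sketched as follows. This is a minimal sketch, assuming criterion scores are averaged per dimension and that N/A criteria are excluded; the actual aggregation logic is defined by the rubric and may differ in detail.

```python
def aggregate(criterion_scores, dimension_weights):
    """Average criterion scores per dimension, then combine the
    dimension averages using the rubric's dimension weights.
    criterion_scores: {criterion_id: (dimension, score or None)}."""
    by_dim = {}
    for dim, score in criterion_scores.values():
        if score is not None:  # N/A criteria are excluded
            by_dim.setdefault(dim, []).append(score)
    dim_avg = {d: sum(v) / len(v) for d, v in by_dim.items()}
    total_w = sum(dimension_weights[d] for d in dim_avg)
    overall = sum(dim_avg[d] * dimension_weights[d] for d in dim_avg) / total_w
    return dim_avg, overall
```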
27
+ **Step 6 --- Challenge Pass.** A second perspective (another agent instance or a human) challenges the evaluator's highest and lowest scores. Are the highest scores supported by strong observed evidence, or are they inflated? Are the lowest scores genuinely that weak, or did the evaluator miss relevant content?
28
+
29
+ **Step 7 --- Calibration.** The agent aligns its score distribution to reference human distributions using the Wasserstein-based method (`rulers_wasserstein`). This prevents systematic over-scoring or under-scoring relative to human reviewers.
30
+
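The exact procedure behind `rulers_wasserstein` is not spelled out here, but the core idea can be sketched: in one dimension, the transport plan that minimizes the Wasserstein distance simply matches sorted ranks, so agent scores can be aligned to a human reference distribution rank by rank. A hypothetical sketch, assuming equal-size samples:

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size score samples:
    the mean absolute difference between sorted values."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def quantile_map(agent_scores, human_reference):
    """Align agent scores to the human reference distribution by
    matching sorted ranks (the optimal 1-D transport plan)."""
    order = sorted(range(len(agent_scores)), key=lambda i: agent_scores[i])
    human_sorted = sorted(human_reference)
    calibrated = [0.0] * len(agent_scores)
    for rank, i in enumerate(order):
        calibrated[i] = float(human_sorted[rank])
    return calibrated
```

After mapping, the distance to the reference distribution is zero by construction; a real calibration step would shrink scores toward the reference rather than replace them outright.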
31
+ **Step 8 --- Status Determination.** Gates are checked first (critical gate failure equals Reject), then the weighted average is computed and applied against the status thresholds.
32
+
33
+ > **The DAG is not optional.** Skipping steps --- particularly the challenge pass (Step 6) --- undermines evaluation quality. An agent evaluation without a challenge pass is an unchecked evaluation.
34
+
35
+ ## Setting Up Agent Evaluation
36
+
37
+ ### With Claude Code
38
+
39
+ The `earos init` command scaffolds agent skills into `.claude/skills/` in your workspace. These are ready to use immediately:
40
+
41
+ ```bash
42
+ earos init my-workspace
43
+ cd my-workspace
44
+ ```
45
+
46
+ The workspace includes 10 EAROS skills. The three most relevant for agent evaluation are:
47
+
48
+ | Skill | Purpose |
49
+ |-------|---------|
50
+ | `earos-assess` | Primary evaluation --- runs the full 8-step DAG on any artifact |
51
+ | `earos-review` | Challenger --- audits an existing evaluation for over-scoring and unsupported claims |
52
+ | `earos-template-fill` | Author guide --- coaches artifact authors through writing assessment-ready documents |
53
+
54
+ ### With Other AI Agents
55
+
56
+ For Cursor, Copilot, Windsurf, and other AI tools, `earos init` also creates `.agents/skills/` with agent-agnostic skill files and an `AGENTS.md` at the workspace root. These provide the same evaluation capabilities without Claude-specific conventions.
57
+
58
+ ## Running Your First Agent Assessment
59
+
60
+ To run an agent assessment, provide the artifact and invoke the `earos-assess` skill. The agent will:
61
+
62
+ 1. Read the manifest to discover available profiles and overlays
63
+ 2. Ask you to confirm the artifact type and applicable overlays (or auto-detect them)
64
+ 3. Load the matching rubric (core + profile + overlays)
65
+ 4. Execute the full 8-step DAG
66
+ 5. Produce an evaluation record conforming to `evaluation.schema.json`
67
+
68
+ The output includes scores, evidence anchors, evidence classes (observed/inferred/external), confidence levels (high/medium/low), and a status determination. Every score is auditable --- you can trace each one back to the evidence that supports it.
69
+
70
+ ## The Hybrid Model
71
+
72
+ The hybrid model is the defining practice of Level 4. Here is how it works:
73
+
74
+ 1. **Independent evaluation.** The human reviewer and the AI agent evaluate the same artifact independently. Neither sees the other's scores during evaluation.
75
+
76
+ 2. **Score comparison.** After both evaluations are complete, compare results criterion by criterion.
77
+
78
+ 3. **Disagreement resolution.** Any disagreement of 2 or more points on the same criterion must be resolved. Do not split the difference --- go back to the `scoring_guide` level descriptors and determine which score more accurately reflects the evidence.
79
+
80
+ 4. **Reconciled record.** The final evaluation record captures both evaluators (mode: human and mode: agent) and notes where reconciliation occurred.
81
+
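The 2-point rule in step 3 is mechanical enough that the flagging (though not the resolution) can be automated. A small sketch, with criterion IDs as dictionary keys:

```python
def flag_disagreements(human_scores, agent_scores, threshold=2):
    """Return the criteria where human and agent scores differ by
    `threshold` or more points. These must be resolved against the
    scoring_guide descriptors, never split down the middle."""
    return sorted(
        cid for cid in human_scores
        if cid in agent_scores
        and abs(human_scores[cid] - agent_scores[cid]) >= threshold
    )
```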
82
+ ### Why hybrid outperforms either approach alone
83
+
84
+ - **Agents** excel at consistency checking, exhaustive coverage, and pattern-matching against level descriptors. They do not get fatigued, skip criteria, or let familiarity bias inflate scores.
85
+ - **Humans** excel at contextual judgment, organizational knowledge, and assessing whether an architecture genuinely fits the business context. They catch subtleties that level descriptors cannot encode.
86
+ - **Together**, they produce evaluations with fewer false positives (over-scoring) and fewer false negatives (missed strengths).
87
+
88
+ ## The Challenge Pass
89
+
90
+ Step 6 of the DAG --- the challenge pass --- deserves special attention because it is the most commonly skipped step and the most valuable.
91
+
92
+ In the challenge pass, a second perspective reviews the evaluation and specifically targets:
93
+
94
+ - **The highest scores:** Is the evidence truly strong enough for a 4? Or was the agent being generous because the topic was mentioned, even if coverage was thin?
95
+ - **The lowest scores:** Did the evaluator genuinely find no evidence, or did they miss relevant content in a different section?
96
+ - **Gate-relevant criteria:** Are gate scores accurate? A single incorrect gate score can change the entire outcome.
97
+
98
+ The challenger records notes directly in the evaluation record. These notes are part of the audit trail.
99
+
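A challenger can be pointed at the right criteria mechanically, even though the audit itself remains a judgment task. A sketch of target selection under those three priorities:

```python
def challenge_targets(scores, gated_criteria, k=2):
    """Select the criteria a challenge pass probes first: the k lowest
    and k highest scores (missed-evidence and inflation risk) plus
    every gate-bearing criterion, since a single wrong gate score can
    flip the whole outcome."""
    ranked = sorted(scores, key=scores.get)
    targets = set(ranked[:k]) | set(ranked[-k:]) | (gated_criteria & scores.keys())
    return sorted(targets)
```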
100
+ ## Tracking Metrics
101
+
102
+ At Level 4, you begin tracking quantitative metrics for evaluation quality:
103
+
104
+ | Metric | Target | What It Tells You |
105
+ |--------|--------|------------------|
106
+ | **Cohen's kappa** (human-agent) | > 0.70 | Agreement between human and agent after calibration |
107
+ | **Spearman's rho** (human-agent) | > 0.80 | Rank-order correlation --- do human and agent agree on which criteria are strong vs. weak? |
108
+ | **Gate failure rate** | Track trend | How often critical or major gates fail, and for which criteria |
109
+ | **Score distribution** | Compare over time | Are scores clustering (suggesting rubber-stamping) or well-distributed? |
110
+
111
+ Track these metrics per rubric, per team, and over time. A declining kappa suggests calibration drift --- time to re-calibrate.
112
+
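Both reliability metrics can be computed from two score vectors over the same criteria, with no statistics library required. As one example, here is a sketch of Spearman's rho using tie-aware average ranks:

```python
def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```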
113
+ ## Checkpoint: You Are at Level 4 When...
114
+
115
+ - [ ] AI agent evaluations follow the complete 8-step DAG with no steps skipped
116
+ - [ ] Every agent evaluation includes a challenge pass (Step 6)
117
+ - [ ] Human-agent disagreements of 2 or more points are routinely resolved against level descriptors
118
+ - [ ] You track inter-rater reliability metrics (kappa and/or Spearman's rho)
119
+ - [ ] Agent evaluations are auditable --- evidence anchors, evidence classes, and confidence are captured for every score
120
+
121
+ ## Next Steps
122
+
123
+ You now have a hybrid evaluation practice that combines human judgment with AI consistency. The final step is to scale this across your organization, integrate it into delivery pipelines, and build a continuous improvement loop.
124
+
125
+ Continue to [Scaling and Optimization](scaling-optimization.md).
@@ -0,0 +1,133 @@
1
+ # Your First Assessment
2
+
3
+ > **Level 1 to 2: Ad Hoc to Rubric-Based**
4
+
5
+ This guide walks you from zero to your first scored architecture evaluation. By the end, you will have installed EAROS, understood the core rubric, and completed a real assessment with evidence-backed scores.
6
+
7
+ ## What You Will Learn
8
+
9
+ - How to install the EAROS CLI and initialize a workspace
10
+ - The 9 dimensions and 10 criteria of the core meta-rubric
11
+ - How the 0--4 scoring scale works in practice
12
+ - How to find evidence in an artifact, cite it, and assign a score
13
+ - How gates work and when they override the average
14
+ - How to interpret your evaluation result
15
+
16
+ ## Prerequisites
17
+
18
+ You need one thing: **an architecture artifact to assess**. This can be any document your organization produces --- a solution design, an ADR, a capability map, a reference architecture, a roadmap. It does not need to be perfect; in fact, a flawed artifact is more instructive for a first assessment.
19
+
20
+ ## Installing the CLI
21
+
22
+ Install the EAROS CLI globally from npm:
23
+
24
+ ```bash
25
+ npm install -g @trohde/earos
26
+ ```
27
+
28
+ Then initialize a workspace in your project directory:
29
+
30
+ ```bash
31
+ earos init my-workspace
32
+ ```
33
+
34
+ This creates a complete EAROS workspace with rubric files, JSON schemas, agent skills, and an `AGENTS.md` file for AI-assisted evaluation. The workspace is self-contained --- everything you need is scaffolded into the directory.
35
+
36
+ ## Understanding the Workspace
37
+
38
+ The `earos init` command creates a structured directory containing:
39
+
40
+ - **Rubric files** --- the core meta-rubric and all built-in profiles and overlays (YAML)
41
+ - **JSON schemas** --- for validating rubrics, evaluation records, and artifact documents
42
+ - **Agent skills** --- 10 pre-configured skills for AI-assisted evaluation (in `.claude/skills/` and `.agents/skills/`)
43
+ - **AGENTS.md** --- agent-agnostic instructions for AI tools like Cursor, Copilot, and Windsurf
44
+ - **Manifest** --- an inventory of all available rubrics
45
+
46
+ ## The Core Meta-Rubric
47
+
48
+ The core meta-rubric (`EAROS-CORE-002`) is the universal foundation. It applies to every architecture artifact regardless of type. It defines **9 dimensions** with **10 criteria**:
49
+
50
+ | Dimension | What It Assesses |
51
+ |-----------|-----------------|
52
+ | **D1: Stakeholder and purpose fit** | Does the artifact identify who it is for, why it exists, and what decision it supports? |
53
+ | **D2: Scope and boundary clarity** | Does it define what is in scope, out of scope, and the assumptions that underpin the design? |
54
+ | **D3: Concern coverage and viewpoint appropriateness** | Are the views and diagrams appropriate for the stated audience and purpose? |
55
+ | **D4: Traceability to drivers, requirements, and principles** | Can design choices be traced back to the business drivers that motivated them? |
56
+ | **D5: Internal consistency and integrity** | Are terms, structures, and interfaces consistent across all sections and views? |
57
+ | **D6: Risks, assumptions, constraints, and tradeoffs** | Does the artifact make trade-offs visible and document risks with mitigations? |
58
+ | **D7: Standards and policy compliance** | Does the design comply with mandatory organizational standards and controls? |
59
+ | **D8: Actionability and implementation relevance** | Can a delivery team act on this artifact without significant guesswork? |
60
+ | **D9: Artifact maintainability and stewardship** | Is the artifact versioned, owned, and structured so it can be maintained over time? |
61
+
62
+ For your first assessment, you will score every criterion in every dimension. The core rubric is intentionally compact: a set of 10 criteria is manageable for a first pass.
63
+
64
+ ## The 0--4 Scoring Scale
65
+
66
+ Every criterion uses the same ordinal scale:
67
+
68
+ | Score | Label | What It Means in Practice |
69
+ |-------|-------|--------------------------|
70
+ | **4** | Strong | The criterion is fully addressed. You can point to specific, well-evidenced content that directly satisfies the requirement. No meaningful gaps. |
71
+ | **3** | Good | Clearly addressed with adequate evidence. Minor gaps exist but do not undermine the artifact's fitness for purpose. |
72
+ | **2** | Partial | The artifact explicitly addresses this area, but coverage is incomplete, inconsistent, or weakly evidenced. You can see the intent but not the execution. |
73
+ | **1** | Weak | The criterion is acknowledged or implied, but the treatment is inadequate for decision support. A reviewer would need to ask significant clarifying questions. |
74
+ | **0** | Absent | No meaningful evidence exists, or the evidence directly contradicts the criterion. |
75
+ | **N/A** | Not applicable | The criterion genuinely does not apply in this context. Every N/A must be justified. |
76
+
77
+ > **Key insight:** The difference between a 2 and a 3 is the difference between "they tried" and "they succeeded." A score of 2 means the author addressed the topic but left material gaps. A score of 3 means the content is adequate for its purpose with only minor improvements possible.
78
+
79
+ ## Walkthrough: Scoring Your First Artifact
80
+
81
+ Follow these steps for each of the 10 criteria:
82
+
83
+ 1. **Read the criterion question and scoring guide.** Open the core rubric YAML and find the criterion. Read the `question`, the `description`, and all five levels of the `scoring_guide`.
84
+
85
+ 2. **Search the artifact for evidence.** Look for the specific content the criterion requires. The `required_evidence` field tells you exactly what to look for (e.g., "purpose statement," "stakeholder list," "risk list with mitigations").
86
+
87
+ 3. **Record the evidence reference.** Write down where you found it: section number, page, diagram ID, or a direct quote. "Section 3 states: 'Primary stakeholders are the CTO and Head of Payments'" is valid evidence. "The artifact seems to address this" is not.
88
+
89
+ 4. **Assign the score.** Match what you found against the level descriptors in the `scoring_guide`. Use the `decision_tree` if you are unsure --- it provides IF/THEN logic for resolving ambiguous cases.
90
+
91
+ 5. **Move to the next criterion.** Repeat for all 10 criteria.
92
+
93
+ ## Understanding Gates
94
+
95
+ Not all criteria are equal. Some have **gates** --- threshold controls that can block a passing status regardless of how well you score on everything else.
96
+
97
+ | Gate Severity | What Happens |
98
+ |---------------|-------------|
99
+ | **Critical** | If the score is below the gate threshold, the artifact is automatically **Rejected**. No amount of high scores elsewhere can override this. |
100
+ | **Major** | A low score caps the maximum achievable status (e.g., cannot pass above Conditional Pass). |
101
+ | **Advisory** | A low score triggers a recommendation but does not block any status. |
102
+
103
+ In the core rubric, **SCP-01** (Scope and boundary clarity) has a **critical** gate: if the scope is so unclear that the artifact cannot be reviewed (score < 2), the result is "Not Reviewable" regardless of all other scores. **STK-01** (Stakeholder and purpose fit) and **TRC-01** (Traceability) have **major** gates.
104
+
105
+ > **Rule: Gates before averages.** Always check gate criteria first. If a critical gate fails, stop --- the result is Reject. Only then compute the weighted average for the remaining status thresholds.
106
+
107
+ ## Interpreting Your Results
108
+
109
+ After scoring all criteria and checking gates, compute the weighted average across dimensions and apply the status thresholds:
110
+
111
+ | Status | Threshold |
112
+ |--------|-----------|
113
+ | **Pass** | No critical gate failure, overall average >= 3.2, and no dimension average < 2.0 |
114
+ | **Conditional Pass** | No critical gate failure, overall average 2.4--3.19 (weaknesses are containable with named actions) |
115
+ | **Rework Required** | Overall average < 2.4, or repeated weak dimensions, or insufficient evidence |
116
+ | **Reject** | Any critical gate failure, or mandatory control breach |
117
+ | **Not Reviewable** | Evidence too incomplete to score responsibly |
118
+
119
+ A Conditional Pass is not a failure --- it means the artifact is close but needs specific, named improvements before it is decision-ready. Record those improvements as actions in the evaluation record.
120
+
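The gates-before-averages rule and the threshold table combine into a short decision procedure. A simplified sketch (major-gate caps and the Not Reviewable case are omitted; the authoritative status logic is defined by the rubric):

```python
def determine_status(dimension_averages, dimension_weights, critical_gate_failed):
    """Gates first: a critical gate failure is a Reject before any
    averaging. Otherwise compute the weighted overall average and
    apply the status thresholds."""
    if critical_gate_failed:
        return "Reject"
    total_w = sum(dimension_weights[d] for d in dimension_averages)
    overall = sum(
        avg * dimension_weights[d] for d, avg in dimension_averages.items()
    ) / total_w
    if overall >= 3.2 and min(dimension_averages.values()) >= 2.0:
        return "Pass"
    if overall >= 2.4:
        return "Conditional Pass"
    return "Rework Required"
```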
121
+ ## Checkpoint: You Are at Level 2 When...
122
+
123
+ - [ ] You have completed at least one assessment using the core meta-rubric
124
+ - [ ] Every score has a cited evidence reference --- not "seems adequate" but a specific section, page, or quote
125
+ - [ ] You can explain the difference between a score of 2 and a score of 3 for any criterion
126
+ - [ ] You understand which gates would block a Pass status and why
127
+ - [ ] Your evaluation result includes a status determination (Pass, Conditional Pass, Rework Required, Reject, or Not Reviewable)
128
+
129
+ ## Next Steps
130
+
131
+ You now have a reproducible, evidence-backed architecture evaluation. The next step is to scale this from an individual practice to a team-wide governed process --- with artifact-specific profiles, cross-cutting overlays, and calibrated scoring.
132
+
133
+ Continue to [Governed Review](governed-review.md).
@@ -0,0 +1,128 @@
1
+ # Governed Review
2
+
3
+ > **Level 2 to 3: Rubric-Based to Governed**
4
+
5
+ You can score an artifact against the core rubric. Now it is time to make architecture review a team-wide, governed practice --- with artifact-specific profiles, context-driven overlays, calibrated teams, and evidence-anchored scoring that is reproducible across your organization.
6
+
7
+ ## What Changes at This Level
8
+
9
+ At Level 2, you used the core rubric and produced evidence-backed scores. At Level 3, three things change:
10
+
11
+ 1. **Artifact-specific profiles** add criteria tailored to the type of artifact being reviewed
12
+ 2. **Overlays** inject cross-cutting concerns (security, data governance, regulatory) based on context
13
+ 3. **Calibration** ensures that different reviewers on your team produce substantially similar scores for the same artifact
14
+
15
+ ## Choosing a Profile
16
+
17
+ The core meta-rubric's 10 criteria are universal --- they apply to every architecture artifact. But a reference architecture has different quality expectations than an ADR, and a capability map is evaluated differently than a roadmap. Profiles add 5--12 artifact-specific criteria on top of the core.
18
+
19
+ | Profile | Artifact Type | What It Adds |
20
+ |---------|--------------|-------------|
21
+ | `solution-architecture.yaml` | Solution designs, HLDs, LLDs | Implementation specificity, integration patterns, deployment readiness |
22
+ | `reference-architecture.yaml` | Reference architectures, platform blueprints | Architectural views, pattern reusability, adoption guidance, evolution strategy |
23
+ | `adr.yaml` | Architecture Decision Records | Decision context, options analysis, consequence tracking, revisit triggers |
24
+ | `capability-map.yaml` | Capability maps, business architecture | Capability decomposition, business alignment, gap analysis |
25
+ | `roadmap.yaml` | Architecture roadmaps, transition plans | Sequencing, dependency management, milestone definition, risk on timeline |
26
+
27
+ Every profile declares `inherits: [EAROS-CORE-002]`. This means when you evaluate a reference architecture, you score it against all 10 core criteria **plus** the profile's additional criteria --- typically 19--21 criteria total.
28
+
29
+ > **How to choose:** Match the profile to the artifact's declared type. If the artifact does not fit any built-in profile, use the core rubric alone. Creating custom profiles is covered in [Scaling and Optimization](scaling-optimization.md).
30
+
31
+ ## Applying Overlays
32
+
33
+ Overlays inject cross-cutting concerns that apply across artifact types. Unlike profiles, overlays are applied **based on context**, not based on artifact type.
34
+
35
+ | Overlay | Apply When... |
36
+ |---------|--------------|
37
+ | **Security** (`security.yaml`) | The design touches authentication, authorization, encryption, personal data, or network boundaries |
38
+ | **Data Governance** (`data-governance.yaml`) | The artifact describes data flows, data retention, data classification, or data lineage |
39
+ | **Regulatory** (`regulatory.yaml`) | The artifact operates in a regulated domain: payments, healthcare, financial reporting, privacy |
40
+
41
+ Overlays are additive --- they append criteria to the base rubric (core + profile). They cannot remove or weaken gates from the base. An overlay's critical gate adds to the gate model; it does not replace it.
42
+
43
+ You can apply multiple overlays simultaneously. A payments solution architecture might use the solution-architecture profile with both the security and regulatory overlays.
44
+
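The composition rule can be sketched directly: each layer appends criteria and never removes anything from the base. This is an illustrative sketch, not the package's resolver, and the field name is assumed rather than taken from the YAML schema:

```python
def compose_rubric(core, profile=None, overlays=()):
    """Effective criteria set = core + profile additions + each
    overlay's additions. Purely additive: no layer removes criteria
    or weakens gates defined by the base."""
    criteria = list(core["criteria"])
    if profile is not None:
        criteria += profile["criteria"]
    for overlay in overlays:
        criteria += overlay["criteria"]
    return criteria
```

For the payments example above, the effective rubric would be the 10 core criteria, plus the solution-architecture profile's criteria, plus the security and regulatory overlay criteria.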
45
+ ## The RULERS Protocol
46
+
47
+ RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring) is the discipline that makes scores reproducible. The core principle is simple:
48
+
49
+ **Extract the evidence first. Then assign the score.**
50
+
51
+ For each criterion:
52
+
53
+ 1. Search the artifact for content that addresses the criterion
54
+ 2. If you find it, record the evidence anchor: a direct quote, section reference, or diagram ID
55
+ 3. Then --- and only then --- match the evidence against the `scoring_guide` level descriptors
56
+ 4. If you cannot find evidence, record N/A and explain why the criterion does not apply, or score 0 and note the absence
57
+
58
+ Never score from impression. "The artifact seems to address security" is not evidence. "Section 7.2 states: 'All inter-service communication uses mTLS with certificates rotated every 90 days'" is evidence.
59
+
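The evidence-first ordering can even be enforced in tooling by refusing to accept a score until an anchor exists. A sketch with illustrative field names (not the fields of `evaluation.schema.json`):

```python
def record_score(criterion_id, evidence, level):
    """Accept a score only after an evidence anchor is supplied.
    evidence: (anchor, quote, evidence_class) or None."""
    if evidence is None:
        # No evidence: record N/A with justification, or 0 with a noted absence.
        return {"criterion": criterion_id, "score": None,
                "note": "no evidence found; justify N/A or score 0"}
    anchor, quote, evidence_class = evidence
    assert evidence_class in {"observed", "inferred", "external"}
    assert level in {0, 1, 2, 3, 4}
    return {"criterion": criterion_id, "score": level,
            "anchor": anchor, "quote": quote, "class": evidence_class}
```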
60
+ ## Evidence Classes
61
+
62
+ Every piece of evidence you cite must be classified:
63
+
64
+ | Class | Definition | Credibility |
65
+ |-------|-----------|-------------|
66
+ | **Observed** | Directly supported by a quote or excerpt from the artifact | Highest |
67
+ | **Inferred** | A reasonable interpretation of content that is not directly stated | Medium |
68
+ | **External** | Judgment based on a standard, policy, or source outside the artifact | Lowest |
69
+
70
+ Observed evidence is always preferred. If you find yourself relying heavily on inferred or external evidence, the artifact may have significant gaps --- which is itself a finding worth recording.
71
+
72
+ ## The Three Evaluation Types
73
+
74
+ EAROS distinguishes three distinct judgment types that must never be merged into a single score:
75
+
76
+ | Type | Question It Answers | Example |
77
+ |------|-------------------|---------|
78
+ | **Artifact quality** | Is the document complete, coherent, clear, traceable, and fit for purpose? | "The document is well-structured but missing a deployment view" |
79
+ | **Architectural fitness** | Does the described architecture appear sound relative to business drivers, quality attributes, and risks? | "The architecture lacks a caching strategy despite a sub-200ms latency requirement" |
80
+ | **Governance fit** | Does the artifact comply with mandatory principles, standards, controls, and review expectations? | "The design uses a non-approved message broker without an exception record" |
81
+
82
+ These are related but distinct. A beautifully written, complete document can describe an architecturally unsound system. A technically excellent architecture can be documented in an unmaintainable artifact. Collapsing these into one score hides critical information.
83
+
84
+ ## Calibrating with Your Team
85
+
86
+ Calibration is what transforms individual scoring into a team capability. Without it, "a score of 3" means something different to each reviewer.
87
+
88
+ ### Step-by-step calibration exercise
89
+
90
+ 1. **Select 3--5 representative artifacts.** Aim for diversity: one strong artifact, one weak, one ambiguous, and one incomplete. The gold-standard example at `examples/aws-event-driven-order-processing/` is an excellent starting point.
91
+
92
+ 2. **Have 2+ reviewers score independently.** Each reviewer scores the same artifact against the same rubric without discussing their scores.
93
+
94
+ 3. **Compare scores criterion by criterion.** Build a comparison table showing each reviewer's score for each criterion.
95
+
96
+ 4. **Identify disagreements.** Any disagreement greater than 1 point on the same criterion requires discussion.
97
+
98
+ 5. **Resolve against level descriptors.** Do not compromise to a middle score. Go back to the `scoring_guide` and determine which level descriptor most accurately matches the evidence. Update your shared understanding of what each level means.
99
+
100
+ 6. **Compute Cohen's kappa.** This measures inter-rater reliability. Target kappa > 0.70 (substantial agreement) for well-defined criteria and > 0.50 for more subjective criteria. If you fall short, repeat the exercise with a different artifact.
101
+
102
+ 7. **Update decision trees.** Where disagreements clustered, refine the `decision_tree` for those criteria to reduce future ambiguity.
103
+
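The kappa in step 6 needs no statistics library. An unweighted sketch (a weighted kappa, which gives partial credit to near-misses on an ordinal scale, is often preferable for 0-4 scores):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: observed agreement corrected for
    the agreement expected from each rater's score frequencies."""
    assert len(rater_a) == len(rater_b) > 0
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[s] * freq_b[s] for s in freq_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical score
        return 1.0
    return (observed - expected) / (1.0 - expected)
```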
104
+ ## Setting Up a Review Cadence
105
+
106
+ Governed review is not a one-time activity. Establish a regular rhythm:
107
+
108
+ - **Architecture review board** meets on a defined schedule (weekly, biweekly, or per-milestone)
109
+ - **Evaluation records** are created for every reviewed artifact, conforming to `evaluation.schema.json`
110
+ - **Actions from Conditional Pass** results are tracked to completion with named owners and deadlines
111
+ - **Re-calibration** happens quarterly or when new team members join the review rotation
112
+
113
+ For more on evaluation record structure, see the [Getting Started guide](../getting-started.md). For terminology definitions (Cohen's kappa, evidence class, RULERS), see the [Terminology glossary](../terminology.md).
114
+
115
+ ## Checkpoint: You Are at Level 3 When...
116
+
117
+ - [ ] Your team uses a matching profile (not just the core rubric) for every assessment
118
+ - [ ] Every score uses the RULERS protocol --- evidence anchor first, then score
119
+ - [ ] You have completed a calibration exercise with kappa > 0.70
120
+ - [ ] Overlays are applied based on context (not arbitrarily or never)
121
+ - [ ] Evaluation records are structured and conform to `evaluation.schema.json`
122
+ - [ ] The three evaluation types (artifact quality, architectural fitness, governance fit) are reported separately in every evaluation
123
+
124
+ ## Next Steps
125
+
126
+ Your team now produces governed, calibrated, evidence-anchored architecture evaluations. The next step is to bring AI agents into the process --- not to replace human judgment, but to augment it with a second independent perspective.
127
+
128
+ Continue to [Agent-Assisted Evaluation](agent-assisted.md).
@@ -0,0 +1,127 @@
1
+ # EAROS Adoption Maturity Model
2
+
3
+ ## Why Staged Adoption Matters
4
+
5
+ Organizations that attempt to leap from ad hoc architecture review to fully automated evaluation almost always fail. The gap is too wide: teams lack the shared vocabulary, calibrated judgment, and institutional habits that make structured review work. This is not a technology problem --- it is a capability maturity problem.
6
+
7
+ The EAROS Adoption Maturity Model draws on three decades of maturity research:
8
+
9
+ - **CMMI** (Capability Maturity Model Integration) established the 5-level progression from initial/ad hoc to optimizing, demonstrating that process maturity is built incrementally.
10
+ - **Gartner IT Score for Enterprise Architecture** identified that EA maturity depends on governance discipline, stakeholder engagement, and measurement --- not tooling alone.
11
+ - **OMB EAAF** (Enterprise Architecture Assessment Framework) showed that federal agencies succeed when they build capability in stages aligned to organizational readiness.
12
+ - **TOGAF ACMM** (Architecture Capability Maturity Model) provided the architecture-specific framing: maturity grows from informal practices through defined processes to measured and optimized operations.
13
+
14
+ EAROS applies these lessons to a specific domain: architecture artifact evaluation. Each level builds on the previous one. Skip a level and you build on sand.
15
+
16
+ ## The Five Levels
17
+
18
+ ### Level 1 --- Ad Hoc
19
+
20
+ No formal review process. Evaluation quality depends entirely on who happens to review the artifact. Different reviewers apply different mental models, and feedback is inconsistent and unreproducible.
21
+
22
+ - **Key practices:** Informal peer review, tribal knowledge
23
+ - **EAROS capabilities:** None (this is the baseline state)
24
+ - **You are here when:** You recognize the problem --- reviews are inconsistent and reviewer-dependent
25
+
26
+ ---
27
+
28
+ ### Level 2 --- Rubric-Based
29
+
30
+ The core rubric is adopted. Every assessment uses the same 9 dimensions and 10 criteria with the 0--4 scoring scale. Evidence is cited for every score. Results are reproducible across reviewers.
31
+
32
+ - **Key practices:** Manual scoring against core meta-rubric, evidence citation for every score, gate checking
33
+ - **EAROS capabilities:** Core meta-rubric, scoring sheets, 0--4 scale
34
+ - **You are here when:** You have completed at least one assessment using the core rubric with evidence for every score
35
+
36
+ > **Guide:** [Your First Assessment](first-assessment.md) walks you through this transition.
37
+
38
+ ---
39
+
40
+ ### Level 3 --- Governed
41
+
42
+ Artifact-specific profiles and context-driven overlays are in use. Teams are calibrated against reference examples. The RULERS protocol ensures evidence-anchored scoring. Evaluation records are structured and auditable.
43
+
44
+ - **Key practices:** Profile and overlay selection, RULERS protocol, team calibration exercises, structured evaluation records
45
+ - **EAROS capabilities:** Profiles, overlays, calibration, RULERS protocol, evidence classes
46
+ - **You are here when:** Your team has completed a calibration exercise with Cohen's kappa > 0.70 and profiles are matched to artifact types
47
+
48
+ > **Guide:** [Governed Review](governed-review.md) walks you through this transition.
49
+
50
+ ---
51
+
52
+ ### Level 4 --- Hybrid
53
+
54
+ AI agents augment human reviewers. Both evaluate independently and reconcile disagreements against level descriptors. Metrics track inter-rater reliability between human and agent evaluators.
55
+
56
+ - **Key practices:** Human + AI agent independent evaluation, challenge pass, disagreement resolution, reliability tracking
57
+ - **EAROS capabilities:** Agent skills, DAG evaluation flow, challenge pass, hybrid mode
58
+ - **You are here when:** Human-agent disagreements are routinely resolved and inter-rater reliability is tracked
59
+
60
+ > **Guide:** [Agent-Assisted Evaluation](agent-assisted.md) walks you through this transition.
61
+
62
+ ---
63
+
64
+ ### Level 5 --- Optimized
65
+
66
+ Architecture evaluation is continuous and integrated into delivery workflows. Calibration happens automatically. Executive reporting provides portfolio-level quality visibility. Rubrics are governed assets with version control and change management.
67
+
68
+ - **Key practices:** CI/CD integration, continuous calibration, custom profile creation, executive reporting, fitness functions
69
+ - **EAROS capabilities:** Executive reporting, fitness functions, Wasserstein calibration, custom profiles
70
+ - **You are here when:** Evaluation is integrated into pipelines, calibration drift is monitored, and portfolio-level reporting is operational
71
+
72
+ > **Guide:** [Scaling and Optimization](scaling-optimization.md) walks you through this transition.
73
+
74
+ ## Where Are You Today?
75
+
76
+ Use this self-assessment to identify your current level. Answer each question honestly --- the goal is to find your starting point, not to score well.
77
+
78
+ ### Level 1 --- Ad Hoc (you are here if most answers are "no")
79
+
80
+ - [ ] Do you have a written set of criteria for architecture review?
81
+ - [ ] Do two reviewers produce substantially similar feedback on the same artifact?
82
+ - [ ] Can you explain what "good" looks like for an architecture artifact in your organization?
83
+
84
+ ### Level 2 --- Rubric-Based (you are here if you can answer "yes" to these)
85
+
86
+ - [ ] You use a defined rubric with explicit criteria and scoring levels
87
+ - [ ] Every score has a cited evidence reference (not just "seems adequate")
88
+ - [ ] You can explain the difference between a score of 2 and 3 for any criterion
89
+ - [ ] You check gates before computing averages
90
+
91
+ ### Level 3 --- Governed (you are here if you can answer "yes" to these)
92
+
93
+ - [ ] You select profiles matched to artifact types (not just the core rubric)
94
+ - [ ] You apply overlays based on context (security, data governance, regulatory)
95
+ - [ ] Your team has completed a calibration exercise with inter-rater agreement measured
96
+ - [ ] Evidence is classified as observed, inferred, or external
97
+ - [ ] Artifact quality, architectural fitness, and governance fit are reported separately
98
+
99
+ ### Level 4 --- Hybrid (you are here if you can answer "yes" to these)
100
+
101
+ - [ ] AI agents evaluate artifacts using the full 8-step DAG evaluation flow
102
+ - [ ] Human and agent evaluations are compared and reconciled
103
+ - [ ] A challenge pass reviews the highest and lowest scores for every evaluation
104
+ - [ ] You track inter-rater reliability metrics between human and agent evaluators
105
+
106
+ ### Level 5 --- Optimized (you are here if you can answer "yes" to these)
107
+
108
+ - [ ] Architecture evaluation is integrated into your CI/CD or delivery pipeline
109
+ - [ ] Calibration runs continuously, not just at setup time
110
+ - [ ] You create and maintain custom profiles for your organization's artifact types
111
+ - [ ] Executive reporting provides portfolio-level quality visibility
112
+ - [ ] Rubric changes follow a governed process with version bumps and re-calibration
113
+
114
+ ## How to Use This Guide
115
+
116
+ The onboarding guide is organized as four pages, one for each level transition:
117
+
118
+ 1. [Your First Assessment](first-assessment.md) --- Level 1 to 2: Ad Hoc to Rubric-Based
119
+ 2. [Governed Review](governed-review.md) --- Level 2 to 3: Rubric-Based to Governed
120
+ 3. [Agent-Assisted Evaluation](agent-assisted.md) --- Level 3 to 4: Governed to Hybrid
121
+ 4. [Scaling and Optimization](scaling-optimization.md) --- Level 4 to 5: Hybrid to Optimized
122
+
123
+ **Sequential reading is recommended.** Each guide builds on concepts introduced in the previous one. However, if you already know your current level from the self-assessment above, you can jump directly to the guide for your next transition.
124
+
125
+ > **Tip:** If you are new to EAROS entirely, start with [Your First Assessment](first-assessment.md). It walks you through installation, the core rubric, and your first scored evaluation --- everything you need to move from ad hoc to rubric-based review.
126
+
127
+ For deeper reference material, see the [Getting Started guide](../getting-started.md), the [Terminology glossary](../terminology.md), and the full EAROS standard in `standard/EAROS.md`.
@@ -0,0 +1,156 @@
1
+ # Scaling and Optimization
2
+
3
+ > **Level 4 to 5: Hybrid to Optimized**
4
+
5
+ Your team runs hybrid human-agent evaluations with tracked metrics. Now you make architecture review a continuous, automated, organization-wide capability --- integrated into delivery workflows, continuously calibrated, and visible to leadership.
6
+
7
+ ## What Changes at This Level
8
+
9
+ At Level 4, evaluation is a deliberate activity: someone decides to review an artifact, assigns reviewers, and orchestrates the process. At Level 5, evaluation becomes embedded in how your organization delivers --- triggered automatically, calibrated continuously, and reported to stakeholders who never touch a rubric YAML.
10
+
11
+ ## CI/CD Integration
12
+
13
+ Architecture fitness functions are automated tests that verify an architecture meets quality attribute targets. At Level 5, EAROS evaluation becomes one of those fitness functions.
14
+
15
+ ### Pre-merge gate
16
+
17
+ Embed an EAROS evaluation as a quality gate in your pull request or merge request pipeline. When an architecture artifact is modified:
18
+
19
+ 1. The pipeline detects the change (e.g., a modified `artifact.yaml` or solution design document)
20
+ 2. An agent evaluation runs against the matching rubric (core + profile + applicable overlays)
21
+ 3. If the status is **Reject** or **Not Reviewable**, the merge is blocked
22
+ 4. If the status is **Conditional Pass**, the merge proceeds but actions are created automatically
23
+ 5. If the status is **Pass**, the merge proceeds cleanly
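The branching in steps 3--5 can be sketched as a small decision function (a minimal illustration; the function name and return shape are assumptions, not part of the EAROS CLI):

```python
def gate_decision(status):
    """Map an evaluation status to a pre-merge gate outcome.

    Returns (merge_allowed, create_actions).
    """
    if status in ("Reject", "Not Reviewable"):
        return (False, False)   # block the merge
    if status == "Conditional Pass":
        return (True, True)     # merge proceeds, remediation actions created
    if status == "Pass":
        return (True, False)    # merge proceeds cleanly
    raise ValueError(f"unknown evaluation status: {status}")
```

In a real pipeline this function would sit behind whatever exit-code or status-check mechanism your CI system uses.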
24
+
25
+ ### Post-merge quality tracking
26
+
27
+ After merge, record evaluation results in a time-series store. This enables trend analysis: is artifact quality improving or degrading over time? Which dimensions are consistently weak? Which teams produce the strongest artifacts?
28
+
29
+ ### Architecture as code
30
+
31
+ Fitness functions work best when architecture artifacts are machine-readable. EAROS is designed for this --- artifacts conforming to `artifact.schema.json` can be validated, scored, and tracked automatically. Encourage teams to adopt structured artifact formats (YAML with frontmatter, ArchiMate exchange, diagram-as-code) rather than unstructured documents.
32
+
33
+ ## Continuous Calibration
34
+
35
+ At earlier levels, calibration is an event --- a scheduled exercise where reviewers score reference artifacts and compare results. At Level 5, calibration becomes continuous.
36
+
37
+ ### Wasserstein-based alignment
38
+
39
+ The RULERS protocol uses the Wasserstein distance to measure how far an agent's score distribution has drifted from the reference human distribution. When the distance exceeds a threshold, re-calibration is triggered automatically.
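On the integer 0--4 score grid, the 1-Wasserstein distance between two score distributions reduces to the sum of absolute CDF differences. A minimal sketch (the 0.5 threshold is an illustrative assumption, not an EAROS default):

```python
def wasserstein_1d(p, q):
    """1-Wasserstein distance between two score distributions over 0..4.

    p and q are probability masses indexed by score; on an integer grid
    the distance is the sum of absolute CDF differences.
    """
    dist, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for score in range(5):
        cdf_p += p[score]
        cdf_q += q[score]
        dist += abs(cdf_p - cdf_q)
    return dist

# Agent scores shifted up roughly one level relative to the human reference:
human = [0.0, 0.1, 0.4, 0.4, 0.1]
agent = [0.0, 0.0, 0.1, 0.4, 0.5]
drift = wasserstein_1d(human, agent)  # 0.9, above an example 0.5 threshold
```

A distance of 1.0 corresponds to every score drifting by exactly one level, which gives the metric an intuitive unit for setting thresholds.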
40
+
41
+ ### Calibration triggers
42
+
43
+ - **Profile update:** When a profile's criteria, scoring guides, or decision trees are modified, re-calibrate against the gold set before production use
44
+ - **Agent model change:** When the underlying AI model is updated, run the calibration suite to verify score alignment
45
+ - **Drift detection:** Monitor Wasserstein distance over rolling evaluation windows; alert when it exceeds your threshold
46
+
47
+ ### Maintaining the gold set
48
+
49
+ The gold set (`calibration/gold-set/`) contains reference artifacts with known scores. As your organization's standards evolve, the gold set must evolve too. Add new reference artifacts periodically and retire outdated ones. Every gold-set artifact needs scores from at least 2 calibrated human reviewers.
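The two-reviewer rule is easy to enforce mechanically. A sketch (the record shape is an assumption about how you store gold-set scores, not an EAROS format):

```python
def gold_set_gaps(gold_set, min_reviewers=2):
    """Return gold-set artifact ids that lack the minimum reviewer count.

    gold_set maps artifact id -> list of (reviewer, score) tuples.
    """
    return [
        artifact_id
        for artifact_id, scores in gold_set.items()
        if len({reviewer for reviewer, _ in scores}) < min_reviewers
    ]
```

Running a check like this whenever the gold set changes keeps retired or half-scored reference artifacts from silently weakening calibration.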
50
+
51
+ ## Creating Custom Profiles
52
+
53
+ The five built-in profiles (solution-architecture, reference-architecture, adr, capability-map, roadmap) cover the most common artifact types. Your organization likely has others: integration designs, data architecture documents, migration plans, platform specifications, API governance records.
54
+
55
+ ### The 6-step process
56
+
57
+ 1. **Qualify the need.** The artifact type must recur enough to justify standardization, and the core rubric alone must be insufficient.
58
+
59
+ 2. **Choose a design method.** EAROS defines five approaches: decision-centred (A), viewpoint-centred (B), lifecycle-centred (C), risk-centred (D), and pattern-library (E). Select the one that matches your artifact's primary concern.
60
+
61
+ 3. **Start from the template.** Copy `templates/new-profile.template.yaml` and set the required fields: `kind: profile`, `inherits: [EAROS-CORE-002]`, and `design_method`.
62
+
63
+ 4. **Write 5--12 criteria.** Each criterion needs all required fields: `question`, `description`, `scoring_guide` (all 5 levels), `required_evidence`, `anti_patterns`, `examples.good`, `examples.bad`, `decision_tree`, and `remediation_hints`.
64
+
65
+ 5. **Calibrate before production.** Score 3--5 representative artifacts with 2+ reviewers. Target kappa > 0.70.
66
+
67
+ 6. **Publish.** Validate against `rubric.schema.json`, add to the manifest, and document in the changelog.
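Before step 6, the required-field check from step 4 can be automated with a simple lint pass. This is an illustrative top-level check only; `rubric.schema.json` remains the authoritative validator:

```python
REQUIRED_FIELDS = {
    "question", "description", "scoring_guide", "required_evidence",
    "anti_patterns", "examples", "decision_tree", "remediation_hints",
}

def missing_fields(criterion):
    """List required fields absent from a criterion dict (per step 4 above)."""
    missing = sorted(REQUIRED_FIELDS - criterion.keys())
    # The scoring guide must describe all five levels (0-4).
    if "scoring_guide" in criterion and len(criterion["scoring_guide"]) != 5:
        missing.append("scoring_guide: all 5 levels")
    return missing
```

A criterion that comes back with an empty list still needs schema validation and calibration, but it will not fail on obvious omissions.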
68
+
69
+ For detailed authoring guidance, see the [Profile Authoring Guide](../profile-authoring-guide.md). The `earos-create` skill can walk you through the entire process interactively.
70
+
71
+ ## Organizational Rollout
72
+
73
+ ### Training
74
+
75
+ Use the maturity model itself as a training roadmap. New team members start at Level 1 and progress through the guides:
76
+
77
+ - **Week 1:** Complete [Your First Assessment](first-assessment.md) --- score a real artifact against the core rubric
78
+ - **Week 2:** Complete [Governed Review](governed-review.md) --- join a calibration exercise, learn profiles and overlays
79
+ - **Week 3:** Complete [Agent-Assisted Evaluation](agent-assisted.md) --- run a hybrid evaluation and reconcile disagreements
80
+ - **Ongoing:** Participate in review rotations and calibration exercises
81
+
82
+ ### Governance
83
+
84
+ Rubrics are governed assets at Level 5. This means:
85
+
86
+ - **Version control:** All rubric changes go through pull requests with peer review
87
+ - **Owner approval:** Each rubric has a designated owner (typically a principal or lead architect) who approves changes
88
+ - **Semver versioning:** Major changes (scoring model, gate structure) bump the major version; new criteria bump minor; documentation fixes bump patch
89
+ - **Re-calibration:** Every rubric change that affects scoring requires re-calibration before the updated rubric enters production use
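The semver policy above can be expressed as a small helper (the change-type labels are illustrative, not an EAROS vocabulary):

```python
def bump(version, change):
    """Apply the rubric semver policy: scoring-model and gate-structure
    changes are major, new criteria are minor, documentation fixes are patch."""
    major, minor, patch = map(int, version.split("."))
    if change in ("scoring_model", "gate_structure"):
        return f"{major + 1}.0.0"
    if change == "new_criterion":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # documentation fix
```

Encoding the policy this way makes it enforceable in the rubric repository's CI rather than relying on reviewers to remember it.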
90
+
91
+ ### Culture
92
+
93
+ The most common failure mode for architecture review frameworks is perception. If teams see EAROS as a bureaucratic gate --- a hoop to jump through before deployment --- adoption will be grudging and superficial.
94
+
95
+ Position EAROS as a quality tool, not a gatekeeping tool:
96
+
97
+ - **For authors:** EAROS tells you exactly what a good artifact looks like before you write it. Use the `earos-template-fill` skill or the scoring guide to understand expectations upfront.
98
+ - **For reviewers:** EAROS makes review faster and more consistent. You spend less time debating what "good" means and more time giving actionable feedback.
99
+ - **For leadership:** EAROS provides measurable, comparable quality data across the portfolio. It replaces "we think our architecture practice is mature" with evidence.
100
+
101
+ ### Scaling across teams
102
+
103
+ Start with a pilot team that is motivated and willing to iterate. Let them reach Level 3 before expanding. Use their calibration data, evaluation records, and lessons learned to accelerate the next team's adoption. Do not mandate enterprise-wide adoption on day one.
104
+
105
+ ## Executive Reporting
106
+
107
+ The `earos-report` skill generates portfolio-level views from evaluation records. At Level 5, these reports become a regular governance input.
108
+
109
+ ### What executive reports provide
110
+
111
+ - **Traffic-light dashboards:** Red/amber/green status for each evaluated artifact, grouped by team, domain, or portfolio
112
+ - **Dimension trends:** Which quality dimensions are improving or declining across the portfolio over time
113
+ - **Gate failure hotspots:** Which criteria most frequently trigger gate failures --- these are systemic weaknesses worth investing in
114
+ - **Remediation tracking:** Status of actions from Conditional Pass evaluations --- are they being completed?
115
+
116
+ ### Aggregating across the portfolio
117
+
118
+ Individual evaluations tell you about one artifact. Aggregated evaluations tell you about your architecture practice. Track:
119
+
120
+ - How many artifacts achieve Pass vs. Conditional Pass vs. Rework Required
121
+ - Which teams consistently produce higher-quality artifacts (and what practices they follow)
122
+ - Whether quality improves after remediation actions are completed
123
+
124
+ ## Measuring Adoption
125
+
126
+ At Level 5, you track KPIs for architecture review maturity itself:
127
+
128
+ | KPI | What It Measures | Target Direction |
129
+ |-----|-----------------|-----------------|
130
+ | **Evaluation throughput** | Number of evaluations completed per month | Increasing (more artifacts reviewed) |
131
+ | **Time-to-review** | Elapsed time from artifact submission to completed evaluation | Decreasing (faster feedback) |
132
+ | **Inter-rater reliability** | Kappa between human and agent evaluators | Stable above 0.70 |
133
+ | **Remediation completion rate** | Percentage of Conditional Pass actions completed within deadline | Increasing |
134
+ | **Gate failure rate** | Percentage of evaluations triggering critical or major gate failures | Decreasing (artifacts improving) |
135
+ | **First-pass Pass rate** | Percentage of artifacts achieving Pass on first evaluation | Increasing (authors learning expectations) |
136
+
137
+ A rising first-pass Pass rate is the strongest signal that EAROS is working: artifact authors are internalizing quality expectations and producing better work upfront.
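Computing this KPI from evaluation records is straightforward. A sketch, assuming records carry an artifact id, an evaluation sequence number, and a status (the record shape is an assumption, not an EAROS format):

```python
def first_pass_rate(evaluations):
    """Fraction of artifacts whose first evaluation achieved Pass.

    evaluations: iterable of (artifact_id, sequence_number, status).
    """
    firsts = {}
    for artifact, seq, status in evaluations:
        # Keep only the earliest evaluation seen for each artifact.
        if artifact not in firsts or seq < firsts[artifact][0]:
            firsts[artifact] = (seq, status)
    if not firsts:
        return 0.0
    return sum(status == "Pass" for _, status in firsts.values()) / len(firsts)
```

Tracked month over month, the trend matters more than the absolute value: a steady climb means authors are learning expectations upfront.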
138
+
139
+ ## Checkpoint: You Are at Level 5 When...
140
+
141
+ - [ ] Architecture evaluation is integrated into your CI/CD or delivery pipeline
142
+ - [ ] Calibration happens continuously, not just at setup time --- drift is detected and triggers re-calibration
143
+ - [ ] You create and maintain custom profiles for your organization's artifact types
144
+ - [ ] Executive reporting provides portfolio-level quality visibility on a regular cadence
145
+ - [ ] Rubric updates follow a governed change process (version bumps, owner approval, re-calibration)
146
+ - [ ] Architecture review is perceived as a quality enabler, not a bureaucratic gate
147
+
148
+ ## What Comes Next
149
+
150
+ Level 5 is not a destination --- it is a steady state of continuous improvement. From here:
151
+
152
+ - **Contribute back.** EAROS is open source. If you create profiles for artifact types that others would benefit from, consider contributing them to the project.
153
+ - **Share calibration data.** Cross-organizational calibration data strengthens the framework for everyone. Anonymized score distributions help improve the Wasserstein calibration baselines.
154
+ - **Evolve the standard.** As your organization's architecture practice matures, your evaluation needs will evolve. Propose improvements to the EAROS standard through the governance process.
155
+
156
+ For the full EAROS standard, see `standard/EAROS.md`. For terminology, see the [Terminology glossary](../terminology.md). For technical reference on creating profiles, see the [Profile Authoring Guide](../profile-authoring-guide.md).
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@trohde/earos",
3
- "version": "1.0.2",
3
+ "version": "1.1.1",
4
4
  "description": "Schema-driven editor and CLI for EaROS architecture assessment rubrics and evaluations",
5
5
  "type": "module",
6
6
  "bin": {