@trohde/earos 1.2.0 → 1.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/assets/init/docs/getting-started.md +1 -1
- package/assets/init/docs/onboarding/agent-assisted.md +19 -19
- package/assets/init/docs/onboarding/first-assessment.md +18 -18
- package/assets/init/docs/onboarding/governed-review.md +10 -10
- package/assets/init/docs/onboarding/overview.md +15 -15
- package/assets/init/docs/onboarding/scaling-optimization.md +13 -13
- package/assets/init/docs/plans/2026-03-23-001-refactor-site-review-findings-plan.md +195 -0
- package/assets/init/docs/plans/2026-03-23-002-refactor-cli-review-findings-plan.md +736 -0
- package/assets/init/docs/terminology.md +1 -1
- package/bin.js +156 -36
- package/dist/assets/{_basePickBy-PmSUrUsK.js → _basePickBy-BlC_TeV6.js} +1 -1
- package/dist/assets/{_baseUniq-HuZouVIz.js → _baseUniq-CVy7rcC1.js} +1 -1
- package/dist/assets/{arc-CJFxtF3d.js → arc-Cd8wvd7z.js} +1 -1
- package/dist/assets/{architectureDiagram-2XIMDMQ5-XA-oU2UG.js → architectureDiagram-2XIMDMQ5-D_f4_aMp.js} +1 -1
- package/dist/assets/{blockDiagram-WCTKOSBZ-Oxp-wAST.js → blockDiagram-WCTKOSBZ-B-y6N5--.js} +1 -1
- package/dist/assets/{c4Diagram-IC4MRINW-D8m5hQH9.js → c4Diagram-IC4MRINW-C3-v3oNT.js} +1 -1
- package/dist/assets/channel-BSC0F15G.js +1 -0
- package/dist/assets/{chunk-4BX2VUAB-D2kBTn2O.js → chunk-4BX2VUAB-CMPwQN83.js} +1 -1
- package/dist/assets/{chunk-55IACEB6-Dxqrf5oZ.js → chunk-55IACEB6-Bdkfhvrr.js} +1 -1
- package/dist/assets/{chunk-FMBD7UC4-DoOEFFQC.js → chunk-FMBD7UC4-ptKQX5uF.js} +1 -1
- package/dist/assets/{chunk-JSJVCQXG-BerphV2K.js → chunk-JSJVCQXG-DO0UU_OX.js} +1 -1
- package/dist/assets/{chunk-KX2RTZJC-CxUAqT05.js → chunk-KX2RTZJC-DRj2OZnD.js} +1 -1
- package/dist/assets/{chunk-NQ4KR5QH-fCqZgFkU.js → chunk-NQ4KR5QH-C4Nsf7ww.js} +1 -1
- package/dist/assets/{chunk-QZHKN3VN-HlpHnJEy.js → chunk-QZHKN3VN-B1GO0Nwy.js} +1 -1
- package/dist/assets/{chunk-WL4C6EOR-D9yxAHyd.js → chunk-WL4C6EOR-lFR6fjR8.js} +1 -1
- package/dist/assets/classDiagram-VBA2DB6C-BHDWMOEz.js +1 -0
- package/dist/assets/classDiagram-v2-RAHNMMFH-BHDWMOEz.js +1 -0
- package/dist/assets/clone-BdN-3iAD.js +1 -0
- package/dist/assets/{cose-bilkent-S5V4N54A-F5xOBvqW.js → cose-bilkent-S5V4N54A-IpR9mVIO.js} +1 -1
- package/dist/assets/{dagre-KLK3FWXG-CD3BTpHv.js → dagre-KLK3FWXG-B4YA6T7N.js} +1 -1
- package/dist/assets/{diagram-E7M64L7V-C3D9MCay.js → diagram-E7M64L7V-Do5l6es_.js} +1 -1
- package/dist/assets/{diagram-IFDJBPK2-zJBVM-GK.js → diagram-IFDJBPK2-D5MxfKVv.js} +1 -1
- package/dist/assets/{diagram-P4PSJMXO-BrmFZOLB.js → diagram-P4PSJMXO-Djr28EgW.js} +1 -1
- package/dist/assets/{erDiagram-INFDFZHY-aSMhKiV2.js → erDiagram-INFDFZHY-BuM-rbCL.js} +1 -1
- package/dist/assets/{flowDiagram-PKNHOUZH-DwgX7l8F.js → flowDiagram-PKNHOUZH-By3WGI7Q.js} +1 -1
- package/dist/assets/{ganttDiagram-A5KZAMGK-C57Hz6QW.js → ganttDiagram-A5KZAMGK-GLmBfK72.js} +1 -1
- package/dist/assets/{gitGraphDiagram-K3NZZRJ6-CuchqqGh.js → gitGraphDiagram-K3NZZRJ6-BN0iXeIv.js} +1 -1
- package/dist/assets/{graph-CPFGBV5J.js → graph-CDzuMtjV.js} +1 -1
- package/dist/assets/{index-DMt1cpG6.js → index-DoeSN_Oe.js} +130 -130
- package/dist/assets/{infoDiagram-LFFYTUFH-Dd_5tfX7.js → infoDiagram-LFFYTUFH-C888gaFw.js} +1 -1
- package/dist/assets/{ishikawaDiagram-PHBUUO56-DwosSEvT.js → ishikawaDiagram-PHBUUO56-ChIO9DG-.js} +1 -1
- package/dist/assets/{journeyDiagram-4ABVD52K-BuCxcsX0.js → journeyDiagram-4ABVD52K-CufMUDcs.js} +1 -1
- package/dist/assets/{kanban-definition-K7BYSVSG-DF_1UCkW.js → kanban-definition-K7BYSVSG-BpsSVpX8.js} +1 -1
- package/dist/assets/{layout-DIcS6m1g.js → layout-B8RWVBSF.js} +1 -1
- package/dist/assets/{linear-BXkwBhoJ.js → linear-BJwxtq9r.js} +1 -1
- package/dist/assets/{mindmap-definition-YRQLILUH-DcDvYagd.js → mindmap-definition-YRQLILUH-C6WPimbf.js} +1 -1
- package/dist/assets/{pieDiagram-SKSYHLDU-BmeDeWDM.js → pieDiagram-SKSYHLDU-DeCGMWf8.js} +1 -1
- package/dist/assets/{quadrantDiagram-337W2JSQ-3zfjULUM.js → quadrantDiagram-337W2JSQ-D9TWaS83.js} +1 -1
- package/dist/assets/{requirementDiagram-Z7DCOOCP-B2wQMJpq.js → requirementDiagram-Z7DCOOCP-DTnuXlAq.js} +1 -1
- package/dist/assets/{sankeyDiagram-WA2Y5GQK-__kKlCTq.js → sankeyDiagram-WA2Y5GQK-B2dplCgD.js} +1 -1
- package/dist/assets/{sequenceDiagram-2WXFIKYE-B7O81Vih.js → sequenceDiagram-2WXFIKYE-cBvgSSju.js} +1 -1
- package/dist/assets/{stateDiagram-RAJIS63D-CcJaDrAK.js → stateDiagram-RAJIS63D-Cwr7VtSX.js} +1 -1
- package/dist/assets/stateDiagram-v2-FVOUBMTO-B59h7VTZ.js +1 -0
- package/dist/assets/{timeline-definition-YZTLITO2-DSaQQqIU.js → timeline-definition-YZTLITO2-Dkp163fK.js} +1 -1
- package/dist/assets/{treemap-KZPCXAKY-9Hcrd8XD.js → treemap-KZPCXAKY-BUWHa5xU.js} +1 -1
- package/dist/assets/{vennDiagram-LZ73GAT5-BqHNyca2.js → vennDiagram-LZ73GAT5-BihD66ma.js} +1 -1
- package/dist/assets/{xychartDiagram-JWTSCODW-BqeYf6Fk.js → xychartDiagram-JWTSCODW-Cw4lPbuZ.js} +1 -1
- package/dist/index.html +1 -1
- package/export-docx.js +12 -4
- package/init.js +19 -14
- package/manifest-cli.mjs +32 -3
- package/package.json +3 -2
- package/serve.js +44 -19
- package/utils/export-markdown.js +486 -0
- package/dist/assets/channel-SoktpVBQ.js +0 -1
- package/dist/assets/classDiagram-VBA2DB6C-BT2AdZTe.js +0 -1
- package/dist/assets/classDiagram-v2-RAHNMMFH-BT2AdZTe.js +0 -1
- package/dist/assets/clone-DOjIfi5r.js +0 -1
- package/dist/assets/stateDiagram-v2-FVOUBMTO-B2goOPt-.js +0 -1
@@ -24,7 +24,7 @@ EaROS has profiles for the most common enterprise architecture artifact types:
 | Architecture Decision Record (ADR) | `profiles/adr.yaml` | Approved |
 | Capability map | `profiles/capability-map.yaml` | Approved |
 | Architecture roadmap | `profiles/roadmap.yaml` | Draft |
-| Other / unknown | Core only: `core/core-meta-rubric.yaml` |
+| Other / unknown | Core only: `core/core-meta-rubric.yaml` | — |

 > **Status:** *Approved* profiles have completed calibration. *Draft* profiles are usable but have not yet been calibrated with inter-rater reliability measured. Check `earos.manifest.yaml` for the latest status of each rubric.

@@ -2,7 +2,7 @@

 > **Level 3 to 4: Governed to Hybrid**

-Your team produces governed, calibrated evaluations. Now you bring AI agents into the process
+Your team produces governed, calibrated evaluations. Now you bring AI agents into the process — not to replace human reviewers, but to provide an independent second perspective that strengthens every assessment.

 ## What Changes at This Level

@@ -14,23 +14,23 @@ Agent evaluations follow an 8-step directed acyclic graph (DAG). Each step must

 ### The 8-Step DAG Evaluation Flow

-**Step 1
+**Step 1 — Structural Validation.** The agent confirms the artifact conforms to its declared type. Does it have the expected sections? Is it machine-readable or does it require OCR? Can the agent identify the artifact's scope and purpose?

-**Step 2
+**Step 2 — Content Extraction.** The agent identifies sections, diagrams, traceability elements, and key content areas. This builds a map of the artifact's structure before scoring begins.

-**Step 3
+**Step 3 — Criterion Scoring.** The agent applies the RULERS protocol to each criterion: extract a direct quote or reference from the artifact as evidence, then match it against the `scoring_guide` level descriptors to assign a 0–4 score. If no evidence can be found, score N/A and explain why.

-**Step 4
+**Step 4 — Cross-Reference Validation.** The agent checks consistency across views: do component names match across diagrams? Do interface definitions agree between the API contract and the sequence diagram? Are there contradictions between sections?

-**Step 5
+**Step 5 — Dimension Aggregation.** The agent computes weighted dimension averages using the dimension weights defined in the rubric.

-**Step 6
+**Step 6 — Challenge Pass.** A second perspective (another agent instance or a human) challenges the evaluator's highest and lowest scores. Are the highest scores supported by strong observed evidence, or are they inflated? Are the lowest scores genuinely that weak, or did the evaluator miss relevant content?

-**Step 7
+**Step 7 — Calibration.** The agent aligns its score distribution to reference human distributions using the Wasserstein-based method (`rulers_wasserstein`). This prevents systematic over-scoring or under-scoring relative to human reviewers.

-**Step 8
+**Step 8 — Status Determination.** Gates are checked first (a critical gate failure blocks a passing status — the specific outcome, `Reject` or `Not Reviewable`, is determined by the criterion's `failure_effect`), then the weighted average is computed and applied against the status thresholds.

-> **The DAG is not optional.** Skipping steps
+> **The DAG is not optional.** Skipping steps — particularly the challenge pass (Step 6) — undermines evaluation quality. An agent evaluation without a challenge pass is an unchecked evaluation.

 ## Setting Up Agent Evaluation

@@ -47,9 +47,9 @@ The workspace includes 10 EAROS skills. The three most relevant for agent evalua

 | Skill | Purpose |
 |-------|---------|
-| `earos-assess` | Primary evaluation
-| `earos-review` | Challenger
-| `earos-template-fill` | Author guide
+| `earos-assess` | Primary evaluation — runs the full 8-step DAG on any artifact |
+| `earos-review` | Challenger — audits an existing evaluation for over-scoring and unsupported claims |
+| `earos-template-fill` | Author guide — coaches artifact authors through writing assessment-ready documents |

 ### With Other AI Agents

@@ -65,7 +65,7 @@ To run an agent assessment, provide the artifact and invoke the `earos-assess` s
 4. Execute the full 8-step DAG
 5. Produce an evaluation record conforming to `evaluation.schema.json`

-The output includes scores, evidence anchors, evidence classes (observed/inferred/external), confidence levels (high/medium/low), and a status determination. Every score is auditable
+The output includes scores, evidence anchors, evidence classes (observed/inferred/external), confidence levels (high/medium/low), and a status determination. Every score is auditable — you can trace each one back to the evidence that supports it.

 ## The Hybrid Model

@@ -75,7 +75,7 @@ The hybrid model is the defining practice of Level 4. Here is how it works:

 2. **Score comparison.** After both evaluations are complete, compare results criterion by criterion.

-3. **Disagreement resolution.** Any disagreement of 2 or more points on the same criterion must be resolved. Do not split the difference
+3. **Disagreement resolution.** Any disagreement of 2 or more points on the same criterion must be resolved. Do not split the difference — go back to the `scoring_guide` level descriptors and determine which score more accurately reflects the evidence.

 4. **Reconciled record.** The final evaluation record captures both evaluators (mode: human and mode: agent) and notes where reconciliation occurred.

@@ -87,7 +87,7 @@ The hybrid model is the defining practice of Level 4. Here is how it works:

 ## The Challenge Pass

-Step 6 of the DAG
+Step 6 of the DAG — the challenge pass — deserves special attention because it is the most commonly skipped step and the most valuable.

 In the challenge pass, a second perspective reviews the evaluation and specifically targets:

@@ -104,11 +104,11 @@ At Level 4, you begin tracking quantitative metrics for evaluation quality:
 | Metric | Target | What It Tells You |
 |--------|--------|------------------|
 | **Cohen's kappa** (human-agent) | > 0.70 | Agreement between human and agent after calibration |
-| **Spearman's rho** (human-agent) | > 0.80 | Rank-order correlation
+| **Spearman's rho** (human-agent) | > 0.80 | Rank-order correlation — do human and agent agree on which criteria are strong vs. weak? |
 | **Gate failure rate** | Track trend | How often critical or major gates fail, and for which criteria |
 | **Score distribution** | Compare over time | Are scores clustering (suggesting rubber-stamping) or well-distributed? |

-Track these metrics per rubric, per team, and over time. A declining kappa suggests calibration drift
+Track these metrics per rubric, per team, and over time. A declining kappa suggests calibration drift — time to re-calibrate.

 ## Checkpoint: You Are at Level 4 When...

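The agreement metrics added in this hunk are standard statistics. As an illustrative sketch only (the library choice and the sample scores below are assumptions, not part of the package), a team could compute both from paired human and agent criterion scores like this:

```python
# Illustrative only: paired human/agent scores on the 0-4 ordinal scale,
# one entry per criterion. The data and thresholds are invented examples.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

human_scores = [3, 2, 4, 1, 3, 2, 3, 0, 2, 3]
agent_scores = [3, 2, 3, 1, 3, 2, 4, 1, 2, 3]

kappa = cohen_kappa_score(human_scores, agent_scores)   # documented target: > 0.70
rho, _p_value = spearmanr(human_scores, agent_scores)   # documented target: > 0.80

print(f"Cohen's kappa: {kappa:.2f}, Spearman's rho: {rho:.2f}")
```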
@@ -116,7 +116,7 @@ Track these metrics per rubric, per team, and over time. A declining kappa sugge
 - [ ] Every agent evaluation includes a challenge pass (Step 6)
 - [ ] Human-agent disagreements of 2 or more points are routinely resolved against level descriptors
 - [ ] You track inter-rater reliability metrics (kappa and/or Spearman's rho)
-- [ ] Agent evaluations are auditable
+- [ ] Agent evaluations are auditable — evidence anchors, evidence classes, and confidence are captured for every score

 ## Next Steps

@@ -8,14 +8,14 @@ This guide walks you from zero to your first scored architecture evaluation. By

 - How to install the EAROS CLI and initialize a workspace
 - The 9 dimensions and 10 criteria of the core meta-rubric
-- How the 0
+- How the 0–4 scoring scale works in practice
 - How to find evidence in an artifact, cite it, and assign a score
 - How gates work and when they override the average
 - How to interpret your evaluation result

 ## Prerequisites

-You need one thing: **an architecture artifact to assess**. This can be any document your organization produces
+You need one thing: **an architecture artifact to assess**. This can be any document your organization produces — a solution design, an ADR, a capability map, a reference architecture, a roadmap. It does not need to be perfect; in fact, a flawed artifact is more instructive for a first assessment.

 ## Installing the CLI

@@ -31,7 +31,7 @@ Then initialize a workspace in your project directory:
 earos init my-workspace
 ```

-This creates a complete EAROS workspace with rubric files, JSON schemas, agent skills, and an `AGENTS.md` file for AI-assisted evaluation. The workspace is self-contained
+This creates a complete EAROS workspace with rubric files, JSON schemas, agent skills, and an `AGENTS.md` file for AI-assisted evaluation. The workspace is self-contained — everything you need is scaffolded into the directory.

 ## Understanding the Workspace

@@ -39,11 +39,11 @@ This creates a complete EAROS workspace with rubric files, JSON schemas, agent s

 The `earos init` command creates a structured directory containing:

-- **Rubric files**
-- **JSON schemas**
-- **Agent skills**
-- **AGENTS.md**
-- **Manifest**
+- **Rubric files** — the core meta-rubric and all built-in profiles and overlays (YAML)
+- **JSON schemas** — for validating rubrics, evaluation records, and artifact documents
+- **Agent skills** — 10 pre-configured skills for AI-assisted evaluation (in `.agents/skills/`)
+- **AGENTS.md** — agent-agnostic instructions for AI tools like Cursor, Copilot, and Windsurf
+- **Manifest** — an inventory of all available rubrics

 ## The Core Meta-Rubric

@@ -61,9 +61,9 @@ The core meta-rubric (`EAROS-CORE-002`) is the universal foundation. It applies
 | **D8: Actionability and implementation relevance** | Can a delivery team act on this artifact without significant guesswork? |
 | **D9: Artifact maintainability and stewardship** | Is the artifact versioned, owned, and structured so it can be maintained over time? |

-For your first assessment, you will score every criterion in every dimension. The core rubric is intentionally compact
+For your first assessment, you will score every criterion in every dimension. The core rubric is intentionally compact — 10 criteria is manageable for a first pass.

-## The 0
+## The 0–4 Scoring Scale

 Every criterion uses the same ordinal scale:

@@ -90,23 +90,23 @@ Follow these steps for each of the 10 criteria:

 3. **Record the evidence reference.** Write down where you found it: section number, page, diagram ID, or a direct quote. "Section 3 states: 'Primary stakeholders are the CTO and Head of Payments'" is valid evidence. "The artifact seems to address this" is not.

-4. **Assign the score.** Match what you found against the level descriptors in the `scoring_guide`. Use the `decision_tree` if you are unsure
+4. **Assign the score.** Match what you found against the level descriptors in the `scoring_guide`. Use the `decision_tree` if you are unsure — it provides IF/THEN logic for resolving ambiguous cases.

 5. **Move to the next criterion.** Repeat for all 10 criteria.

 ## Understanding Gates

-Not all criteria are equal. Some have **gates**
+Not all criteria are equal. Some have **gates** — threshold controls that can block a passing status regardless of how well you score on everything else.

 | Gate Severity | What Happens |
 |---------------|-------------|
-| **Critical** | If the score is below the gate threshold, the artifact is blocked from passing. The `failure_effect` determines the outcome
+| **Critical** | If the score is below the gate threshold, the artifact is blocked from passing. The `failure_effect` determines the outcome — typically **Reject**, or **Not Reviewable** when evidence is too incomplete to score. |
 | **Major** | A low score caps the maximum achievable status (e.g., cannot pass above Conditional Pass). |
 | **Advisory** | A low score triggers a recommendation but does not block any status. |

 In the core rubric, two criteria have **critical** gates: **SCP-01** (Scope and boundary clarity) — if the scope is so unclear that the artifact cannot be reviewed (score < 2), the result is "Not Reviewable" regardless of all other scores; and **CMP-01** (Standards and policy compliance) — if mandatory control compliance cannot be determined, the result is "Reject". **STK-01** (Stakeholder and purpose fit) and **TRC-01** (Traceability) have **major** gates.

-> **Rule: Gates before averages.** Always check gate criteria first. If a critical gate fails, stop
+> **Rule: Gates before averages.** Always check gate criteria first. If a critical gate fails, stop — the result is determined by the gate's `failure_effect` (Reject or Not Reviewable). Only then compute the weighted average for the remaining status thresholds.

 ## Interpreting Your Results

@@ -115,25 +115,25 @@ After scoring all criteria and checking gates, compute the weighted average acro
 | Status | Threshold |
 |--------|-----------|
 | **Pass** | No critical gate failure, overall average >= 3.2, and no dimension average < 2.0 |
-| **Conditional Pass** | No critical gate failure, overall average 2.4
+| **Conditional Pass** | No critical gate failure, overall average 2.4–3.19 (weaknesses are containable with named actions) |
 | **Rework Required** | Overall average < 2.4, or repeated weak dimensions, or insufficient evidence |
 | **Reject** | Any critical gate failure, or mandatory control breach |
 | **Not Reviewable** | Evidence too incomplete to score responsibly |

 

-A Conditional Pass is not a failure
+A Conditional Pass is not a failure — it means the artifact is close but needs specific, named improvements before it is decision-ready. Record those improvements as actions in the evaluation record.

 ## Checkpoint: You Are at Level 2 When...

 - [ ] You have completed at least one assessment using the core meta-rubric
-- [ ] Every score has a cited evidence reference
+- [ ] Every score has a cited evidence reference — not "seems adequate" but a specific section, page, or quote
 - [ ] You can explain the difference between a score of 2 and a score of 3 for any criterion
 - [ ] You understand which gates would block a Pass status and why
 - [ ] Your evaluation result includes a status determination (Pass, Conditional Pass, Rework Required, Reject, or Not Reviewable)

 ## Next Steps

-You now have a reproducible, evidence-backed architecture evaluation. The next step is to scale this from an individual practice to a team-wide governed process
+You now have a reproducible, evidence-backed architecture evaluation. The next step is to scale this from an individual practice to a team-wide governed process — with artifact-specific profiles, cross-cutting overlays, and calibrated scoring.

 Continue to [Governed Review](governed-review.md).
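The gates-before-averages rule and the thresholds quoted in this hunk can be read as a few lines of logic. The sketch below is an illustrative interpretation only: the field names, weights, and data structure are assumptions rather than the package's implementation, and it omits major/advisory gates and the per-dimension minimum for brevity.

```python
# Illustrative sketch of "gates before averages" plus the documented thresholds.
# Criteria structure is invented; the real logic lives in the EAROS rubrics/skills.
# Major/advisory gates and the "no dimension average < 2.0" Pass condition are omitted.
def determine_status(criteria):
    # criteria: list of dicts such as
    # {"id": "SCP-01", "score": 1, "weight": 1.0,
    #  "gate": "critical", "threshold": 2, "failure_effect": "Not Reviewable"}
    for c in criteria:
        if c.get("gate") == "critical" and c["score"] < c["threshold"]:
            return c["failure_effect"]          # e.g. "Reject" or "Not Reviewable"

    total_weight = sum(c["weight"] for c in criteria)
    average = sum(c["score"] * c["weight"] for c in criteria) / total_weight

    if average >= 3.2:
        return "Pass"
    if average >= 2.4:
        return "Conditional Pass"
    return "Rework Required"
```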
@@ -2,7 +2,7 @@

 > **Level 2 to 3: Rubric-Based to Governed**

-You can score an artifact against the core rubric. Now it is time to make architecture review a team-wide, governed practice
+You can score an artifact against the core rubric. Now it is time to make architecture review a team-wide, governed practice — with artifact-specific profiles, context-driven overlays, calibrated teams, and evidence-anchored scoring that is reproducible across your organization.

 ## What Changes at This Level

@@ -16,7 +16,7 @@ At Level 2, you used the core rubric and produced evidence-backed scores. At Lev

 ## Choosing a Profile

-The core meta-rubric's 10 criteria are universal
+The core meta-rubric's 10 criteria are universal — they apply to every architecture artifact. But a reference architecture has different quality expectations than an ADR, and a capability map is evaluated differently than a roadmap. Profiles add artifact-specific criteria on top of the core — typically 3 to 9, depending on the artifact type.

 | Profile | Artifact Type | Status | What It Adds |
 |---------|--------------|--------|-------------|
@@ -28,7 +28,7 @@ The core meta-rubric's 10 criteria are universal --- they apply to every archite

 > **Status key:** *Approved* profiles have been calibrated and are ready for governed use. *Draft* profiles are usable but have not completed the full calibration process (see [Calibrating with Your Team](#calibrating-with-your-team)).

-Every profile declares `inherits: [EAROS-CORE-002]`. This means when you evaluate a reference architecture, you score it against all 10 core criteria **plus** the profile's additional criteria
+Every profile declares `inherits: [EAROS-CORE-002]`. This means when you evaluate a reference architecture, you score it against all 10 core criteria **plus** the profile's additional criteria — 13–19 criteria total depending on the profile.

 > **How to choose:** Match the profile to the artifact's declared type. If the artifact does not fit any built-in profile, use the core rubric alone. Creating custom profiles is covered in [Scaling and Optimization](scaling-optimization.md).

@@ -44,7 +44,7 @@ Overlays inject cross-cutting concerns that apply across artifact types. Unlike
 | **Data Governance** (`data-governance.yaml`) | Approved | The artifact describes data flows, data retention, data classification, or data lineage |
 | **Regulatory** (`regulatory.yaml`) | Draft | The artifact operates in a regulated domain: payments, healthcare, financial reporting, privacy |

-Overlays are additive
+Overlays are additive — they append criteria to the base rubric (core + profile). They cannot remove or weaken gates from the base. An overlay's critical gate adds to the gate model; it does not replace it.

 You can apply multiple overlays simultaneously. A payments solution architecture might use the solution-architecture profile with both the security and regulatory overlays.

@@ -58,7 +58,7 @@ For each criterion:

 1. Search the artifact for content that addresses the criterion
 2. If you find it, record the evidence anchor: a direct quote, section reference, or diagram ID
-3. Then
+3. Then — and only then — match the evidence against the `scoring_guide` level descriptors
 4. If you cannot find evidence, record N/A and explain why the criterion does not apply, or score 0 and note the absence

 Never score from impression. "The artifact seems to address security" is not evidence. "Section 7.2 states: 'All inter-service communication uses mTLS with certificates rotated every 90 days'" is evidence.
@@ -73,7 +73,7 @@ Every piece of evidence you cite must be classified:
 | **Inferred** | A reasonable interpretation of content that is not directly stated | Medium |
 | **External** | Judgment based on a standard, policy, or source outside the artifact | Lowest |

-Observed evidence is always preferred. If you find yourself relying heavily on inferred or external evidence, the artifact may have significant gaps
+Observed evidence is always preferred. If you find yourself relying heavily on inferred or external evidence, the artifact may have significant gaps — which is itself a finding worth recording.

 ## The Three Evaluation Types

@@ -87,7 +87,7 @@ EAROS distinguishes three distinct judgment types that should not be collapsed i

 These are related but distinct. A beautifully written, complete document can describe an architecturally unsound system. A technically excellent architecture can be documented in an unmaintainable artifact. Collapsing these into one score hides critical information.

-In practice, EAROS criteria map to these three types through the dimension structure
+In practice, EAROS criteria map to these three types through the dimension structure — the narrative summary in the evaluation record should address all three perspectives. The rubric's criterion scores provide the evidence base; the narrative synthesizes them into these three distinct judgments.

 

@@ -97,7 +97,7 @@ Calibration is what transforms individual scoring into a team capability. Withou

 ### Step-by-step calibration exercise

-1. **Select 3
+1. **Select 3–5 representative artifacts.** Aim for diversity: one strong artifact, one weak, one ambiguous, and one incomplete. The gold-standard example at `examples/aws-event-driven-order-processing/` is an excellent starting point.

 2. **Have 2+ reviewers score independently.** Each reviewer scores the same artifact against the same rubric without discussing their scores.

@@ -125,7 +125,7 @@ For more on evaluation record structure, see the [Getting Started guide](../gett
 ## Checkpoint: You Are at Level 3 When...

 - [ ] Your team uses a matching profile (not just the core rubric) for every assessment
-- [ ] Every score uses the RULERS protocol
+- [ ] Every score uses the RULERS protocol — evidence anchor first, then score
 - [ ] You have completed a calibration exercise with kappa > 0.70
 - [ ] Overlays are applied based on context (not arbitrarily or never)
 - [ ] Evaluation records are structured and conform to `evaluation.schema.json`
@@ -133,6 +133,6 @@ For more on evaluation record structure, see the [Getting Started guide](../gett

 ## Next Steps

-Your team now produces governed, calibrated, evidence-anchored architecture evaluations. The next step is to bring AI agents into the process
+Your team now produces governed, calibrated, evidence-anchored architecture evaluations. The next step is to bring AI agents into the process — not to replace human judgment, but to augment it with a second independent perspective.

 Continue to [Agent-Assisted Evaluation](agent-assisted.md).
@@ -2,12 +2,12 @@

 ## Why Staged Adoption Matters

-Organizations that attempt to leap from ad hoc architecture review to fully automated evaluation almost always fail. The gap is too wide: teams lack the shared vocabulary, calibrated judgment, and institutional habits that make structured review work. This is not a technology problem
+Organizations that attempt to leap from ad hoc architecture review to fully automated evaluation almost always fail. The gap is too wide: teams lack the shared vocabulary, calibrated judgment, and institutional habits that make structured review work. This is not a technology problem — it is a capability maturity problem.

 The EAROS Adoption Maturity Model draws on three decades of maturity research:

 - **CMMI** (Capability Maturity Model Integration) established the 5-level progression from initial/ad hoc to optimizing, demonstrating that process maturity is built incrementally.
-- **Gartner IT Score for Enterprise Architecture** identified that EA maturity depends on governance discipline, stakeholder engagement, and measurement
+- **Gartner IT Score for Enterprise Architecture** identified that EA maturity depends on governance discipline, stakeholder engagement, and measurement — not tooling alone.
 - **OMB EAAF** (Enterprise Architecture Assessment Framework) showed that federal agencies succeed when they build capability in stages aligned to organizational readiness.
 - **TOGAF ACMM** (Architecture Capability Maturity Model) provided the architecture-specific framing: maturity grows from informal practices through defined processes to measured and optimized operations.

@@ -15,25 +15,25 @@ EAROS applies these lessons to a specific domain: architecture artifact evaluati

 ## The Five Levels

-### Level 1
+### Level 1 — Ad Hoc

 No formal review process. Evaluation quality depends entirely on who happens to review the artifact. Different reviewers apply different mental models, and feedback is inconsistent and unreproducible.

 - **Key practices:** Informal peer review, tribal knowledge
 - **EAROS capabilities:** None (this is the baseline state)
-- **You are here when:** You recognize the problem
+- **You are here when:** You recognize the problem — reviews are inconsistent and reviewer-dependent

-### Level 2
+### Level 2 — Rubric-Based

-The core rubric is adopted. Every assessment uses the same 9 dimensions and 10 criteria with the 0
+The core rubric is adopted. Every assessment uses the same 9 dimensions and 10 criteria with the 0–4 scoring scale. Evidence is cited for every score. Results are reproducible across reviewers.

 - **Key practices:** Manual scoring against core meta-rubric, evidence citation for every score, gate checking
-- **EAROS capabilities:** Core meta-rubric, scoring sheets, 0
+- **EAROS capabilities:** Core meta-rubric, scoring sheets, 0–4 scale
 - **You are here when:** You have completed at least one assessment using the core rubric with evidence for every score

 > **Guide:** [Your First Assessment](first-assessment.md) walks you through this transition.

-### Level 3
+### Level 3 — Governed

 Artifact-specific profiles and context-driven overlays are in use. Teams are calibrated against reference examples. The RULERS protocol ensures evidence-anchored scoring. Evaluation records are structured and auditable.

@@ -43,7 +43,7 @@ Artifact-specific profiles and context-driven overlays are in use. Teams are cal

 > **Guide:** [Governed Review](governed-review.md) walks you through this transition.

-### Level 4
+### Level 4 — Hybrid

 AI agents augment human reviewers. Both evaluate independently and reconcile disagreements against level descriptors. Metrics track inter-rater reliability between human and agent evaluators.

@@ -53,7 +53,7 @@ AI agents augment human reviewers. Both evaluate independently and reconcile dis

 > **Guide:** [Agent-Assisted Evaluation](agent-assisted.md) walks you through this transition.

-### Level 5
+### Level 5 — Optimized

 Architecture evaluation is continuous and integrated into delivery workflows. Calibration happens automatically. Executive reporting provides portfolio-level quality visibility. Rubrics are governed assets with version control and change management.

@@ -67,13 +67,13 @@ Architecture evaluation is continuous and integrated into delivery workflows. Ca

 The onboarding guide is organized as four transition guides, one for each level transition:

-1. [Your First Assessment](first-assessment.md)
-2. [Governed Review](governed-review.md)
-3. [Agent-Assisted Evaluation](agent-assisted.md)
-4. [Scaling and Optimization](scaling-optimization.md)
+1. [Your First Assessment](first-assessment.md) — Level 1 to 2: Ad Hoc to Rubric-Based
+2. [Governed Review](governed-review.md) — Level 2 to 3: Rubric-Based to Governed
+3. [Agent-Assisted Evaluation](agent-assisted.md) — Level 3 to 4: Governed to Hybrid
+4. [Scaling and Optimization](scaling-optimization.md) — Level 4 to 5: Hybrid to Optimized

 **Sequential reading is recommended.** Each guide builds on concepts introduced in the previous one. However, if you already know your current level from the self-assessment above, you can jump directly to the guide for your next transition.

-> **Tip:** If you are new to EAROS entirely, start with [Your First Assessment](first-assessment.md). It walks you through installation, the core rubric, and your first scored evaluation
+> **Tip:** If you are new to EAROS entirely, start with [Your First Assessment](first-assessment.md). It walks you through installation, the core rubric, and your first scored evaluation — everything you need to move from ad hoc to rubric-based review.

 For deeper reference material, see the [Getting Started guide](../getting-started.md), the [Terminology glossary](../terminology.md), and the full EAROS standard in `standard/EAROS.md`.
@@ -2,11 +2,11 @@

 > **Level 4 to 5: Hybrid to Optimized**

-Your team runs hybrid human-agent evaluations with tracked metrics. Now you make architecture review a continuous, automated, organization-wide capability
+Your team runs hybrid human-agent evaluations with tracked metrics. Now you make architecture review a continuous, automated, organization-wide capability — integrated into delivery workflows, continuously calibrated, and visible to leadership.

 ## What Changes at This Level

-At Level 4, evaluation is a deliberate activity: someone decides to review an artifact, assigns reviewers, and orchestrates the process. At Level 5, evaluation becomes embedded in how your organization delivers
+At Level 4, evaluation is a deliberate activity: someone decides to review an artifact, assigns reviewers, and orchestrates the process. At Level 5, evaluation becomes embedded in how your organization delivers — triggered automatically, calibrated continuously, and reported to stakeholders who never touch a rubric YAML.

 ## CI/CD Integration

@@ -28,11 +28,11 @@ After merge, record evaluation results in a time-series store. This enables tren

 ### Architecture as code

-Fitness functions work best when architecture artifacts are machine-readable. EAROS is designed for this
+Fitness functions work best when architecture artifacts are machine-readable. EAROS is designed for this — artifacts conforming to `artifact.schema.json` can be validated, scored, and tracked automatically. Encourage teams to adopt structured artifact formats (YAML with frontmatter, ArchiMate exchange, diagram-as-code) rather than unstructured documents.

 ## Continuous Calibration

-At earlier levels, calibration is an event
+At earlier levels, calibration is an event — a scheduled exercise where reviewers score reference artifacts and compare results. At Level 5, calibration becomes continuous.

 ### Wasserstein-based alignment

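The Wasserstein-based alignment referenced here (and the `rulers_wasserstein` method mentioned in the agent-assisted guide) compares score distributions rather than individual scores. A minimal Python illustration of that idea, assuming `scipy` and invented sample data and tolerance (this is not the package's implementation):

```python
# Illustrative drift check only: compare a reference human score distribution
# with recent agent scores. The data and the 0.5 tolerance are invented.
from scipy.stats import wasserstein_distance

human_reference = [3, 3, 2, 4, 2, 3, 1, 3, 2, 3]   # calibrated human scores (0-4)
agent_recent = [4, 4, 3, 4, 3, 4, 2, 4, 3, 4]      # recent agent scores (0-4)

drift = wasserstein_distance(human_reference, agent_recent)
if drift > 0.5:
    print(f"Distribution drift {drift:.2f} exceeds tolerance; re-calibrate.")
```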
@@ -62,7 +62,7 @@ The five built-in profiles (solution-architecture, reference-architecture, adr,

 4. **Write up to 12 criteria.** Each criterion needs all required fields: `question`, `description`, `scoring_guide` (all 5 levels), `required_evidence`, `anti_patterns`, `examples.good`, `examples.bad`, `decision_tree`, and `remediation_hints`.

-5. **Calibrate before production.** Score 3
+5. **Calibrate before production.** Score 3–5 representative artifacts with 2+ reviewers. Target kappa > 0.70.

 6. **Publish.** Validate against `rubric.schema.json`, add to the manifest, and document in the changelog.

@@ -74,9 +74,9 @@ For detailed authoring guidance, see the [Profile Authoring Guide](../profile-au

 Use the maturity model itself as a training roadmap. New team members start at Level 1 and progress through the guides:

-- **Week 1:** Complete [Your First Assessment](first-assessment.md)
-- **Week 2:** Complete [Governed Review](governed-review.md)
-- **Week 3:** Complete [Agent-Assisted Evaluation](agent-assisted.md)
+- **Week 1:** Complete [Your First Assessment](first-assessment.md) — score a real artifact against the core rubric
+- **Week 2:** Complete [Governed Review](governed-review.md) — join a calibration exercise, learn profiles and overlays
+- **Week 3:** Complete [Agent-Assisted Evaluation](agent-assisted.md) — run a hybrid evaluation and reconcile disagreements
 - **Ongoing:** Participate in review rotations and calibration exercises

 ### Governance
@@ -90,7 +90,7 @@ Rubrics are governed assets at Level 5. This means:

 ### Culture

-The most common failure mode for architecture review frameworks is perception. If teams see EAROS as a bureaucratic gate
+The most common failure mode for architecture review frameworks is perception. If teams see EAROS as a bureaucratic gate — a hoop to jump through before deployment — adoption will be grudging and superficial.

 Position EAROS as a quality tool, not a gatekeeping tool:

@@ -110,8 +110,8 @@ The `earos-report` skill generates portfolio-level views from evaluation records

 - **Traffic-light dashboards:** Red/amber/green status for each evaluated artifact, grouped by team, domain, or portfolio
 - **Dimension trends:** Which quality dimensions are improving or declining across the portfolio over time
-- **Gate failure hotspots:** Which criteria most frequently trigger gate failures
-- **Remediation tracking:** Status of actions from Conditional Pass evaluations
+- **Gate failure hotspots:** Which criteria most frequently trigger gate failures — these are systemic weaknesses worth investing in
+- **Remediation tracking:** Status of actions from Conditional Pass evaluations — are they being completed?

 ### Aggregating across the portfolio

@@ -139,7 +139,7 @@ A rising first-pass Pass rate is the strongest signal that EAROS is working: art
 ## Checkpoint: You Are at Level 5 When...

 - [ ] Architecture evaluation is integrated into your CI/CD or delivery pipeline
-- [ ] Calibration happens continuously, not just at setup time
+- [ ] Calibration happens continuously, not just at setup time — drift is detected and triggers re-calibration
 - [ ] You create and maintain custom profiles for your organization's artifact types
 - [ ] Executive reporting provides portfolio-level quality visibility on a regular cadence
 - [ ] Rubric updates follow a governed change process (version bumps, owner approval, re-calibration)
@@ -147,7 +147,7 @@ A rising first-pass Pass rate is the strongest signal that EAROS is working: art

 ## What Comes Next

-Level 5 is not a destination
+Level 5 is not a destination — it is a steady state of continuous improvement. From here:

 - **Contribute back.** EAROS is open source. If you create profiles for artifact types that others would benefit from, consider contributing them to the project.
 - **Share calibration data.** Cross-organizational calibration data strengthens the framework for everyone. Anonymized score distributions help improve the Wasserstein calibration baselines.