@trohde/earos 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +156 -0
- package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
- package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
- package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
- package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
- package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
- package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
- package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
- package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
- package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
- package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
- package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
- package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
- package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
- package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
- package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
- package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
- package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
- package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
- package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
- package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
- package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
- package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
- package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
- package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
- package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
- package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
- package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
- package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
- package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
- package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
- package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
- package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
- package/assets/init/.claude/CLAUDE.md +4 -0
- package/assets/init/AGENTS.md +293 -0
- package/assets/init/CLAUDE.md +635 -0
- package/assets/init/README.md +507 -0
- package/assets/init/calibration/gold-set/.gitkeep +0 -0
- package/assets/init/calibration/results/.gitkeep +0 -0
- package/assets/init/core/core-meta-rubric.yaml +643 -0
- package/assets/init/docs/consistency-report.md +325 -0
- package/assets/init/docs/getting-started.md +194 -0
- package/assets/init/docs/profile-authoring-guide.md +51 -0
- package/assets/init/docs/terminology.md +126 -0
- package/assets/init/earos.manifest.yaml +104 -0
- package/assets/init/evaluations/.gitkeep +0 -0
- package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
- package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
- package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
- package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
- package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
- package/assets/init/overlays/data-governance.yaml +94 -0
- package/assets/init/overlays/regulatory.yaml +154 -0
- package/assets/init/overlays/security.yaml +92 -0
- package/assets/init/profiles/adr.yaml +225 -0
- package/assets/init/profiles/capability-map.yaml +223 -0
- package/assets/init/profiles/reference-architecture.yaml +426 -0
- package/assets/init/profiles/roadmap.yaml +205 -0
- package/assets/init/profiles/solution-architecture.yaml +227 -0
- package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
- package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
- package/assets/init/research/reference-architecture-research.md +751 -0
- package/assets/init/standard/EAROS.md +1426 -0
- package/assets/init/standard/schemas/artifact.schema.json +1295 -0
- package/assets/init/standard/schemas/artifact.uischema.json +65 -0
- package/assets/init/standard/schemas/evaluation.schema.json +284 -0
- package/assets/init/standard/schemas/rubric.schema.json +383 -0
- package/assets/init/templates/evaluation-record.template.yaml +58 -0
- package/assets/init/templates/new-profile.template.yaml +65 -0
- package/bin.js +188 -0
- package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
- package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
- package/dist/assets/arc-CyDBhtDM.js +1 -0
- package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
- package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
- package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
- package/dist/assets/channel-CiySTNoJ.js +1 -0
- package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
- package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
- package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
- package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
- package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
- package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
- package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
- package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
- package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
- package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
- package/dist/assets/clone-ylgRbd3D.js +1 -0
- package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
- package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
- package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
- package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
- package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
- package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
- package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
- package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
- package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
- package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
- package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
- package/dist/assets/graph-DLQn37b-.js +1 -0
- package/dist/assets/index-BFFITMT8.js +650 -0
- package/dist/assets/index-H7f6VTz1.css +1 -0
- package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
- package/dist/assets/init-Gi6I4Gst.js +1 -0
- package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
- package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
- package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
- package/dist/assets/katex-B1X10hvy.js +261 -0
- package/dist/assets/layout-C0dvb42R.js +1 -0
- package/dist/assets/linear-j4a8mGj7.js +1 -0
- package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
- package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
- package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
- package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
- package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
- package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
- package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
- package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
- package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
- package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
- package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
- package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
- package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
- package/dist/index.html +23 -0
- package/export-docx.js +1583 -0
- package/init.js +353 -0
- package/manifest-cli.mjs +207 -0
- package/package.json +83 -0
- package/schemas/artifact.schema.json +1295 -0
- package/schemas/artifact.uischema.json +65 -0
- package/schemas/evaluation.schema.json +284 -0
- package/schemas/rubric.schema.json +383 -0
- package/serve.js +238 -0
@@ -0,0 +1,280 @@
# Criterion Writing Guide — EAROS Profile Author

This file explains how to write well-formed EAROS criteria with all 13 required v2 fields. Read this before drafting any criteria.

---

## Why Criterion Quality Determines Profile Reliability

A criterion is not just a question — it is an assessment instruction. A well-written criterion tells the evaluator exactly what to look for, how to classify what they find, and what each score level means. A poorly written criterion leaves the evaluator guessing, which leads to inconsistent scores, low inter-rater reliability, and a profile that cannot be used in governance.

The 13 required v2 fields exist because each one has been found to reduce ambiguity. Missing any of them is not just a schema violation — it is a reliability risk.

---

## The 13 Required Fields

| Field | Purpose |
|-------|---------|
| `id` | Unique identifier for cross-referencing in evaluation records |
| `question` | The scoring question — what the evaluator is asking about the artifact |
| `description` | Why this matters — the quality concern this criterion encodes |
| `metric_type` | Always `ordinal` in EAROS |
| `scale` | Always `[0, 1, 2, 3, 4, "N/A"]` |
| `gate` | Gate configuration or `false` |
| `required_evidence` | List of specific things to look for in the artifact |
| `scoring_guide` | One-sentence level descriptors for scores 0–4 |
| `anti_patterns` | Common failure modes to watch for |
| `examples.good` | What strong evidence looks like (score 3–4) |
| `examples.bad` | What absent or weak evidence looks like (score 0–1) |
| `decision_tree` | Observable conditions that resolve ambiguous scoring |
| `remediation_hints` | Specific improvements that would raise the score |

---
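Because all 13 fields are required, presence can be checked mechanically before full schema validation. A minimal sketch, assuming criteria are parsed into plain Python dicts; the function name and structure are illustrative, not part of the EAROS tooling:

```python
# 11 required top-level keys, per the table above; examples.good and
# examples.bad are nested under `examples`, giving 13 required fields total.
TOP_LEVEL = [
    "id", "question", "description", "metric_type", "scale", "gate",
    "required_evidence", "scoring_guide", "anti_patterns",
    "decision_tree", "remediation_hints",
]

def missing_fields(criterion: dict) -> list[str]:
    """Return the required v2 fields absent from a criterion mapping."""
    missing = [f for f in TOP_LEVEL if f not in criterion]
    examples = criterion.get("examples") or {}
    missing += [f"examples.{sub}" for sub in ("good", "bad") if sub not in examples]
    return missing
```

Run against the incomplete example later in this guide (only `id`, `question`, `scoring_guide`, and `gate` present), it reports exactly the 9 missing fields.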
## Field-by-Field Guidance

### `question`

The question should be:
- Specific to this artifact type (not generic)
- Answerable from artifact content (not requiring external research)
- Focused on a single quality concern (not compound)

**Good:** "Does the reference architecture include context, functional, deployment, and data flow views?"

**Bad:** "Is the architecture complete and well-documented?"

The bad example fails because "complete" and "well-documented" are two different concerns, and "complete" is vague.

---

### `scoring_guide`

Write one sentence per level that describes what the artifact contains at that level — not what the evaluator should do.

**Pattern:**
```yaml
scoring_guide:
  "0": "[Absent description — criterion entirely missing or directly contradicted]"
  "1": "[Weak description — acknowledged or implied but inadequate]"
  "2": "[Partial description — present but incomplete, inconsistent, or weakly evidenced]"
  "3": "[Good description — clearly addressed with adequate evidence and only minor gaps]"
  "4": "[Strong description — fully addressed, well evidenced, internally consistent, decision-ready]"
```

**Common mistake:** Writing what the evaluator should do, not what the artifact shows.

**Wrong:**
```yaml
  "3": "Check whether the artifact covers most of the required views."
```

**Correct:**
```yaml
  "3": "Three or more views present with adequate detail; data flow view exists but lacks narrative."
```

The key test: could an evaluator read the level descriptor and know immediately which score to assign without needing to interpret?

---

### `decision_tree`

The decision tree translates the scoring guide into observable conditions. It is especially important when:
- The scoring guide levels are close together (2 vs. 3 is often ambiguous)
- Scoring requires counting specific features
- There are compound conditions (A AND B = score 3; A OR B = score 2)

**Pattern:** Start with the lowest score condition and work upward. Use IF/THEN structure.

**Good example:**
```yaml
decision_tree: >
  Count distinct architectural views (context, component, deployment, data flow, security):
  IF 0 views THEN score 0.
  IF 1 view only THEN score 1.
  IF 2-3 views present THEN score 2.
  IF 4+ views AND data flow narrative exists THEN score 3.
  IF 4+ views AND all cross-referenced AND security view included THEN score 4.
```

**Bad example:**
```yaml
decision_tree: >
  Evaluate the completeness of the architectural views and assign a score based on quality.
```

This is not a decision tree — it just restates the criterion question.
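The good decision tree above is mechanical enough to express as code, which is a useful self-test while drafting: if the conditions cannot be coded, they are probably not observable. A sketch under that assumption (the function and its parameters are illustrative; the source tree leaves the case of 4+ views with neither extra condition implicit, here mapped back to score 2):

```python
def score_views(views: set[str], has_flow_narrative: bool,
                cross_referenced: bool) -> int:
    """Apply the view-coverage decision tree from the good example above."""
    n = len(views)
    if n == 0:
        return 0
    if n == 1:
        return 1
    if n <= 3:
        return 2
    # 4+ views from here on; check the stronger condition first.
    if cross_referenced and "security" in views:
        return 4
    if has_flow_narrative:
        return 3
    # The source tree leaves this case implicit; treat it as "present but
    # incomplete" (score 2), consistent with the 2-3 view branch.
    return 2
```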
---

### `required_evidence`

List the specific artifact elements an evaluator should search for. These become the RULERS evidence anchors.

**Good:**
```yaml
required_evidence:
  - context diagram (C4 Level 1 or equivalent) showing system boundaries
  - deployment diagram showing infrastructure topology
  - data flow walkthrough (numbered steps or annotated sequence diagram)
  - component diagram showing service decomposition
```

**Bad:**
```yaml
required_evidence:
  - architectural documentation
  - diagrams
```

The bad example gives the evaluator no guidance on what specifically to find.

---

### `examples.good` and `examples.bad`

These are the most important fields for calibration. Include direct quotes or realistic paraphrases from the artifact — not descriptions of what good looks like.

**Good examples:**
```yaml
examples:
  good:
    - >
      "Section 3 provides C4 Level 1 context diagram. Section 4 shows container decomposition.
      Section 5 includes AWS deployment topology with AZ distribution. Section 6 contains
      a 12-step numbered data flow for the payment processing path."
  bad:
    - "See architecture diagram on page 3."
    - >
      "Figure 1: System Overview. [Single box-and-arrow diagram with no explanatory text,
      no deployment details, no data flows.]"
```

**Why quotes matter:** During calibration, reviewers compare their scores to the examples. If the examples are descriptions rather than quotes, reviewers cannot determine whether their artifact matches.

---

### Gate Guidance {#gate-guidance}

Gates prevent bad scores from being hidden by weighted averages. But every gate is a potential false reject — a gate can fail a genuinely good artifact because it misses one element.

**When to use `gate: false`:**
- The criterion contributes to the score but a low score here doesn't invalidate the whole artifact
- Most criteria should be `gate: false` or `severity: advisory`

**When to use `severity: major`:**
- The criterion covers the most important quality dimension for this artifact type
- A score below 2 here means the artifact cannot serve its primary purpose
- Example: missing views in a viewpoint-centred profile

**When to use `severity: critical`:**
- The criterion covers a compliance-level concern — mandatory control, regulatory requirement, or minimum governance standard
- A failure here means the artifact cannot proceed in any state
- Reserve this for absolute must-haves: usually 0–1 per profile

**Target gate distribution per profile:**
```
gate: false          → 60-70% of criteria
severity: advisory   → 10-20% of criteria
severity: major      → 1-2 criteria
severity: critical   → 0-1 criteria
```
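The target distribution above can be checked mechanically across a drafted profile. A minimal sketch, assuming each criterion's `gate` is either `false` or a mapping with a `severity` key (as in the complete example later in this guide); the function names and thresholds are taken from the distribution above, not from EAROS tooling:

```python
def gate_distribution(criteria: list[dict]) -> dict[str, int]:
    """Count criteria per gate class: false / advisory / major / critical."""
    counts = {"false": 0, "advisory": 0, "major": 0, "critical": 0}
    for c in criteria:
        gate = c.get("gate", False)
        if not gate:
            counts["false"] += 1
        else:
            # Assumption: a gate mapping without severity defaults to advisory.
            counts[gate.get("severity", "advisory")] += 1
    return counts

def gating_warnings(criteria: list[dict]) -> list[str]:
    """Flag over- and under-gating against the target distribution above."""
    n = len(criteria)
    d = gate_distribution(criteria)
    warnings = []
    if d["major"] > 2 or d["critical"] > 1:
        warnings.append("over-gated: too many major/critical gates")
    if d["false"] < 0.6 * n:
        warnings.append("over-gated: fewer than 60% of criteria are gate: false")
    if d["major"] + d["critical"] == 0:
        warnings.append("under-gated: no major or critical gate at all")
    return warnings
```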
**Over-gating example (bad):**
```yaml
# 5 major gates in a 10-criterion profile
# Result: almost any artifact with a weak section fails the whole review
# This defeats the purpose of the weighted average
```

**Under-gating example (bad):**
```yaml
# 0 gates in a security profile
# Result: an artifact with no security controls at all can still "pass" on a high average
# Gates exist precisely to prevent this
```

---

## Complete Criterion Example

**Incomplete (bad) — will fail schema validation and produce unreliable scores:**
```yaml
- id: PM-ROOT-01
  question: "Does the post-mortem identify the root cause?"
  scoring_guide:
    "0": "No root cause"
    "3": "Root cause identified"
  gate: false
```

Missing 9 of 13 required fields. Evaluators have no guidance on scores 1, 2, 4; no evidence to look for; no examples; no decision tree.
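Missing level descriptors, as in the incomplete example above, are cheap to detect before calibration. A minimal sketch, with a plain dict standing in for the parsed YAML; nothing here is part of the EAROS tooling:

```python
def missing_scoring_levels(criterion: dict) -> list[str]:
    """Return the ordinal levels "0"-"4" absent from scoring_guide."""
    guide = criterion.get("scoring_guide") or {}
    return [level for level in ("0", "1", "2", "3", "4") if level not in guide]
```

Applied to the incomplete example, this reports the undefined levels `["1", "2", "4"]`.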
**Complete (good) — all 13 fields present:**
```yaml
- id: PM-ROOT-01
  question: "Does the post-mortem identify the root cause with supporting evidence?"
  description: >
    Root cause identification is the primary purpose of a post-mortem. Without a
    specific, evidenced root cause, the post-mortem cannot drive effective prevention.
    "Human error" and "process failure" are not root causes — they are proxies for
    the conditions that enabled the failure.
  metric_type: ordinal
  scale: [0, 1, 2, 3, 4, "N/A"]
  gate:
    enabled: true
    severity: major
    failure_effect: Cannot pass if the post-mortem does not identify a specific root cause
  required_evidence:
    - explicit root cause statement (not just a timeline)
    - contributing factors (conditions that enabled the root cause)
    - evidence supporting the root cause conclusion (data, logs, timeline analysis)
  scoring_guide:
    "0": "No root cause section, or root cause stated as 'human error' / 'process failure' without further analysis."
    "1": "Root cause implied or mentioned superficially with no supporting evidence."
    "2": "Specific root cause stated but supporting evidence absent or limited to timeline."
    "3": "Specific root cause stated with supporting evidence and at least one contributing factor identified."
    "4": "Specific root cause with full evidence chain, multiple contributing factors, and causal relationships mapped."
  anti_patterns:
    - '"Root cause: Human error" — this is a contributing factor, not a root cause'
    - '"Root cause: TBD" — post-mortem cannot be used for prevention without this'
    - Root cause stated but no evidence provided for why this was the cause
  examples:
    good:
      - >
        "Root cause: Race condition in payment state machine between the timeout handler
        and the confirmation webhook processor. Contributing factors: (1) Missing mutex on
        shared payment state object (2) Timeout threshold (30s) shorter than downstream
        webhook delivery SLA (45s). Evidence: Log analysis shows 47 concurrent state
        transitions in 2.3s during the incident window (Appendix A)."
    bad:
      - "Root cause: Engineering team failed to test edge cases."
      - "Root cause: See timeline above."
  decision_tree: >
    IF no root cause section THEN score 0.
    IF root cause is 'human error' or 'process failure' without further drill-down THEN score 0-1.
    IF specific technical root cause stated but no evidence THEN score 2.
    IF specific root cause with supporting evidence THEN score 3.
    IF specific root cause AND full evidence chain AND contributing factors mapped THEN score 4.
  remediation_hints:
    - Apply the "5 Whys" technique to drill below human error to systemic causes
    - Attach log excerpts or data to support the stated root cause
    - Add a contributing factors section listing the conditions that enabled the root cause
```

---

## Criterion Review Checklist

Before saving a criterion, verify:

- [ ] `question` is specific and answerable from artifact content alone
- [ ] `scoring_guide` uses artifact-content language, not evaluator-action language
- [ ] `scoring_guide` distinguishes clearly between each adjacent pair (0/1, 1/2, 2/3, 3/4)
- [ ] `decision_tree` resolves the most common ambiguous case (usually 2 vs. 3)
- [ ] `required_evidence` lists specific artifact elements, not general categories
- [ ] `examples.good` contains a realistic quote or paraphrase from a strong artifact
- [ ] `examples.bad` contains the actual common failure mode, not just an empty section
- [ ] `gate` assignment is deliberate (not defaulted)
- [ ] `remediation_hints` are specific verb-first actions, not general advice
@@ -0,0 +1,158 @@
# Design Methods — EAROS Profile Author

This file describes the 5 design methods for EAROS profiles. Read this before choosing a method (Step 2).

---

## Why Design Methods Matter

A profile's design method shapes its dimensional structure and criterion types. Choosing the wrong method produces criteria that feel disconnected from the artifact type — assessors struggle to apply them, calibration fails, and the profile is abandoned.

The five methods are not about the content of the architecture — they are about the primary *evaluative lens* the profile applies.

---

## Method A — Decision-Centred

**Best for:** ADRs, investment reviews, exception requests, approval documents

**Core question:** "Is this document adequate to support a governance decision?"

**Why:** Decision-focused artifacts are evaluated primarily on whether they enable a clear, informed decision. The architecture content matters less than the decision structure: was the context clear, were alternatives considered, is the rationale sound, is the decision reversible or final?

**Dimension structure typically includes:**
- Decision context and framing (why is a decision needed?)
- Options analysis (what was considered, why rejected)
- Decision statement and rationale
- Reversibility and revisit conditions
- Stakeholder alignment

**Signature criteria:**
- Options presented with comparative analysis (not just the chosen option)
- Decision consequences made explicit
- Revisit/escalation conditions named

**Example profile:** `profiles/adr.yaml`

**Key indicator to choose Method A:** The primary artifact purpose is to get approval or record a governance decision — not to describe an architecture in full.

---

## Method B — Viewpoint-Centred

**Best for:** Capability maps, reference architectures, solution architectures, platform blueprints

**Core question:** "Does this artifact address the concerns of all relevant stakeholders through appropriate architectural views?"

**Why:** Viewpoint-centred artifacts are evaluated on their completeness across multiple perspectives (context, functional, deployment, data, security) and how well those views address the stated stakeholder concerns. The presence and quality of views is the primary quality signal.

**Dimension structure typically includes:**
- Views and diagrams coverage
- Stakeholder concern coverage
- Cross-view consistency
- Notation and annotation quality

**Signature criteria:**
- Multiple views present (minimum: context, component, deployment)
- Views explicitly mapped to stakeholder concerns
- Consistent terminology and component naming across views

**Example profiles:** `profiles/reference-architecture.yaml`, `profiles/solution-architecture.yaml`

**Key indicator to choose Method B:** The artifact is expected to contain multiple diagrams and the audience needs different perspectives on the same system.

---

## Method C — Lifecycle-Centred

**Best for:** Transition designs, roadmaps, handover documents, migration plans

**Core question:** "Does this artifact support the full lifecycle — current state, future state, and the path between them?"

**Why:** Lifecycle artifacts are evaluated on whether they adequately describe the journey, not just the destination. Current state, future state, transition steps, dependencies, and rollback conditions are all essential. An artifact that only describes the target state fails because it leaves delivery teams without a path.

**Dimension structure typically includes:**
- Current state description
- Future/target state description
- Transition pathway (phases, milestones, dependencies)
- Risk and rollback
- Ownership across lifecycle phases

**Signature criteria:**
- Current state explicitly described (not just assumed)
- Transition steps sequenced with dependencies
- Rollback or abort conditions named

**Example profile:** `profiles/roadmap.yaml`

**Key indicator to choose Method C:** The artifact describes a change over time, not just a static design.

---

## Method D — Risk-Centred

**Best for:** Security architectures, regulatory compliance designs, resilience architectures, threat models

**Core question:** "Does this artifact identify, mitigate, and accept risks at a level appropriate for the risk domain?"

**Why:** Risk-centred artifacts are evaluated on completeness of risk identification and adequacy of mitigations, not just architectural soundness. The primary failure mode is incomplete risk coverage — threats not considered, mitigations not proportionate, residual risk not accepted by a named authority.

**Dimension structure typically includes:**
- Risk identification scope and completeness
- Mitigation design and proportionality
- Residual risk acceptance
- Control implementation evidence
- Compliance coverage

**Signature criteria:**
- Threat model or risk register covering defined scope
- Mitigations proportionate to risk likelihood × impact
- Named authority accepting residual risks
- Control-to-requirement traceability

**Key indicator to choose Method D:** The primary purpose of the artifact is to demonstrate that risks have been identified and managed — not just to describe the architecture.

**Note:** The security and regulatory overlays (`overlays/security.yaml`, `overlays/regulatory.yaml`) often apply alongside Method D profiles but are not substitutes for a D-method profile when the artifact is primarily risk-focused.

---

## Method E — Pattern-Library

**Best for:** Recurring reference patterns, platform blueprints, golden-path designs

**Core question:** "Is this pattern sufficiently defined, validated, and reusable that teams can adopt it without extensive customization?"

**Why:** Pattern-library artifacts are evaluated on their reusability and adoption-readiness, not just their technical correctness. A pattern that is architecturally sound but undocumented at the decision-point level isn't reusable — teams have to recreate the design rationale each time. The primary failure mode is an artifact that is a good architecture but a poor pattern.

**Dimension structure typically includes:**
- Pattern definition and applicability conditions
- Implementation completeness (is there enough to act on?)
- Reuse guidance (when to use, when not to use, variants)
- Evolution and versioning
- Validation evidence (is this proven in production?)

**Signature criteria:**
- Named applicability conditions ("use this when X, don't use when Y")
- Canonical implementation example
- Known variants documented
- Adoption metrics or proven instances

**Example profile:** `profiles/reference-architecture.yaml` (uses Method E)

**Key indicator to choose Method E:** Teams are expected to adopt this pattern repeatedly — it needs to work as a template, not just a one-time design.

---

## Choosing Between Methods — Decision Guide

| Situation | Method |
|-----------|--------|
| "We need to approve a specific decision" | A — Decision-Centred |
| "We need multiple teams to understand this system from different angles" | B — Viewpoint-Centred |
| "We're describing how to get from A to B over time" | C — Lifecycle-Centred |
| "The primary purpose is to show risks are controlled" | D — Risk-Centred |
| "Teams will use this repeatedly as a template" | E — Pattern-Library |

**When in doubt:** Method B (Viewpoint-Centred) is the most general and works well for most architecture artifacts that don't fit a more specific method.
|
|
158
|
+
**Combinations:** Some artifacts have secondary concerns from another method. Handle this by choosing the primary method and adding criteria from the secondary concern to relevant dimensions, rather than trying to combine two methods.
|
|
# Profile Validation Checklist — EAROS Profile Author

This checklist must be completed before publishing a profile or overlay. Read it before Step 6 (pre-publication checks).

---

## Why a Checklist?

Profiles that skip validation steps cause silent failures in evaluations. A missing field in one criterion might not be caught until the profile is used in production, by which time evaluations have already been produced on a flawed rubric. Running this checklist before publishing catches errors when they are cheap to fix.

---

## Part 1 — Structural Validation

### 1.1 Required Top-Level Fields

Check that each of these is present and correctly typed:

| Field | Required Value |
|-------|---------------|
| `rubric_id` | Unique string, format `EAROS-<ARTIFACT>-<NNN>` |
| `version` | Semver format (e.g., `1.0.0`) |
| `kind` | `profile` or `overlay` |
| `title` | Non-empty string |
| `status` | `draft` (for new profiles) |
| `effective_date` | `YYYY-MM-DD` format |
| `owner` | `enterprise-architecture` |
| `artifact_type` | Snake_case string |
| `inherits` | `[EAROS-CORE-002]` (profiles only; absent for overlays) |
| `design_method` | One of the 5 valid methods |
| `dimensions` | Non-empty list |
| `scoring` | Object with required sub-fields |
| `outputs` | Object with required sub-fields |
| `calibration` | Object with `required_before_production: true` |
| `change_log` | List with at least one entry |

**Overlays specifically:**
- [ ] No `inherits` field
- [ ] `scoring.method: append_to_base_rubric`
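
As an illustration, a minimal profile header that satisfies 1.1 might look like the following. All values are placeholders (in particular, the `design_method` spelling is assumed, not taken from the standard):

```yaml
rubric_id: EAROS-EXAMPLE-001        # placeholder ID
version: 1.0.0
kind: profile
title: Example Artifact Profile
status: draft
effective_date: 2025-01-01
owner: enterprise-architecture
artifact_type: example_artifact
inherits: [EAROS-CORE-002]          # profiles only; omit for overlays
design_method: viewpoint_centred    # placeholder; use one of the 5 valid methods
dimensions: []                      # must be non-empty in a real profile
scoring: {}                         # see 1.2 for required sub-fields
outputs: {}                         # see 1.3 for required sub-fields
calibration:
  required_before_production: true
change_log:
  - "1.0.0: initial draft"
```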

### 1.2 Scoring Block

```yaml
scoring:
  scale: 0-4 ordinal plus N/A
  method: gates_first_then_weighted_average  # overlays: append_to_base_rubric
  thresholds:
    pass: No critical gate failure, overall >= 3.2, no dimension < 2.0
    conditional_pass: No critical gate failure, overall 2.4-3.19
    rework_required: Overall < 2.4 or repeated weak dimensions
    reject: Critical gate failure or mandatory control breach
    not_reviewable: Evidence insufficient for core gate criteria
  na_policy: Exclude N/A criteria from denominator; evaluator must justify N/A
  confidence_policy: Confidence reported separately, must not modify score
```
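
The thresholds above amount to a small decision procedure. A minimal sketch, assuming gates were already evaluated, dimension scores already averaged, and equal dimension weights (the real method is weighted; `not_reviewable` handling is omitted):

```python
def verdict(dimension_scores, critical_gate_failed=False):
    """Map averaged dimension scores (0-4) to a verdict per the thresholds above."""
    if critical_gate_failed:
        return "reject"  # critical gate failure overrides any score
    overall = sum(dimension_scores) / len(dimension_scores)
    if overall >= 3.2 and min(dimension_scores) >= 2.0:
        return "pass"
    if overall >= 2.4:
        return "conditional_pass"
    return "rework_required"
```

Note that a single dimension below 2.0 blocks `pass` even when the overall average clears 3.2.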

### 1.3 Outputs Block

```yaml
outputs:
  require_evidence_refs: true
  require_confidence: true
  require_actions: true
  require_evidence_class: true
  require_evidence_anchors: true
```

---

## Part 2 — Criterion Completeness

For every criterion, verify all 13 v2 fields are present:

| Field | Check |
|-------|-------|
| `id` | [ ] Present, unique, format `<ARTIFACT>-<AREA>-<NN>` |
| `question` | [ ] Present, specific, single concern |
| `description` | [ ] Present, explains WHY this matters |
| `metric_type` | [ ] Value is exactly `ordinal` |
| `scale` | [ ] Value is exactly `[0, 1, 2, 3, 4, "N/A"]` |
| `gate` | [ ] Either `false` or object with `enabled`, `severity`, `failure_effect` |
| `required_evidence` | [ ] Non-empty list of specific artifact elements |
| `scoring_guide` | [ ] All keys "0" through "4" present with content |
| `anti_patterns` | [ ] Non-empty list |
| `examples.good` | [ ] Non-empty list with realistic quote or paraphrase |
| `examples.bad` | [ ] Non-empty list with the actual common failure mode |
| `decision_tree` | [ ] Non-empty string with IF/THEN branches |
| `remediation_hints` | [ ] Non-empty list of verb-first actions |
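
For reference, a criterion carrying all 13 fields might look like this. The content is illustrative, not taken from any real profile:

```yaml
- id: EXAMPLE-DEP-01                # illustrative ID
  question: Are failure modes documented for each external dependency?
  description: >
    Undocumented failure modes force each consuming team to rediscover
    degradation behaviour during incidents.
  metric_type: ordinal
  scale: [0, 1, 2, 3, 4, "N/A"]
  gate: false
  required_evidence:
    - dependency inventory with failure-mode column
  scoring_guide:
    "0": No dependencies or failure modes documented
    "1": Dependencies listed, no failure modes
    "2": Failure modes documented for some critical dependencies
    "3": Failure modes documented for all critical dependencies
    "4": Failure modes and mitigations documented for every dependency
  anti_patterns:
    - Listing dependencies without describing behaviour on failure
  examples:
    good:
      - "If the pricing service is unavailable, orders queue for up to 15 minutes"
    bad:
      - "Depends on the pricing service"
  decision_tree: "IF no dependency inventory THEN 0; ELSE IF no failure modes THEN 1; ELSE score by coverage"
  remediation_hints:
    - Add a failure-mode column to the dependency inventory
```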

---

## Part 3 — ID Uniqueness

Before publishing, verify no ID collisions:

- [ ] Profile `rubric_id` not already used in `core/`, `profiles/`, or `overlays/`
- [ ] All criterion IDs (`id` fields) unique across ALL rubric files in the repo
- [ ] All dimension IDs unique within this profile file

To check: scan `core/core-meta-rubric.yaml`, all files in `profiles/`, and all files in `overlays/` for any matching IDs.

**Common mistake:** Using short IDs like `D1` or `CRT-01` that collide with core rubric dimension IDs.
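
The scan can be approximated in a few lines of Python. This sketch uses a regex rather than a YAML parser, so it may over-match `id:`-like strings inside prose fields:

```python
import re
from collections import Counter
from pathlib import Path

# Matches `id:` and `rubric_id:` keys, optionally as list items (`- id: ...`)
ID_RE = re.compile(r"^\s*-?\s*(?:rubric_)?id:\s*([A-Za-z0-9_-]+)\s*$", re.M)

def find_duplicate_ids(paths):
    """Return every id/rubric_id value that appears more than once across `paths`."""
    counts = Counter()
    for p in paths:
        counts.update(ID_RE.findall(Path(p).read_text()))
    return sorted(v for v, n in counts.items() if n > 1)
```

Run it over `core/`, `profiles/`, and `overlays/` together; any output at all is a publish blocker.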

---

## Part 4 — Gate Distribution Check

Review gate assignments across the profile and verify:

| Gate Type | Target Count | Actual Count |
|-----------|-------------|--------------|
| `critical` | 0–1 | [ ] |
| `major` | 1–2 | [ ] |
| `advisory` | 0–3 | [ ] |
| `gate: false` | Most criteria | [ ] |

**Red flags:**
- More than 2 `major` gates → likely over-gating; review whether all are truly fatal
- 0 gates on a security/compliance-focused profile → likely under-gating
- `critical` gate on a non-compliance criterion → review whether this is warranted
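
Once criteria are loaded, the tally is mechanical. A sketch, assuming each criterion is a dict whose `gate` field follows the shape checked in Part 2 (`false` or an object with `enabled` and `severity`):

```python
from collections import Counter

def gate_distribution(criteria):
    """Tally gate severities so the table above can be filled in."""
    tally = Counter()
    for criterion in criteria:
        gate = criterion.get("gate", False)
        if gate and gate.get("enabled"):
            tally[gate["severity"]] += 1
        else:
            tally["gate: false"] += 1  # disabled gates count as non-gated
    return tally
```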

---

## Part 5 — Criterion Count Check

| Check | Target | Actual |
|-------|--------|--------|
| Total profile-specific criteria | 5–12 | [ ] |
| New dimensions | 2–6 | [ ] |
| Criteria that duplicate core criteria | 0 | [ ] |

To verify: read `core/core-meta-rubric.yaml` and compare each new criterion's `question` to ensure it covers a genuinely different concern.

---

## Part 6 — Schema Validation

If you have a YAML validator available, validate against `standard/schemas/rubric.schema.json`.

If not, manually verify the most common schema violations:
- [ ] All string keys are quoted where required (especially numeric keys in `scoring_guide`: `"0":`, `"1":`, etc.)
- [ ] Two-space indentation throughout (not 4-space, not tabs)
- [ ] Lists use `- item` format, not inline `[item1, item2]`, for multi-line lists
- [ ] Multi-line descriptions use `>` block scalar, not `|` (unless you need to preserve newlines)
- [ ] File name matches convention: `<artifact-type>.yaml` (kebab-case)
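
Two of the manual checks (tabs and unquoted `scoring_guide` keys) are easy to automate even without a schema validator. A rough sketch, not a substitute for real schema validation:

```python
import re

def quick_style_check(yaml_text):
    """Flag the most common hand-editing mistakes in a rubric YAML file."""
    problems = []
    if "\t" in yaml_text:
        problems.append("tab indentation found")
    # An indented bare `0:` .. `4:` key is almost always an unquoted scoring_guide key
    if re.search(r"^\s+[0-4]:\s", yaml_text, re.M):
        problems.append('unquoted numeric key (write "0":, "1": ...)')
    return problems
```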

---

## Part 7 — Pre-Calibration Checklist

Before the profile can be used in production:

- [ ] At least 3 real artifacts collected for calibration (1 strong ≥3.2, 1 weak <2.4, 1 ambiguous)
- [ ] Calibration artifacts documented in `calibration/gold-set/` or accessible to reviewers
- [ ] 2+ reviewers identified for independent scoring
- [ ] Profile `status: draft` until calibration is complete

After successful calibration:
- [ ] Profile status changed to `candidate` (then `approved` after governance sign-off)
- [ ] Calibration results saved to `calibration/results/`
- [ ] Worked evaluation example added to `examples/`
- [ ] Profile mentioned in CHANGELOG.md

---

## Final Sign-Off

Before committing the profile:
- [ ] All Part 1–6 checks completed
- [ ] No duplicate IDs found
- [ ] Gate distribution reviewed and approved
- [ ] Criterion count within 5–12
- [ ] `earos-validate` skill run on the full repo (catches cross-file issues)