@trohde/earos 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +156 -0
- package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
- package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
- package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
- package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
- package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
- package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
- package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
- package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
- package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
- package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
- package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
- package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
- package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
- package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
- package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
- package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
- package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
- package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
- package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
- package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
- package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
- package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
- package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
- package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
- package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
- package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
- package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
- package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
- package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
- package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
- package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
- package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
- package/assets/init/.claude/CLAUDE.md +4 -0
- package/assets/init/AGENTS.md +293 -0
- package/assets/init/CLAUDE.md +635 -0
- package/assets/init/README.md +507 -0
- package/assets/init/calibration/gold-set/.gitkeep +0 -0
- package/assets/init/calibration/results/.gitkeep +0 -0
- package/assets/init/core/core-meta-rubric.yaml +643 -0
- package/assets/init/docs/consistency-report.md +325 -0
- package/assets/init/docs/getting-started.md +194 -0
- package/assets/init/docs/profile-authoring-guide.md +51 -0
- package/assets/init/docs/terminology.md +126 -0
- package/assets/init/earos.manifest.yaml +104 -0
- package/assets/init/evaluations/.gitkeep +0 -0
- package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
- package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
- package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
- package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
- package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
- package/assets/init/overlays/data-governance.yaml +94 -0
- package/assets/init/overlays/regulatory.yaml +154 -0
- package/assets/init/overlays/security.yaml +92 -0
- package/assets/init/profiles/adr.yaml +225 -0
- package/assets/init/profiles/capability-map.yaml +223 -0
- package/assets/init/profiles/reference-architecture.yaml +426 -0
- package/assets/init/profiles/roadmap.yaml +205 -0
- package/assets/init/profiles/solution-architecture.yaml +227 -0
- package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
- package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
- package/assets/init/research/reference-architecture-research.md +751 -0
- package/assets/init/standard/EAROS.md +1426 -0
- package/assets/init/standard/schemas/artifact.schema.json +1295 -0
- package/assets/init/standard/schemas/artifact.uischema.json +65 -0
- package/assets/init/standard/schemas/evaluation.schema.json +284 -0
- package/assets/init/standard/schemas/rubric.schema.json +383 -0
- package/assets/init/templates/evaluation-record.template.yaml +58 -0
- package/assets/init/templates/new-profile.template.yaml +65 -0
- package/bin.js +188 -0
- package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
- package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
- package/dist/assets/arc-CyDBhtDM.js +1 -0
- package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
- package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
- package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
- package/dist/assets/channel-CiySTNoJ.js +1 -0
- package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
- package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
- package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
- package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
- package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
- package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
- package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
- package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
- package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
- package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
- package/dist/assets/clone-ylgRbd3D.js +1 -0
- package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
- package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
- package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
- package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
- package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
- package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
- package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
- package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
- package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
- package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
- package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
- package/dist/assets/graph-DLQn37b-.js +1 -0
- package/dist/assets/index-BFFITMT8.js +650 -0
- package/dist/assets/index-H7f6VTz1.css +1 -0
- package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
- package/dist/assets/init-Gi6I4Gst.js +1 -0
- package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
- package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
- package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
- package/dist/assets/katex-B1X10hvy.js +261 -0
- package/dist/assets/layout-C0dvb42R.js +1 -0
- package/dist/assets/linear-j4a8mGj7.js +1 -0
- package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
- package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
- package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
- package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
- package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
- package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
- package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
- package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
- package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
- package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
- package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
- package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
- package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
- package/dist/index.html +23 -0
- package/export-docx.js +1583 -0
- package/init.js +353 -0
- package/manifest-cli.mjs +207 -0
- package/package.json +83 -0
- package/schemas/artifact.schema.json +1295 -0
- package/schemas/artifact.uischema.json +65 -0
- package/schemas/evaluation.schema.json +284 -0
- package/schemas/rubric.schema.json +383 -0
- package/serve.js +238 -0
package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md
@@ -0,0 +1,263 @@

# Calibration Protocol

Step-by-step procedure for running an EAROS calibration exercise. Follow this exactly — shortcuts produce unreliable results.

---

## Phase 1 — Setup

### 1.1 Identify the rubric being calibrated

Confirm:
- `rubric_id` (e.g., `EAROS-POSTMORTEM-001`)
- Current `status` (should be `draft` before calibration; changes to `candidate` after passing)
- Date of last calibration (if any)

### 1.2 Assemble the calibration artifact set

**Minimum set (3 artifacts):**

| Artifact | Expected score range | Purpose |
|----------|---------------------|---------|
| Strong | ≥ 3.2 overall | Confirms the rubric correctly identifies high-quality artifacts |
| Weak | < 2.4 overall | Confirms the rubric correctly identifies poor artifacts |
| Ambiguous | 2.4–3.2 overall | Tests boundary detection — the hardest cases |

**Recommended set (4 artifacts):**
Add a fourth: an artifact with a known gate failure. This verifies the gate logic and its effect on status.

**Requirements for calibration artifacts:**
- Real artifacts (not synthetic examples created to match the rubric)
- Representative of the artifact type in practice
- Diverse: different teams, different systems, different quality levels
- Stored in `calibration/gold-set/[rubric-id]/` with known scores recorded

### 1.3 Record known gold-set scores

For each artifact, the gold-set must contain:
```yaml
artifact_id: [ID]
artifact_title: [title]
known_status: [pass | conditional_pass | rework_required | reject]
known_overall_score: [X.X]
criterion_scores:
  - criterion_id: [ID]
    gold_score: [0-4 or N/A]
    gold_evidence_class: [observed | inferred | external]
    gold_rationale: "[why this score was assigned]"
```

Do NOT share these gold-set scores with the evaluator before they complete their independent assessment.

---
## Phase 2 — Independent Scoring

### 2.1 Read the rubric

Read the full rubric YAML for the profile being calibrated. Read `core/core-meta-rubric.yaml` as well — core criteria are always evaluated.

Do NOT read the gold-set scores. Close any document containing them.

### 2.2 Score each artifact independently

Follow the standard earos-assess 8-step DAG for each artifact:

```
structural_validation
→ content_extraction
→ criterion_scoring
→ cross_reference_validation
→ dimension_aggregation
→ challenge_pass
→ calibration
→ status_determination
```

For each criterion, record:
```yaml
criterion_id: [ID]
score: [0-4 or N/A]
evidence_anchor: "[direct quote or specific section reference]"
evidence_class: [observed | inferred | external]
confidence: [high | medium | low]
rationale: "[why this score was assigned — specific, not vague]"
```

**Critical rules during scoring:**
- Extract an evidence anchor BEFORE assigning a score (RULERS protocol)
- If no evidence can be found, record N/A and explain why — do not score from impression
- Do not look at the gold-set until scoring for all artifacts is complete

### 2.3 Complete the challenge_pass step

Before finalising scores, run the internal challenge:
- For each score of 3 or 4: "What specific evidence in the artifact earns this score?"
- If the answer is vague, revisit the scoring_guide and decision_tree
- Document the challenge outcome in the evaluation record

### 2.4 Determine status for each artifact

Apply the status thresholds:
1. Check gate failures first — any critical gate failure = reject
2. Check overall score: ≥ 3.2 = pass, 2.4–3.19 = conditional_pass, < 2.4 = rework_required
3. Check dimension floor: no dimension < 2.0 for a pass status

Record the determined status.
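The threshold logic above is mechanical enough to sketch in code. A minimal Python sketch — the function name and input shapes are illustrative, not part of any EAROS schema, and the demotion when the dimension floor fails is an assumption (the protocol only states the floor):

```python
def determine_status(overall_score, dimension_scores, critical_gate_failed):
    """Apply the status thresholds in order: gates, overall score, dimension floor."""
    # 1. Any critical gate failure is an automatic reject.
    if critical_gate_failed:
        return "reject"
    # 2. Overall-score bands.
    if overall_score < 2.4:
        return "rework_required"
    status = "pass" if overall_score >= 3.2 else "conditional_pass"
    # 3. Dimension floor: a pass requires every dimension >= 2.0.
    #    (Demoting to conditional_pass is an assumption; the protocol
    #    only says a pass needs no dimension below 2.0.)
    if status == "pass" and any(d < 2.0 for d in dimension_scores):
        status = "conditional_pass"
    return status
```

The ordering matters: a gate failure must short-circuit before the score bands are consulted.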
---

## Phase 3 — Score Comparison

### 3.1 Access the gold-set scores

Only after completing your independent assessment for all artifacts, read the gold-set scores.

### 3.2 Build the comparison table

For each artifact, for each criterion:
```
artifact_id | criterion_id | gold_score | agent_score | delta | agreement
[ID]        | [ID]         | [score]    | [score]     | [d]   | exact|within_1|disagreement
```

Where:
- `delta = gold_score - agent_score`
- Positive delta = agent under-scored (too harsh)
- Negative delta = agent over-scored (too generous)
- `agreement = exact` if delta = 0; `within_1` if |delta| = 1; `disagreement` if |delta| ≥ 2

### 3.3 Flag systematic bias

Compute mean delta across all criteria:
- Mean delta near 0 = no systematic bias
- Mean delta > +0.5 = agent consistently under-scores (too harsh)
- Mean delta < -0.5 = agent consistently over-scores (too generous)

Systematic bias is more serious than random disagreement — it indicates a calibration problem that will affect every future evaluation.
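The delta, agreement, and bias rules in 3.2–3.3 can be sketched as one small helper (a sketch; the function name and return shape are illustrative):

```python
def compare_scores(pairs):
    """pairs: list of (gold_score, agent_score) per criterion, scores 0-4."""
    rows = []
    for gold, agent in pairs:
        delta = gold - agent
        if delta == 0:
            agreement = "exact"
        elif abs(delta) == 1:
            agreement = "within_1"
        else:
            agreement = "disagreement"
        rows.append({"delta": delta, "agreement": agreement})
    mean_delta = sum(r["delta"] for r in rows) / len(rows)
    # Bias flags per the 3.3 thresholds.
    if mean_delta > 0.5:
        bias = "too_harsh"       # agent consistently under-scores
    elif mean_delta < -0.5:
        bias = "too_generous"    # agent consistently over-scores
    else:
        bias = "no_systematic_bias"
    return rows, mean_delta, bias
```

Note that a mean delta near zero can still hide large offsetting disagreements — the per-row `agreement` field is what catches those.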
---

## Phase 4 — Metric Computation

Read `references/agreement-metrics.md` for computation procedures.

Compute and record:
1. Binary agreement (artifact status match rate)
2. Weighted kappa per criterion
3. Spearman ρ across artifact overall scores
4. Reliability flag per criterion: `reliable` | `moderate` | `unreliable`
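For the handful of artifacts in a calibration set, Spearman ρ needs no external library. A pure-Python sketch using average ranks for ties — illustrative only, not the canonical procedure from `agreement-metrics.md`:

```python
def _ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(gold, agent):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    rg, ra = _ranks(gold), _ranks(agent)
    n = len(rg)
    mg, ma = sum(rg) / n, sum(ra) / n
    cov = sum((a - mg) * (b - ma) for a, b in zip(rg, ra))
    sg = sum((a - mg) ** 2 for a in rg) ** 0.5
    sa = sum((b - ma) ** 2 for b in ra) ** 0.5
    return cov / (sg * sa)
```

With only 3–4 artifacts ρ is coarse; treat it as a sanity check on rank ordering, not a precise statistic.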
---

## Phase 5 — Root Cause Analysis

For each `disagreement` (|delta| ≥ 2) or `unreliable` criterion:

**Five root cause categories:**

1. **Ambiguous level descriptor**
   - Symptom: Both evaluators can justify their score from the current scoring_guide
   - Fix: Sharpen the level descriptors, especially at the 2/3 boundary
   - Example fix: Add a specific observable feature that distinguishes score 2 from score 3

2. **Missing or vague decision_tree**
   - Symptom: The decision_tree doesn't resolve the specific case that caused disagreement
   - Fix: Add an explicit branch for the ambiguous condition
   - Example fix: "IF X present but Y absent THEN score 2 (not 3)"

3. **Evidence classification disagreement**
   - Symptom: Evaluators agree on what the artifact says but classify differently (observed vs. inferred)
   - Fix: Add an `examples.good` entry clarifying what qualifies as `observed` for this criterion
   - Example fix: "Quote must include [specific field] to qualify as observed"

4. **Anti-pattern not captured**
   - Symptom: The artifact exhibits a form of failure the rubric doesn't explicitly call out
   - Fix: Add the new failure mode to `anti_patterns` and update the scoring_guide accordingly

5. **Context sensitivity**
   - Symptom: The criterion means something different for this artifact than the rubric anticipates
   - Fix: Add conditional guidance to the `description` or `decision_tree`
   - Example fix: "For artifacts covering only a single domain, N/A is acceptable for criterion X"
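For category 2, the fix lands in the criterion's `decision_tree` field. A hypothetical sketch — the IF/THEN phrasing follows the example above, but the block-scalar layout and the rollback scenario are illustrative, not taken from an actual profile:

```yaml
# Hypothetical branch added after 2-vs-3 disagreements clustered on
# artifacts that listed rollback steps without citing test evidence.
decision_tree: |
  IF rollback steps are listed AND test evidence is cited THEN score 3
  IF rollback steps are listed BUT untested THEN score 2 (not 3)
  IF no rollback steps are listed THEN score per level 0-1 descriptors
```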
---

## Phase 6 — Calibration Report

Save results to: `calibration/results/[rubric-id]-calibration-[YYYY-MM-DD].yaml`

Report format:

```yaml
calibration_id: CAL-[RUBRIC-ID]-[YYYYMMDD]
rubric_id: [rubric ID]
calibration_date: [today]
artifacts_scored: [N]
evaluator: agent

summary_metrics:
  binary_agreement: "[X%]"
  mean_delta: [X.X]
  spearman_rho: [X.XX]
  overall_verdict: [pass_for_production | borderline | not_ready]

criterion_results:
  - criterion_id: [ID]
    gold_score: [score]
    agent_score: [score]
    delta: [delta]
    agreement: [exact | within_1 | disagreement]
    reliability_flag: [reliable | moderate | unreliable]
    root_cause: "[if disagreement — brief root cause analysis]"
    recommended_rubric_change: "[if unreliable — specific change]"

artifact_results:
  - artifact_id: [ID]
    gold_status: [status]
    agent_status: [status]
    gold_score: [X.X]
    agent_score: [X.X]
    status_match: [true | false]
    notes: "[any notable findings]"

recommendations:
  proceed_to_production: [true | false]
  blocking_issues:
    - "[issue preventing production use, if any]"
  rubric_improvements:
    - criterion_id: [ID]
      field_to_change: [scoring_guide | decision_tree | anti_patterns | examples]
      change_description: "[specific change]"
```

---

## Phase 7 — Post-Calibration Actions

**If verdict is `pass_for_production`:**
1. Change profile `status` from `draft` to `candidate`
2. Update `calibration_date` in the profile YAML
3. Schedule next calibration (6 months or at next major rubric revision)

**If verdict is `borderline`:**
1. Implement recommended rubric improvements
2. Re-run calibration on the same artifacts to verify improvement
3. Do not promote to `candidate` until re-calibration passes

**If verdict is `not_ready`:**
1. Analyse root causes for all unreliable criteria
2. Revise `scoring_guide`, `decision_tree`, and `examples` for each
3. Rebuild the calibration artifact set if the artifacts were unrepresentative
4. Re-run the full calibration from Phase 2

---

## Recalibration Triggers

Recalibrate when any of these occur:
- A criterion's `scoring_guide` or `decision_tree` is materially changed
- A new overlay is applied alongside this profile
- Agreement drops below targets in a post-production review
- New artifact formats appear (new diagramming tools, new document templates)
- The agent model is updated and behaviour may differ
- Governance expectations change materially
package/assets/init/.agents/skills/earos-create/SKILL.md
@@ -0,0 +1,257 @@

---
name: earos-create
description: "Create a new EAROS rubric — core rubric, artifact profile, or cross-cutting overlay. Use this skill when someone wants to \"create a rubric\", \"new profile\", \"new overlay\", \"define criteria for\", \"make an assessment rubric for\", \"I need a rubric for\", \"how do I assess [artifact type]\", \"create evaluation criteria\", \"build a scoring framework\", \"new EAROS rubric\", \"add a rubric for [type]\", \"we don't have a rubric for\", \"extend EAROS for\", \"create evaluation standards for\", or any request to create, define, or build evaluation criteria for architecture artifacts. Also triggers on \"I need something to score [artifact type]\", \"how do I make EAROS work for [artifact type]\", \"we need criteria for [artifact type]\", or \"I want to add [artifact type] to EAROS\". This skill supersedes earos-profile-author for creating new rubrics from scratch."
---

# EAROS Create Skill

You are an architecture governance consultant guiding the creation of a new EAROS rubric. This is a design process, not a form-filling exercise. A good rubric requires careful thinking before any YAML is written — the quality of the interview determines the quality of the rubric.

**The most common failure mode:** jumping to criteria before understanding what quality looks like for this artifact type. Resist that. The questions in Step 3 are sequenced to build understanding from the ground up.

---

## Step 0 — Load Reference Files

Before the interview begins, read:
1. `earos.manifest.yaml` (repo root) — the authoritative registry; lists all existing profiles and overlays with their paths, IDs, and statuses
2. `core/core-meta-rubric.yaml` — understand what the core already covers (never duplicate it)

Use the manifest to show the user what already exists during Step 2. Do not list `profiles/` or `overlays/` directories directly — read from the manifest.

> **For what depends on what and how to check for conflicts**, see `references/dependency-rules.md`.

---

## Step 1 — Detect the Rubric Type

Ask: **What do you want to create?**

Present three options with brief explanations:

| Type | Use when |
|------|----------|
| **Profile** | You're adding evaluation criteria for a new artifact type (post-mortems, data contracts, platform handover docs). The most common case. |
| **Overlay** | You're adding criteria that cut across multiple artifact types (AI governance, resilience, cost transparency). Applied by context, not artifact type. |
| **Core rubric** | You're replacing or extending the universal foundation. This affects every artifact type. Rare — usually a governance decision. |

If the user is unsure:
- "Is this something unique to a specific artifact type, or would it apply to many types?" → profile vs. overlay
- "Are you building on top of the existing core, or replacing its foundations?" → profile/overlay vs. core rubric

> **For the full profile-vs-overlay decision framework**, see `references/dependency-rules.md#profile-vs-overlay`.

---

## Step 2 — Check Dependencies

**For a core rubric:**
- No dependencies. Warn: changing the core affects all profiles and overlays.
- Ask: "Is the existing `EAROS-CORE-002` insufficient, or do you need a supplementary core for a specific domain? Modifying the core requires a governance decision."

**For a profile:**
- Confirm `core/core-meta-rubric.yaml` exists. If it doesn't: "No core rubric exists. Create one first, or proceed with a standalone profile?"
- Show existing profiles. Ask: "Does a profile for this artifact type already exist? Here's what we have: [list]. Is this a new type or a revision of an existing one?"

**For an overlay:**
- Show existing profiles and overlays.
- Ask: "Which artifact types will this overlay apply to? Here are the current profiles: [list]. Confirm this overlay is additive — it should not duplicate concerns covered in existing overlays."
- Check: does a similar overlay already exist? (e.g., if they want "AI ethics", does the security overlay already cover it?)

> **For detailed dependency checks**, see `references/dependency-rules.md`.

---
## Step 3 — Guided Interview

Work through these questions **one topic at a time**. Don't list them all at once. Wait for the answer before asking the next. The goal is to understand the artifact type well enough to write criteria that reliably distinguish a strong artifact from a weak one.

**Explain why each question matters as you ask it — understanding the reason helps the user give better answers.**

### 3a — Artifact Identity

1. **What is this artifact type?** Name and one-sentence definition.
   *(Why: Drives rubric_id, artifact_type field, and dimension names.)*

2. **What decision does this artifact support?** Who reads it and what do they do with the information?
   *(Why: EAROS criteria are always tied to decision-support. Criteria for "helping the Architecture Board approve" differ from "helping a delivery team know what to build".)*

3. **How often does this artifact type appear — and what are the stakes?**
   *(Why: High-frequency/low-stakes artifacts need lightweight criteria. Low-frequency/high-stakes artifacts justify more gates.)*

### 3b — Quality Markers

4. **Describe a great version of this artifact you've seen.** What made it stand out?
   *(Why: Positive markers are harder to articulate than failure modes. Good examples generate the level 4 scoring guide descriptors — the hardest ones to write.)*

5. **What does a bad version look like?** The 3 most common ways this artifact type fails.
   *(Why: Common failures become `anti_patterns` and level 0–1 scoring guide descriptors. These are the most effective disambiguation tools for AI agents.)*

6. **What's missing from an average version?** The artifact exists, the author tried — but something's always not quite there.
   *(Why: This generates the level 2–3 descriptors — the most critical for calibration, since most artifacts land in this range.)*

### 3c — Structure and Method

7. **What are the 3–5 most important things a reviewer must check?** These become your candidate criteria.
   *(Why: If you can't name 5 things, you'll over-specify. Naming them forces prioritisation before writing.)*

8. **Which design method fits best?**

   | Method | Best For |
   |--------|----------|
   | A — Decision-Centred | ADRs, investment reviews, exception requests |
   | B — Viewpoint-Centred | Capability maps, reference architectures |
   | C — Lifecycle-Centred | Transition designs, roadmaps, handover docs |
   | D — Risk-Centred | Security, regulatory, resilience assessments |
   | E — Pattern-Library | Recurring platform blueprints, reference patterns |

   *(Why: The design method shapes the dimensional structure and where to put emphasis.)*

### 3d — Gates and Stakes

9. **What would make you reject this artifact outright, no matter how well-written the rest is?**
   *(Why: That's your critical gate. At most 1–2 of these — if everything is critical, nothing is.)*

10. **What would make you say "passes with conditions"?**
    *(Why: That's major gate territory — serious enough to cap the outcome but not an automatic reject.)*

> **For gate guidance with worked examples**, see `references/criterion-writing-guide.md#gate-guidance`.

---
## Step 4 — Draft the Rubric YAML

Use the interview answers to generate the complete YAML. Before drafting, read:
- `templates/new-profile.template.yaml` — the scaffold to start from
- `references/criterion-writing-guide.md` — all 13 v2 required fields with worked examples

**Profile header skeleton:**
```yaml
rubric_id: EAROS-[ARTIFACT]-001
version: 1.0.0
kind: profile
title: "[Artifact Type] Profile"
status: draft
effective_date: "[today]"
next_review_date: "[6 months from today]"
owner: enterprise-architecture
artifact_type: [artifact_type_snake_case]
inherits:
  - EAROS-CORE-002
design_method: [method from Step 3c]
```

**Overlay header skeleton:**
```yaml
rubric_id: EAROS-OVR-[CONCERN]-001
version: 1.0.0
kind: overlay
title: "[Concern] Overlay"
status: draft
effective_date: "[today]"
artifact_type: any
# No 'inherits' field
scoring:
  method: append_to_base_rubric
```

**Criteria count targets:**
- Profile: 5–12 criteria across 2–6 dimensions (core has 10; don't add more than needed)
- Overlay: 2–6 criteria (overlays inject focused concerns, not full rubrics)
- Core rubric: 8–12 criteria across 6–10 dimensions

Every criterion must have all 13 v2 fields. See `references/criterion-writing-guide.md` for the complete field list with a worked example.

**ID assignment:**
- Check `core/`, `profiles/`, and `overlays/` for existing IDs before assigning.
- Profile criteria: `[ARTIFACT-ABBREV]-[TOPIC]-[NN]` — e.g., `PM-ROOT-01` for post-mortem root cause
- Overlay criteria: `[OVR-ABBREV]-[TOPIC]-[NN]` — e.g., `AI-TRANS-01` for AI transparency
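A quick format check for candidate IDs can be sketched in Python. The regex generalises from the two examples above and is an assumption, not a normative grammar:

```python
import re

# Assumed pattern: uppercase segments joined by hyphens, two-digit suffix
# (e.g. PM-ROOT-01, AI-TRANS-01). Not a normative EAROS grammar.
CRITERION_ID = re.compile(r"^[A-Z]{2,}(-[A-Z]{2,})+-\d{2}$")

def check_ids(new_ids, existing_ids):
    """Return problems: malformed IDs and collisions with existing criteria."""
    problems = []
    for cid in new_ids:
        if not CRITERION_ID.match(cid):
            problems.append(f"malformed: {cid}")
        if cid in existing_ids:
            problems.append(f"duplicate: {cid}")
    return problems
```

`existing_ids` would be gathered from `core/`, `profiles/`, and `overlays/` before assigning anything new.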

---

## Step 5 — Validate

After generating YAML, run these checks before presenting to the user.

> **Full validation checklist in `references/validation-checklist.md`.**

Quick checks:
1. **ID uniqueness**: no criterion ID matches anything in `core/`, `profiles/`, or `overlays/`
2. **Criteria count**: 5–12 for profiles, 2–6 for overlays
3. **All 13 v2 fields present** on every criterion (see `references/criterion-writing-guide.md`)
4. **Gate balance**: at most 1–2 major gates, 0–1 critical gates per profile
5. **Core overlap**: no criterion duplicates what `EAROS-CORE-002` already covers
6. **Schema conformance**: structure matches `standard/schemas/rubric.schema.json`

Tell the user: "Run `earos-validate` after placing the file to catch any remaining schema errors."

---

## Step 6 — Calibration Readiness

Before the rubric can be used in a live governance process, it must be calibrated. Give the user:

```
Pre-Calibration
[ ] 3+ real artifacts collected:
    - 1 strong artifact (expected overall score >= 3.2)
    - 1 weak artifact (expected overall score < 2.4)
    - 1 ambiguous artifact (borderline 2.4–3.2)
[ ] 2+ reviewers identified (at least one domain expert)
[ ] YAML complete and schema-valid (run earos-validate)

Calibration Run
[ ] Each reviewer scores all artifacts independently
[ ] Cohen's kappa computed per criterion:
    - Target > 0.70 for well-defined, observable criteria
    - Target > 0.50 for subjective or judgment-heavy criteria
[ ] Disagreements of >= 2 points identified and resolved against level descriptors
[ ] decision_tree entries updated where disagreements clustered

Post-Calibration
[ ] Profile status: draft → candidate
[ ] Worked example saved to examples/example-[artifact-type].evaluation.yaml
[ ] CHANGELOG.md updated
[ ] earos-validate run on complete repository
```
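Cohen's kappa needs no extra tooling. A minimal sketch for two reviewers, corrected for chance agreement via each rater's marginal score distribution (the scores below are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement from each rater's marginal distribution.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Two reviewers score five artifacts on one criterion (1-4 scale).
print(round(cohens_kappa([3, 2, 4, 3, 1], [3, 2, 3, 3, 1]), 2))  # 0.71
```

A value of 0.71 would clear the 0.70 bar for observable criteria; compute it per criterion, not just overall, so weak level descriptors can be located.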

---

## Step 7 — File Placement

- Profiles → `profiles/<artifact-type>.yaml` (kebab-case, lowercase)
- Overlays → `overlays/<concern>.yaml`
- Core rubric → `core/<name>.yaml`

After placing the file, **update `earos.manifest.yaml`** by running:

```
node tools/editor/bin.js manifest add <path/to/new-file.yaml>
```

This registers the new rubric in the manifest so skills, the editor sidebar, and the validate check can discover it automatically. If the CLI is not available, manually add an entry under the correct section (`profiles`, `overlays`, or `core`) in `earos.manifest.yaml`.
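For the manual path, mirror the structure of an existing entry rather than inventing fields. A hypothetical entry might look like this (the field names are illustrative, not the confirmed manifest schema):

```yaml
# Hypothetical shape -- copy an existing entry from earos.manifest.yaml;
# these field names are illustrative only.
profiles:
  - path: profiles/post-mortem.yaml
    rubric_id: EAROS-PROF-PM-001
    status: draft
```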

Remind the user to run `earos-validate` after placing the file and before committing.

---

## Non-Negotiable Rules

1. **Interview before YAML.** Never generate the rubric before completing Steps 3a–3d.
2. **Never duplicate core criteria.** Read `EAROS-CORE-002` before finalising criteria.
3. **5–12 criteria for profiles, 2–6 for overlays.** More criteria = less reliable calibration.
4. **All 13 v2 fields on every criterion.** Incomplete criteria cannot be calibrated.
5. **Gates are rare.** At most 1–2 major gates and 0–1 critical gates per profile. Over-gating creates false rejects.
6. **Calibrate before production.** `status: draft` must not be used in a live governance process.
7. **Explain your questions.** Tell the user why each question matters — this is a design conversation, not an interrogation.

---

## When to Read Which Reference File

| When | Read |
|------|------|
| Checking what already exists | `references/dependency-rules.md` |
| Deciding profile vs. overlay | `references/dependency-rules.md#profile-vs-overlay` |
| Writing criteria | `references/criterion-writing-guide.md` |
| Deciding gate types and weights | `references/criterion-writing-guide.md#gate-guidance` |
| Deepening the interview | `references/rubric-interview-guide.md` |
| Before publishing the file | `references/validation-checklist.md` |