@trohde/earos 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +156 -0
- package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
- package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
- package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
- package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
- package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
- package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
- package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
- package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
- package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
- package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
- package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
- package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
- package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
- package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
- package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
- package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
- package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
- package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
- package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
- package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
- package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
- package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
- package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
- package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
- package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
- package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
- package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
- package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
- package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
- package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
- package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
- package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
- package/assets/init/.claude/CLAUDE.md +4 -0
- package/assets/init/AGENTS.md +293 -0
- package/assets/init/CLAUDE.md +635 -0
- package/assets/init/README.md +507 -0
- package/assets/init/calibration/gold-set/.gitkeep +0 -0
- package/assets/init/calibration/results/.gitkeep +0 -0
- package/assets/init/core/core-meta-rubric.yaml +643 -0
- package/assets/init/docs/consistency-report.md +325 -0
- package/assets/init/docs/getting-started.md +194 -0
- package/assets/init/docs/profile-authoring-guide.md +51 -0
- package/assets/init/docs/terminology.md +126 -0
- package/assets/init/earos.manifest.yaml +104 -0
- package/assets/init/evaluations/.gitkeep +0 -0
- package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
- package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
- package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
- package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
- package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
- package/assets/init/overlays/data-governance.yaml +94 -0
- package/assets/init/overlays/regulatory.yaml +154 -0
- package/assets/init/overlays/security.yaml +92 -0
- package/assets/init/profiles/adr.yaml +225 -0
- package/assets/init/profiles/capability-map.yaml +223 -0
- package/assets/init/profiles/reference-architecture.yaml +426 -0
- package/assets/init/profiles/roadmap.yaml +205 -0
- package/assets/init/profiles/solution-architecture.yaml +227 -0
- package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
- package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
- package/assets/init/research/reference-architecture-research.md +751 -0
- package/assets/init/standard/EAROS.md +1426 -0
- package/assets/init/standard/schemas/artifact.schema.json +1295 -0
- package/assets/init/standard/schemas/artifact.uischema.json +65 -0
- package/assets/init/standard/schemas/evaluation.schema.json +284 -0
- package/assets/init/standard/schemas/rubric.schema.json +383 -0
- package/assets/init/templates/evaluation-record.template.yaml +58 -0
- package/assets/init/templates/new-profile.template.yaml +65 -0
- package/bin.js +188 -0
- package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
- package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
- package/dist/assets/arc-CyDBhtDM.js +1 -0
- package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
- package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
- package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
- package/dist/assets/channel-CiySTNoJ.js +1 -0
- package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
- package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
- package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
- package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
- package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
- package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
- package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
- package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
- package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
- package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
- package/dist/assets/clone-ylgRbd3D.js +1 -0
- package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
- package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
- package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
- package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
- package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
- package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
- package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
- package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
- package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
- package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
- package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
- package/dist/assets/graph-DLQn37b-.js +1 -0
- package/dist/assets/index-BFFITMT8.js +650 -0
- package/dist/assets/index-H7f6VTz1.css +1 -0
- package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
- package/dist/assets/init-Gi6I4Gst.js +1 -0
- package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
- package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
- package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
- package/dist/assets/katex-B1X10hvy.js +261 -0
- package/dist/assets/layout-C0dvb42R.js +1 -0
- package/dist/assets/linear-j4a8mGj7.js +1 -0
- package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
- package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
- package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
- package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
- package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
- package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
- package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
- package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
- package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
- package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
- package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
- package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
- package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
- package/dist/index.html +23 -0
- package/export-docx.js +1583 -0
- package/init.js +353 -0
- package/manifest-cli.mjs +207 -0
- package/package.json +83 -0
- package/schemas/artifact.schema.json +1295 -0
- package/schemas/artifact.uischema.json +65 -0
- package/schemas/evaluation.schema.json +284 -0
- package/schemas/rubric.schema.json +383 -0
- package/serve.js +238 -0
package/assets/init/.agents/skills/earos-report/SKILL.md

@@ -0,0 +1,85 @@
---
name: earos-report
description: "Generate executive reports from EAROS evaluation records. Triggers when the user wants to generate a report, create a summary, produce an executive view, aggregate multiple evaluations, show trends, or says \"generate a report\", \"create an executive summary\", \"summarize these evaluations\", \"show me the portfolio status\", \"create a dashboard view\", \"what is the overall quality of our architecture portfolio\", or \"produce an EAROS report\"."
---

# EAROS Report — Executive Report Generator

You generate executive-quality reports from one or more EAROS evaluation records. Reports are audience-aware, status-accurate, and action-oriented. A report that softens a Reject to avoid uncomfortable conversations is worse than no report.

**Why this matters:** An evaluation record is a technical artifact — a YAML file full of criterion scores. Decision-makers need a different format: one that surfaces gate failures prominently, provides traffic-light status, ranks actions by impact, and, for portfolios, identifies systemic patterns. The report is where evaluation value is realised.

## What You Need Before Starting

Ask the user three questions if not already clear:

1. **Scope** — one evaluation record, or multiple (portfolio)?
2. **Records location** — specific file path, or scan `evaluations/` and `examples/`?
3. **Audience** — Architecture Board | Executive | Delivery team | Audit?

The audience changes emphasis: Audit audiences get an evidence quality section; Executive audiences get fewer criterion details and more action focus.

## Step 1 — Locate and Load Records

Scan the specified paths. For each evaluation record (a loading sketch follows the list):
- Verify required fields: `evaluation_id`, `artifact_ref`, `status`, `overall_score`, `criterion_results`, `dimension_scores`
- Note any structural issues — flag them but do not block reporting
- Extract: status, overall_score, gate_failures, dimension_scores, top_actions, evaluation_date, evaluators
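
A minimal sketch of that check, assuming the YAML has already been parsed into a plain object (the parser itself is out of scope here). The field names are the ones listed above; `loadRecord` and `LoadResult` are illustrative names, not part of this package:

```typescript
// Required fields per the checklist above.
const REQUIRED_FIELDS = [
  "evaluation_id",
  "artifact_ref",
  "status",
  "overall_score",
  "criterion_results",
  "dimension_scores",
] as const;

interface LoadResult {
  record: Record<string, unknown>;
  issues: string[]; // structural problems: flag them, do not block reporting
}

// Collect missing-field issues instead of throwing, so a malformed
// record still produces a report (with its problems noted).
function loadRecord(parsed: Record<string, unknown>): LoadResult {
  const issues = REQUIRED_FIELDS.filter((f) => parsed[f] === undefined).map(
    (f) => `missing required field: ${f}`,
  );
  return { record: parsed, issues };
}
```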

## Step 2 — Select Report Mode

- **1 record** → Single-artifact report (read `references/single-artifact-template.md`)
- **2+ records** → Portfolio report (read `references/portfolio-template.md`)

## Step 3 — Build the Report

Read the appropriate template reference file now. Populate it from the evaluation data.

**Traffic light rules (non-negotiable; sketched in code after the list):**
- Pass ≥ 3.2, no gate failures, no dimension < 2.0 → 🟢
- Conditional Pass 2.4–3.19, no critical gate failures → 🟡
- Rework Required < 2.4, or dimension < 2.0 → 🟠
- Reject: any critical gate failure → 🔴
- Not Reviewable: insufficient evidence → ⚫
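
As a sketch, the rules above reduce to an ordered series of checks. The thresholds are taken verbatim from the list; the insufficient-evidence flag is assumed to be determined upstream, and the handling of a major (non-critical) gate failure at a high score follows the "cap at conditional_pass" effect shown in the portfolio template's gate table:

```typescript
type Light = "🟢" | "🟡" | "🟠" | "🔴" | "⚫";

interface EvalSummary {
  overallScore: number; // weighted average on the 0.0–4.0 scale
  dimensionScores: number[]; // per-dimension averages
  gateFailures: { severity: "critical" | "major" }[];
  insufficientEvidence: boolean; // assumed to be set upstream by the assessor
}

function trafficLight(e: EvalSummary): Light {
  // Reject: any critical gate failure, regardless of score.
  if (e.gateFailures.some((g) => g.severity === "critical")) return "🔴";
  // Not Reviewable: insufficient evidence.
  if (e.insufficientEvidence) return "⚫";
  // Rework Required: overall below 2.4, or any dimension below the 2.0 floor.
  if (e.overallScore < 2.4 || e.dimensionScores.some((d) => d < 2.0)) return "🟠";
  // Pass: 3.2 or above with no gate failures at all.
  if (e.overallScore >= 3.2 && e.gateFailures.length === 0) return "🟢";
  // Everything else (including a major gate failure at a high score)
  // lands on Conditional Pass.
  return "🟡";
}
```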

**Gate failures go first.** In any report format, gate failures are the most important finding. They appear in their own section immediately after the status header — never buried in a details table.

**Actions must be specific.** "Improve traceability" is not an action. "Add a traceability matrix linking each business driver to the design sections that implement it (TRC-01)" is an action.

## Step 4 — Audience Adjustments

| Audience | Emphasis |
|----------|----------|
| Architecture Board | Full criterion table, gate analysis, evidence quality |
| Executive | Status + gate failures + top 5 actions only |
| Delivery team | Criterion detail + specific remediation steps |
| Audit | Evidence quality section (observed / inferred / external breakdown) |

For Audit audiences, add an evidence quality section from the template.

## Step 5 — Save and Confirm

Save as:
- Single artifact: `evaluations/[evaluation_id]-report.md`
- Portfolio: `evaluations/portfolio-report-[YYYY-MM-DD].md`

Confirm the file path with the user. Offer Word/PDF conversion via the docx or pdf skills if needed.

## Non-Negotiable Rules

1. **Traffic lights must be accurate.** Do not soften Reject to Conditional Pass to ease the conversation.
2. **Gate failures are prominent.** Never bury them in a criterion details table.
3. **Evidence quality is separate from score.** An artifact can score 3.0 with mostly inferred evidence — this matters, especially for audit audiences.
4. **No trends without data.** Don't synthesise trend lines from a single data point per artifact.
5. **Portfolio health requires honest math.** Compute pass rates from actual statuses — don't round up.

## When to Read References

| When | Read |
|------|------|
| Generating a single-artifact report | `references/single-artifact-template.md` |
| Generating a portfolio or trend report | `references/portfolio-template.md` |
| Unsure about traffic light assignment | `references/single-artifact-template.md#status` |
| Building the gate failure section | Either template — gate section is consistent |
| Audit audience — evidence quality section | `references/single-artifact-template.md#evidence-quality` |
package/assets/init/.agents/skills/earos-report/references/portfolio-template.md

@@ -0,0 +1,181 @@
# Portfolio Report Template

Use this template when generating a report covering 2 or more EAROS evaluation records. Populate from all evaluation records. Trend analysis requires at least 2 evaluation dates per artifact.

---

```markdown
# EAROS Portfolio Assessment Report

**Report Date:** [today's date]
**Artifacts Assessed:** [N]
**Period Covered:** [earliest evaluation_date] to [latest evaluation_date]
**Report Generated By:** [agent/human]
**Rubrics Applied:** [list unique rubric_ids across all evaluations]

---

## Portfolio Health Dashboard

| Status | Count | % of portfolio |
|--------|-------|---------------|
| 🟢 Pass | [N] | [%] |
| 🟡 Conditional Pass | [N] | [%] |
| 🟠 Rework Required | [N] | [%] |
| 🔴 Reject | [N] | [%] |
| ⚫ Not Reviewable | [N] | [%] |
| **Total** | **[N]** | **100%** |

**Portfolio Health:** [classification below]
- **Good** (≥70% Pass or Conditional Pass): Architecture governance is producing acceptable quality
- **Needs Attention** (40–69% Pass or Conditional Pass): Systemic quality issues evident; intervention recommended
- **Poor** (<40% Pass or Conditional Pass): Governance process or artifact quality requires immediate attention

**Summary statement:** [1–2 sentence overall assessment]

---

## Artifact Summary

| Artifact | Type | Eval Date | Status | Score | Gate Failures |
|----------|------|-----------|--------|-------|--------------|
| [title] | [type] | [date] | [🟢/🟡/🟠/🔴/⚫] [status] | [X.X] | [N or None] |
| ... | | | | | |

**Sort order:** By status (Reject first, then Rework, then Conditional, then Pass), then by score ascending within each group. This surfaces the most urgent issues first.

---

## Dimension Analysis

*Average scores across all artifacts, by dimension.*

| Dimension | Avg Score | Lowest | Highest | Below 2.0 Count |
|-----------|-----------|--------|---------|-----------------|
| [dim name] | [avg] | [min — artifact] | [max — artifact] | [N artifacts] |
| ... | | | | |

**Systemic weaknesses** (dimensions with average < 2.5 across portfolio):
[List each weak dimension and note which artifact types or teams it affects]

**Portfolio strengths** (dimensions with average ≥ 3.2 across portfolio):
[List each strong dimension]

---

## Gate Failure Analysis

*Which gate criteria are failing and how often.*

[If no gate failures across portfolio:]
> No gate failures across the portfolio.

[If gate failures present:]

| Criterion | Gate Severity | Artifacts Failed | Failure Rate | Status Effect |
|-----------|--------------|-----------------|-------------|---------------|
| [ID] | [critical/major] | [N] ([list artifact names]) | [%] | [reject/cap at conditional_pass] |
| ... | | | | |

**Systemic gate failures** (criterion failing in ≥30% of artifacts):
[Analysis of why the same gate keeps failing — is it a rubric clarity issue, an authoring practice issue, or a genuine governance gap?]

> If a gate criterion fails across multiple artifacts from different teams, it suggests a shared knowledge gap rather than individual team failures. Consider: training, template improvements, or rubric clarification.

---

## Trend Analysis

*Only include if 2+ evaluations exist for any artifact, or the portfolio spans 2+ time periods.*

| Artifact | Previous Score | Current Score | Change | Status Change |
|----------|---------------|---------------|--------|---------------|
| [title] | [X.X] ([date]) | [X.X] ([date]) | [+/-X.X] | [same/improved/declined] |
| ... | | | | |

**Portfolio trend:** [Improving / Stable / Declining]
- Average score change: [+/-X.X] since [period]
- Pass rate change: [+/-X%] since [period]

**Notable changes:**
- [Artifact that improved most significantly and why]
- [Artifact that declined most significantly and why]

---

## Dimension Trends

*Only include if multiple evaluation periods exist.*

| Dimension | [Period 1] Avg | [Period 2] Avg | Trend |
|-----------|---------------|---------------|-------|
| [dim name] | [X.X] | [X.X] | [↑ improving / → stable / ↓ declining] |

---

## Evidence Quality Summary

*Aggregate evidence quality across all evaluations.*

| Evidence Class | Total Criteria | % of scored criteria |
|---------------|---------------|---------------------|
| Observed | [N] | [%] |
| Inferred | [N] | [%] |
| External | [N] | [%] |

**Portfolio evidence reliability:** [Strong (>80% observed) / Moderate / Low]

[If evidence reliability is Low:]
> ⚠️ The portfolio has a high inferred-evidence ratio. This may indicate that artifacts are not making content explicit enough for assessors to find direct evidence. Recommend sharing the EAROS evidence writing guide with artifact authors.

---

## Portfolio Recommendations

*Top systemic improvements. Focus on issues affecting multiple artifacts.*

| # | Recommendation | Artifacts Affected | Expected Impact |
|---|---------------|-------------------|-----------------|
| 1 | [Specific, actionable recommendation] | [N artifacts] | [Pass rate / score improvement] |
| 2 | [Recommendation] | [N artifacts] | [Impact] |
| ... | | | |

**Most impactful single action:** [The one change that would most improve portfolio health]

---

## Individual Artifact Summaries

*3-line summary per artifact: status, top strength, top gap.*

**[Artifact Title]** — 🟢/🟡/🟠/🔴/⚫ [Status] | Score: [X.X]
- Strength: [Specific high-scoring criterion and evidence]
- Gap: [Most critical low-scoring or gate criterion]
- Action: [Single most important improvement]

[Repeat for each artifact]

---

## Appendix: Criterion Heat Map

*Score distribution across all artifacts by criterion.*

| Criterion | Avg | [Art 1] | [Art 2] | [Art 3] | ... |
|-----------|-----|---------|---------|---------|-----|
| [ID] | [avg] | [score] | [score] | [score] | |
| ... | | | | | |

**Colour key:** 🟢 Score ≥3 | 🟡 Score 2 | 🟠 Score 1 | 🔴 Score 0 | — N/A
```

---

## Template Notes

- **Portfolio health classification** is computed from actual statuses — do not estimate (see the sketch after these notes)
- **Systemic gate failures** are the most important finding in a portfolio report — always investigate and explain
- **Trend analysis** requires dates — never synthesise trends from a single evaluation per artifact
- **Individual summaries** should be 3 lines maximum — detail belongs in per-artifact reports
- **Recommendations** must be systemic (affecting multiple artifacts) — individual artifact actions belong in the single-artifact report
- Save as: `evaluations/portfolio-report-[YYYY-MM-DD].md`
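
A minimal sketch of the health classification in the first note, plus the Artifact Summary sort order. The status strings are assumptions about how statuses appear in evaluation records; the thresholds come from the Portfolio Health Dashboard section:

```typescript
type Status =
  | "pass" | "conditional_pass" | "rework_required" | "reject" | "not_reviewable";

function portfolioHealth(statuses: Status[]): {
  passRate: number;
  health: "Good" | "Needs Attention" | "Poor";
} {
  // Pass rate counts Pass and Conditional Pass, per the classification bands.
  const passing = statuses.filter(
    (s) => s === "pass" || s === "conditional_pass",
  ).length;
  const passRate = statuses.length === 0 ? 0 : (passing / statuses.length) * 100;
  const health =
    passRate >= 70 ? "Good" : passRate >= 40 ? "Needs Attention" : "Poor";
  return { passRate, health };
}

// Artifact Summary sort: worst status first, then score ascending within each
// group. Not Reviewable's position is not specified by the template; ranking
// it alongside Rework is an assumption.
const STATUS_RANK: Record<Status, number> = {
  reject: 0,
  rework_required: 1,
  not_reviewable: 1,
  conditional_pass: 2,
  pass: 3,
};

function sortArtifacts<T extends { status: Status; score: number }>(rows: T[]): T[] {
  return [...rows].sort(
    (a, b) => STATUS_RANK[a.status] - STATUS_RANK[b.status] || a.score - b.score,
  );
}
```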
package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md

@@ -0,0 +1,168 @@
# Single-Artifact Report Template

Use this template when generating a report for one EAROS evaluation record. Populate all fields from the evaluation record. Do not omit sections — even if empty, they must appear to confirm they were checked.

---

```markdown
# EAROS Assessment Report

**Artifact:** [artifact_ref.title]
**Artifact Type:** [artifact_ref.artifact_type]
**Owner:** [artifact_ref.owner or "Not specified"]
**Version:** [artifact_ref.version or "Not specified"]
**Evaluation Date:** [evaluation_date]
**Evaluated By:** [evaluators list — names and modes (agent/human/hybrid)]
**Rubric Applied:** [rubric_id, including profile and any overlays]

---

## Status {#status}

| | |
|---|---|
| **Overall Status** | [TRAFFIC LIGHT] [STATUS LABEL] |
| **Overall Score** | [overall_score] / 4.0 |
| **Gate Failures** | [count or "None"] |

**Traffic light assignment:**
- 🟢 **Pass** — No gate failures. Overall ≥ 3.2. No dimension < 2.0.
- 🟡 **Conditional Pass** — No critical gate failures. Overall 2.4–3.19. Named remediation items required.
- 🟠 **Rework Required** — Overall < 2.4, or repeated weak dimensions, or insufficient evidence.
- 🔴 **Reject** — Critical gate failure. Cannot be approved regardless of overall score.
- ⚫ **Not Reviewable** — Evidence too incomplete to score key criteria.

---

## Gate Failures

[If gate_failures list is empty:]
> No gate failures — all gate criteria cleared.

[If gate_failures present:]
> ⚠️ Gate failures override the weighted average. The status below reflects the gate outcome, not just the score.

| Criterion | Gate Severity | Score | Effect |
|-----------|--------------|-------|--------|
| [criterion_id] | [critical/major] | [score] | [failure_effect from rubric] |

---

## Dimension Scorecard

| Dimension | Score | Weight | Status |
|-----------|-------|--------|--------|
| [dimension_name] | [score] / 4.0 | [weight] | [🟢 ≥3.2 / 🟡 2.4–3.19 / 🟠 <2.4] |
| ... | | | |
| **Weighted Overall** | **[overall_score]** | | [status label] |

**Dimension floor check:** [Pass / Fail — any dimension < 2.0?]
If any dimension < 2.0: "⚠️ [Dimension name] scored [score] — below the 2.0 floor required for Pass status."

---

## Key Findings

### Strengths
[3 bullet points. For each: criterion ID, score, and the specific evidence that earns it. One sentence per bullet.]

- **[Criterion ID] — [criterion name]** (score [X]): [Specific evidence that demonstrates strength]
- **[Criterion ID] — [criterion name]** (score [X]): [Specific evidence]
- **[Criterion ID] — [criterion name]** (score [X]): [Specific evidence]

### Critical Gaps
[3–5 bullet points. For each: criterion ID, score, and what specifically is missing. Gate criteria listed first.]

- **[Criterion ID] — [criterion name]** (score [X]) [⚠️ GATE]: [What is missing and why it matters]
- **[Criterion ID] — [criterion name]** (score [X]): [What is missing]
- ...

---

## Recommended Actions

Actions are ordered by priority. Gate-related actions must be addressed before resubmission.

| # | Action | Criterion | Priority | Suggested Owner |
|---|--------|-----------|----------|-----------------|
| 1 | [Specific, verb-first action] | [ID] | Critical — Gate | [role] |
| 2 | [Specific action] | [ID] | High | [role] |
| 3 | [Specific action] | [ID] | Medium | [role] |
| ... | | | | |

**Action quality standard:** "Add a scope section listing in-scope and out-of-scope items explicitly, including at least 3 named assumptions (SCP-01)" is a good action. "Improve scope" is not.

---

## Detailed Criterion Results

| Criterion | Dimension | Score | Conf. | Evidence Class | N/A Reason |
|-----------|-----------|-------|-------|---------------|------------|
| [ID] | [dim] | [0-4/N/A] | [H/M/L] | [obs/inf/ext] | [if N/A: reason] |
| ... | | | | | |

---

## Evidence Quality Summary {#evidence-quality}

*Include for Architecture Board and Audit audiences. Optional for delivery teams.*

| Evidence Type | Count | % of scored criteria |
|--------------|-------|---------------------|
| Observed (direct quote or section reference) | [N] | [%] |
| Inferred (reasonable interpretation) | [N] | [%] |
| External (based on standard or policy) | [N] | [%] |
| N/A (criterion genuinely not applicable) | [N] | [%] |

**Evidence reliability:** [Strong (>80% observed) / Moderate (50–80%) / Low (<50%)]

[If low confidence count > 20%:]
> ⚠️ [N] criteria scored with low confidence. Human reviewer judgment recommended for these before relying on this assessment for governance decisions.

---

## Assessment Notes

[narrative_summary from the evaluation record, if present. Otherwise: synthesise a 2–3 paragraph summary covering:]
1. What the artifact does well and why
2. The primary gaps and their downstream consequences
3. The recommended path forward (resubmit after addressing X, escalate to Y, etc.)

---

## Evaluation Metadata

| Field | Value |
|-------|-------|
| Evaluation ID | [evaluation_id] |
| Rubric version | [rubric version] |
| Evaluation mode | [agent / human / hybrid] |
| Calibration status | [calibrated / uncalibrated — if uncalibrated, note this] |
| Challenger review | [completed / not performed] |
```

---

## Challenger Score {#challenger-score}

If a challenger review was performed (earos-review skill), add this section:

```markdown
## Challenger Review

**Challenger Status:** [Concur with primary / Revised to: (status)]
**Challenger Score:** [Concur / (score) — differs from primary by (delta)]
**Challenging findings:** [N] BLOCKER / [N] MAJOR / [N] MINOR

[For each BLOCKER or MAJOR finding: criterion ID, primary score, challenger score, reason]
```

---

## Formatting Notes

- Gate failures must appear prominently — in their own section immediately after the status, not buried in the criterion table
- Traffic light emoji must be accurate — do not use 🟡 when the status is Reject
- Actions must be specific — verb first, criterion reference included
- Scores like "2.8" mean the weighted average is 2.8, not that each criterion scored 2.8
- For scores of N/A: the criterion is excluded from the denominator — document why (see the sketch after these notes)
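
A minimal sketch of the last two notes, the weighted average and the N/A exclusion. Per-criterion weights are an assumption here; substitute the rubric's own weighting scheme where it defines one:

```typescript
interface CriterionResult {
  score: number | "N/A";
  weight: number; // assumed per-criterion weight, normalised below
}

// N/A criteria drop out of both numerator and denominator, so they
// neither raise nor lower the weighted average.
function weightedOverall(results: CriterionResult[]): number | "N/A" {
  const scored = results.filter(
    (r): r is { score: number; weight: number } => r.score !== "N/A",
  );
  if (scored.length === 0) return "N/A"; // nothing scoreable
  const totalWeight = scored.reduce((sum, r) => sum + r.weight, 0);
  const weightedSum = scored.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weightedSum / totalWeight; // e.g. 2.8 means the *average* is 2.8
}
```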
package/assets/init/.agents/skills/earos-review/SKILL.md

@@ -0,0 +1,130 @@
---
name: earos-review
description: "Challenge and peer-review an existing EAROS evaluation record. Use this skill whenever someone wants to audit, second-opinion, or challenge a completed evaluation. Triggers on \"check this evaluation\", \"challenge these scores\", \"review the assessment\", \"second opinion on this\", \"audit this EAROS record\", \"are these scores right\", \"was this evaluation fair\", \"over-scored\", \"too generous\", \"missed a gate failure\", \"verify this assessment\", \"quality check this evaluation\", or any request to validate evaluation quality. Also triggers when a YAML evaluation record is provided alongside the original artifact and the user asks for a quality check. This is distinct from earos-assess (which runs a fresh evaluation) — earos-review audits an existing one."
---

# EAROS Review (Challenger) Skill

You are the challenger evaluator. Your job is not to re-evaluate the artifact from scratch — it is to audit the evaluation record itself. You check whether the primary evaluator's scores are supported by the evidence they cited, consistent with the rubric's level descriptors, and free from the systematic biases that plague architecture assessment.

**Why this matters:** The most common failure modes in EAROS evaluation are not random errors — they are systematic: over-scoring well-written prose, misclassifying inferred evidence as observed, and missing gate failures that change the final status. A challenger who knows what to look for catches these reliably. Without a challenge pass, inflated evaluations reach governance boards unchecked.

**Before running Phase 2:** Read `references/challenge-patterns.md`. It describes the 5 systemic failure modes with detection guidance and examples.

---

## Inputs Required

You need three things. If any are missing, ask before proceeding:

1. **The evaluation record** — a YAML file (usually in `evaluations/` or `examples/`)
2. **The original artifact** — the document or design that was evaluated
3. **The rubric files** — identified by `rubric_id` in the evaluation record; load from `core/`, `profiles/`, `overlays/`

Also load `standard/schemas/evaluation.schema.json` for structural validation.

---

## Phase 1 — Schema and Structural Check

*Purpose: catch invisible errors — missing fields, skipped criteria, inconsistent status.*

Check that the evaluation record has:
- [ ] All required fields: `evaluation_id`, `rubric_id`, `artifact_ref`, `evaluation_date`, `evaluators`, `status`, `overall_score`, `criterion_results`
- [ ] Every criterion from the loaded rubric appears in `criterion_results` — silently skipped criteria are a red flag
- [ ] `gate_failures` field present (even if empty)
- [ ] `recommended_actions` present
- [ ] Each criterion result has: `score`, `judgment_type`, `confidence`, `evidence_refs`, `rationale`
- [ ] Status is internally consistent: a `pass` status with a `critical` gate failure is an error

Flag every structural violation as **[SCHEMA ERROR]** in the output.
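
A sketch of two of these checks, the skipped-criteria scan and the status-consistency rule, assuming the record and rubric are already parsed. The `pass` status string and the shape of the severity field are assumptions about the record format:

```typescript
interface EvalRecord {
  status: string;
  gate_failures: { criterion_id: string; severity: "critical" | "major" }[];
  criterion_results: { criterion_id: string }[];
}

function structuralErrors(record: EvalRecord, rubricCriteria: string[]): string[] {
  const errors: string[] = [];
  const scored = new Set(record.criterion_results.map((c) => c.criterion_id));

  // Silently skipped criteria are a red flag.
  for (const id of rubricCriteria) {
    if (!scored.has(id)) {
      errors.push(`[SCHEMA ERROR] criterion ${id} missing from criterion_results`);
    }
  }

  // A pass status alongside a critical gate failure is internally inconsistent.
  const criticalGate = record.gate_failures.some((g) => g.severity === "critical");
  if (record.status === "pass" && criticalGate) {
    errors.push("[SCHEMA ERROR] status is pass but a critical gate failure is recorded");
  }
  return errors;
}
```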

---

## Phase 2 — Evidence Audit

*Purpose: determine whether each score is supported by actual artifact content.*

> **Read `references/challenge-patterns.md` before this phase.** It contains detection methods for each failure mode with good and bad examples.

For each criterion in the evaluation record:

**A. Evidence support check**
- Locate the `evidence_refs` cited in the evaluation
- Find those sections in the original artifact
- Does the excerpt actually say what the rationale claims? Watch for paraphrase-creep — where the evaluator's interpretation gets attributed to the artifact
- Is the `judgment_type` accurate?
  - `observed` requires a direct quote or clearly stated fact
  - If the evaluator inferred it → should be `inferred`
  - If they applied outside knowledge → should be `external`

**B. Score calibration check**
- Read the `scoring_guide` in the rubric for this criterion
- Does the score match the level descriptor?
- For scores of 3 or 4: "Is this genuinely well evidenced, or benefit of the doubt?"
- For scores of 0 or 1: "Did the evaluator search thoroughly?"

**C. Gate check** (sketched after this list)
- If `gate.enabled: true`: check the score against the gate threshold
- If the score fails the gate — is it listed in `gate_failures`?
- If listed as a gate failure — does the status reflect the correct effect?
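
A sketch of the gate check for one criterion. The `gate.enabled` flag follows the bullet above; the threshold field name and the below-threshold comparison direction are assumptions about the rubric schema:

```typescript
interface GateConfig {
  enabled: boolean;
  threshold: number; // assumed field name for the gate threshold
  severity: "critical" | "major";
}

// Returns true when a gate criterion failed but was not recorded in
// gate_failures: the "gate_missed" issue type in the verdict format below.
function gateMissed(
  criterionId: string,
  score: number,
  gate: GateConfig | undefined,
  listedFailures: Set<string>,
): boolean {
  if (!gate?.enabled) return false;
  const fails = score < gate.threshold; // assumed: below threshold fails the gate
  return fails && !listedFailures.has(criterionId);
}
```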

Record your verdict per criterion:

```
criterion_id: [ID]
primary_score: [from record]
challenger_verdict: agree | disagree | partial
challenger_score: [your score if different]
issue_type: over_scored | under_scored | evidence_unsupported | wrong_evidence_class | gate_missed | none
challenge_note: "[specific reason citing the rubric level descriptor]"
```

---

## Phase 3 — Systemic Pattern Analysis

*Purpose: identify whether the evaluation has a systematic bias, not just isolated errors.*

After reviewing all criteria, look for patterns across the full set:

1. **Optimistic evidence classification** — multiple criteria marked `observed` where evidence is actually `inferred`
2. **Generosity bias** — consistently scoring 3 where 2 is more accurate; benefit-of-the-doubt pattern-wide
3. **Missing evidence anchors** — rationale cites general impressions rather than specific locations
4. **Gate blindness** — gate criteria failed but not in `gate_failures`, or status doesn't reflect gate effects
5. **Confidence inflation** — `high` confidence on criteria with thin or inferred evidence

> **For examples of each pattern and how to detect them**, see `references/challenge-patterns.md`.

---

## Phase 4 — Overall Assessment

Compute (a tally sketch follows the list):
- Criteria agreed / challenged / evidence-quality issues / gate errors
- Your challenger overall score (if revised scores produce a different weighted average)
- Your challenger status recommendation (if it differs from the primary)
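
A minimal sketch of the tally, using the per-criterion verdict records from Phase 2. Equal criterion weights are assumed for the recomputed average; use the rubric's weights where it defines them:

```typescript
interface Verdict {
  criterion_id: string;
  primary_score: number;
  challenger_verdict: "agree" | "disagree" | "partial";
  challenger_score?: number; // set only when it differs from the primary
}

function challengerAssessment(verdicts: Verdict[]) {
  if (verdicts.length === 0) throw new Error("no verdicts to assess");
  const challenged = verdicts.filter((v) => v.challenger_verdict !== "agree");
  // Apply the challenger's revised score where one exists, else keep the primary's.
  const effective = verdicts.map((v) => v.challenger_score ?? v.primary_score);
  const overall = effective.reduce((sum, s) => sum + s, 0) / effective.length;
  return {
    agreed: verdicts.length - challenged.length,
    challenged: challenged.length,
    challengerOverall: overall, // compare against the record's overall_score
  };
}
```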

> **Read `references/output-template.md` before writing the report.** It contains the full format with field-by-field guidance.

---

## Non-Negotiable Rules

1. **Don't soften challenges.** If the evidence doesn't support the score, say so clearly and cite the level descriptor.
2. **Don't re-score without evidence.** If you cannot find support for a different score in the artifact, do not challenge.
3. **Gate errors are critical findings.** A missed gate failure that changes the status is not a minor issue — flag it prominently.
4. **The three evaluation types are distinct.** Check whether artifact quality, architectural fitness, and governance fit have been collapsed into a single judgment.
5. **Reference level descriptors.** Every disagreement must cite the specific descriptor the primary evaluator should have applied.

---

## When to Read Which Reference File

| When | Read |
|------|------|
| Before Phase 2 (always) | `references/challenge-patterns.md` |
| Detecting a specific failure mode | `references/challenge-patterns.md` |
| Before writing the challenger report | `references/output-template.md` |
| Unsure whether to challenge a score | `references/challenge-patterns.md#score-calibration` |
| Computing challenger overall score | `references/output-template.md#challenger-score` |