@trohde/earos 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +156 -0
- package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
- package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
- package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
- package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
- package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
- package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
- package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
- package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
- package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
- package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
- package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
- package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
- package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
- package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
- package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
- package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
- package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
- package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
- package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
- package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
- package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
- package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
- package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
- package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
- package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
- package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
- package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
- package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
- package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
- package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
- package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
- package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
- package/assets/init/.claude/CLAUDE.md +4 -0
- package/assets/init/AGENTS.md +293 -0
- package/assets/init/CLAUDE.md +635 -0
- package/assets/init/README.md +507 -0
- package/assets/init/calibration/gold-set/.gitkeep +0 -0
- package/assets/init/calibration/results/.gitkeep +0 -0
- package/assets/init/core/core-meta-rubric.yaml +643 -0
- package/assets/init/docs/consistency-report.md +325 -0
- package/assets/init/docs/getting-started.md +194 -0
- package/assets/init/docs/profile-authoring-guide.md +51 -0
- package/assets/init/docs/terminology.md +126 -0
- package/assets/init/earos.manifest.yaml +104 -0
- package/assets/init/evaluations/.gitkeep +0 -0
- package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
- package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
- package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
- package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
- package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
- package/assets/init/overlays/data-governance.yaml +94 -0
- package/assets/init/overlays/regulatory.yaml +154 -0
- package/assets/init/overlays/security.yaml +92 -0
- package/assets/init/profiles/adr.yaml +225 -0
- package/assets/init/profiles/capability-map.yaml +223 -0
- package/assets/init/profiles/reference-architecture.yaml +426 -0
- package/assets/init/profiles/roadmap.yaml +205 -0
- package/assets/init/profiles/solution-architecture.yaml +227 -0
- package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
- package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
- package/assets/init/research/reference-architecture-research.md +751 -0
- package/assets/init/standard/EAROS.md +1426 -0
- package/assets/init/standard/schemas/artifact.schema.json +1295 -0
- package/assets/init/standard/schemas/artifact.uischema.json +65 -0
- package/assets/init/standard/schemas/evaluation.schema.json +284 -0
- package/assets/init/standard/schemas/rubric.schema.json +383 -0
- package/assets/init/templates/evaluation-record.template.yaml +58 -0
- package/assets/init/templates/new-profile.template.yaml +65 -0
- package/bin.js +188 -0
- package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
- package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
- package/dist/assets/arc-CyDBhtDM.js +1 -0
- package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
- package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
- package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
- package/dist/assets/channel-CiySTNoJ.js +1 -0
- package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
- package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
- package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
- package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
- package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
- package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
- package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
- package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
- package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
- package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
- package/dist/assets/clone-ylgRbd3D.js +1 -0
- package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
- package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
- package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
- package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
- package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
- package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
- package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
- package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
- package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
- package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
- package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
- package/dist/assets/graph-DLQn37b-.js +1 -0
- package/dist/assets/index-BFFITMT8.js +650 -0
- package/dist/assets/index-H7f6VTz1.css +1 -0
- package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
- package/dist/assets/init-Gi6I4Gst.js +1 -0
- package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
- package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
- package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
- package/dist/assets/katex-B1X10hvy.js +261 -0
- package/dist/assets/layout-C0dvb42R.js +1 -0
- package/dist/assets/linear-j4a8mGj7.js +1 -0
- package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
- package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
- package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
- package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
- package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
- package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
- package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
- package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
- package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
- package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
- package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
- package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
- package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
- package/dist/index.html +23 -0
- package/export-docx.js +1583 -0
- package/init.js +353 -0
- package/manifest-cli.mjs +207 -0
- package/package.json +83 -0
- package/schemas/artifact.schema.json +1295 -0
- package/schemas/artifact.uischema.json +65 -0
- package/schemas/evaluation.schema.json +284 -0
- package/schemas/rubric.schema.json +383 -0
- package/serve.js +238 -0

@@ -0,0 +1,163 @@
# Challenge Patterns — EAROS Review

This file describes the five systemic failure modes in EAROS evaluations and how to detect each one. Read this before running the evidence audit (Phase 2).

---

## Why Systematic Patterns Matter

Individual scoring errors are expected and easy to catch — a score that doesn't match the level descriptor. Systematic patterns are harder to spot: they make the evaluation internally consistent (all scores hang together) while being consistently wrong. A generosity-biased evaluator produces scores that each seem plausible in isolation; the problem only surfaces when you apply level descriptors strictly across the whole set.

Knowing the five patterns lets you detect them quickly rather than reviewing every criterion with equal effort.

---

## Failure Mode 1 — Optimistic Evidence Classification

**What it looks like:** The evaluator marks `judgment_type: observed` but the excerpt is a paraphrase, interpretation, or inference — not a direct quote or clearly stated fact.

**Why it happens:** Evaluators unconsciously promote their interpretations to `observed` status to feel more confident. The distinction matters: `observed` evidence is more credible and defensible in governance contexts. Misclassifying `inferred` as `observed` overstates the artifact's quality.

**How to detect:**
- For each `observed` criterion, ask: "Could a skeptic argue this is an interpretation rather than a direct statement?"
- Check whether the excerpt uses quotation marks (direct quote) or paraphrase language ("the section suggests...", "it appears that...")
- Rule of thumb: `observed` + score 3 or 4 means the artifact explicitly and directly makes a strong claim. If it doesn't, the class is wrong.

**Good example (legitimately `observed`):**
```yaml
criterion_id: SCP-01
score: 3
judgment_type: observed
excerpt: >
  "In scope: Payments service, Notification service, upstream Banking Core API.
  Out of scope: Authentication (handled by IAM platform), analytics pipeline."
```
This is a direct quote with named elements — clearly `observed`.

**Bad example (should be `inferred`):**
```yaml
criterion_id: SCP-01
score: 3
judgment_type: observed
excerpt: "The document clearly defines scope boundaries across all relevant components."
rationale: "Scope is comprehensively covered."
```
The excerpt is a generalization, not a quote. This should be `inferred` at most, and the score should probably be 2.
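
A quick screening pass for this failure mode can be sketched as follows. This is an illustrative heuristic, not part of the EAROS schema: the function name and the quotation-mark check are assumptions.

```python
import re

def flag_optimistic_classification(result):
    """Return True when an 'observed' judgment deserves a closer look."""
    if result.get("judgment_type") != "observed":
        return False
    excerpt = result.get("excerpt", "")
    # Direct quotes usually carry quotation marks; paraphrases usually do not.
    has_quote = '"' in excerpt
    paraphrase = re.search(r"\b(suggests|appears|seems|implies)\b", excerpt, re.I)
    return (not has_quote) or bool(paraphrase)

# The bad example above gets flagged; the good example does not:
flag_optimistic_classification({
    "judgment_type": "observed",
    "excerpt": "The document clearly defines scope boundaries.",
})  # → True
```

A flag is only a prompt to re-read the excerpt; the skeptic question above remains the actual test.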

---

## Failure Mode 2 — Generosity Bias

**What it looks like:** Scores of 3 where 2 is more accurate; consistent benefit-of-the-doubt across multiple criteria.

**Why it happens:** Evaluators interpret "this section exists" as "this criterion is addressed." The EAROS 0–4 scale requires progressively stronger evidence for higher scores — existence alone is typically score 2 ("present but incomplete"). Score 3 requires "clearly addressed with adequate evidence."

**How to detect:**
- For every score of 3: "Does the level descriptor for '3' describe what's in the artifact, or what a good version of it would look like?"
- Check the `scoring_guide` level 2 and 3 descriptors — the boundary is usually between "present but incomplete" (2) and "clearly addressed with adequate evidence" (3)
- Pattern check: if 60%+ of criteria score 3, generosity bias is likely
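
The 60% pattern check is mechanical enough to sketch directly (a minimal sketch; the function name is illustrative):

```python
def generosity_bias_suspected(scores, threshold=0.6):
    """Flag the whole evaluation when 60%+ of criteria sit at score 3."""
    if not scores:
        return False
    return sum(1 for s in scores if s == 3) / len(scores) >= threshold

generosity_bias_suspected([3, 3, 3, 2, 3])  # → True (4 of 5 criteria at score 3)
```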

**Score 2 vs. 3 example using STK-01:**

| Score | What the artifact shows |
|-------|------------------------|
| 2 | "Technical stakeholders and business owner listed." — Explicit but incomplete (concerns not mapped) |
| 3 | "Stakeholders listed with their primary concerns mapped to each section." — Explicit and mostly complete |

**Challenge question:** "If I applied the level descriptors strictly, ignoring how well-written the artifact is, what score would this be?"

---

## Failure Mode 3 — Missing Evidence Anchors

**What it looks like:** Rationale cites general impressions ("The architecture appears well-structured for...") rather than specific locations ("Section 3.2 states...").

**Why it happens:** Evaluators write rationale from memory of the artifact rather than from specific citations. The result is unverifiable — a reviewer cannot check the claim against the artifact.

**How to detect:**
- For each criterion: "Can I locate the specific evidence in the artifact from what the rationale says?"
- Flag vague `evidence_refs.location`: "Section 3", "Throughout the document", "Various sections"
- Flag rationale using evaluative language without quotes: "appears", "seems", "comprehensive", "well-structured" without a cited excerpt

**Actionable challenge:** "The rationale cites 'Section 3' — which subsection? What does it say? The evidence anchor must be specific enough that an independent reviewer can find it."

**Good anchor:** `"Section 2.3 Scope — page 7: 'In scope: Payments service...'"`
**Bad anchor:** `"Section 2 contains scope information"`
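
The vague-location flag can be sketched as a pattern match. The phrase list below mirrors the detection bullets above and is illustrative, not exhaustive:

```python
import re

VAGUE_LOCATIONS = re.compile(
    r"^(section \d+|throughout the document|various sections)$", re.I
)

def is_vague_anchor(location):
    """A location is vague if it names only a bare top-level section or no section at all."""
    return bool(VAGUE_LOCATIONS.match(location.strip()))

is_vague_anchor("Section 3")                   # → True
is_vague_anchor("Section 2.3 Scope — page 7")  # → False
```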

---

## Failure Mode 4 — Gate Blindness

**What it looks like:** A gate criterion fails (score below threshold) but is not listed in `gate_failures`, or is listed but the status doesn't reflect the correct effect.

**Why it happens:** Evaluators compute the weighted average and set status from it, forgetting to check gates first. Or they note the low score in the criterion result but don't escalate it to a gate failure.

**How to detect:**

Step 1: For every criterion with `gate.enabled: true`, read the `failure_effect` field.
Step 2: Check the criterion score against the gate threshold (typically: any `critical` gate fails if score < 2; `major` gates flag if score < 2).
Step 3: If failed → verify it appears in `gate_failures`.
Step 4: Verify the status matches the gate effect:
- Any `critical` gate failure → status MUST be `reject`
- Any `major` gate failure → status CANNOT be `pass` (must be `conditional_pass` at best)
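
Steps 1–3 can be sketched as a single scan. This is a minimal sketch assuming the `gate.enabled`, `score`, and `gate_failures` fields shown in this document; the function name is illustrative:

```python
def missed_gate_failures(criteria, gate_failures, threshold=2):
    """Return gate criteria that breached the threshold but were not recorded."""
    missed = []
    for c in criteria:
        gated = c.get("gate", {}).get("enabled", False)
        if gated and c["score"] < threshold and c["criterion_id"] not in gate_failures:
            missed.append(c["criterion_id"])
    return missed

# CMP-01 scored 1 against a critical gate, but gate_failures is empty:
missed_gate_failures(
    [{"criterion_id": "CMP-01", "score": 1,
      "gate": {"enabled": True, "severity": "critical"}}],
    gate_failures=[],
)  # → ["CMP-01"]
```

Any criterion this scan returns must be escalated per Step 4.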

**Common scenario:**
```yaml
# CMP-01 has gate.severity: critical
# Evaluation shows CMP-01 score: 1
# gate_failures: [] ← ERROR
# status: conditional_pass ← SHOULD BE: reject
```

**Flag format:** `[CRITICAL] Gate missed: CMP-01 scored 1 (below critical threshold) but absent from gate_failures. Status must be 'reject', not 'conditional_pass'.`

---

## Failure Mode 5 — Confidence Inflation

**What it looks like:** `confidence: high` on criteria where the evidence is thin, ambiguous, or heavily inferred.

**Why it matters:** Confidence labels inform human reviewers which agent scores to trust. Inflated confidence misdirects reviewers away from criteria that actually need human scrutiny.

**How to detect:**
- `judgment_type: inferred` + `confidence: high` is almost always wrong
- `evidence_sufficiency: partial` should have `confidence: medium` at most
- Gate criteria with `confidence: low` must be flagged for human review — check that they are

**Correct confidence mapping:**

| Evidence quality | Expected confidence |
|-----------------|---------------------|
| Direct quote, unambiguous level match | high |
| Paraphrase or reasonable inference, clear level match | medium |
| Thin/ambiguous evidence, or heavy inference | low |
| No evidence found (score 0 or N/A) | high (absence is certain) |
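
The first two detection rules are mechanical and can be sketched directly (a minimal sketch; field names follow the bullets above, the function name is illustrative):

```python
def confidence_inflated(result):
    """Apply the two hard inflation rules from the detection list."""
    conf = result.get("confidence")
    if result.get("judgment_type") == "inferred" and conf == "high":
        return True
    if result.get("evidence_sufficiency") == "partial" and conf == "high":
        return True
    return False

confidence_inflated({"judgment_type": "inferred", "confidence": "high"})  # → True
```

The third rule (gate criteria at `confidence: low`) is a routing check, not an inflation check, so it stays a manual step.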

---

## Score Calibration Reference {#score-calibration}

When challenging a specific score, use this decision process:

1. Read the rubric's `scoring_guide` for the criterion — what does each level say?
2. Read the `decision_tree` — what observable conditions produce each score?
3. Apply the decision tree to the evidence cited in the evaluation record
4. If your result differs from the primary score by ≥ 1: flag as a challenge
5. If your result differs by ≥ 2: flag as a critical challenge (may affect status)
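
Steps 4–5 reduce to a score-delta check (a minimal sketch; the function name and return labels are illustrative):

```python
def challenge_level(primary, challenger):
    """Map the score delta from steps 4-5 to a challenge severity."""
    delta = abs(primary - challenger)
    if delta >= 2:
        return "critical"   # may affect evaluation status
    if delta >= 1:
        return "challenge"
    return "agree"

challenge_level(3, 1)  # → "critical"
```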

**The critical boundary is 2.0:**
- Dimensions scoring < 2.0 prevent Pass status
- A criterion at a `major` gate scoring < 2 triggers a gate failure
- Re-examine any criterion sitting at exactly 2.0 — this is where generosity bias appears most frequently

---

## Pattern Summary Table

| Failure Mode | Key Signal | Detection Method |
|--------------|-----------|-----------------|
| Optimistic evidence classification | `observed` on paraphrased content | Can I find a direct quote? |
| Generosity bias | 60%+ criteria at score 3 | Strict level descriptor check |
| Missing evidence anchors | Vague location, evaluative language | Can I find this in the artifact from this anchor? |
| Gate blindness | Low gate criteria not in `gate_failures` | Systematic gate threshold scan |
| Confidence inflation | `inferred` + `confidence: high` | Evidence quality vs. confidence mapping |

@@ -0,0 +1,180 @@
# Challenger Report Template — EAROS Review

This file contains the full output format for the challenger report. Read this before writing the report (Phase 4).

---

## Why This Format

The challenger report must serve two audiences:

1. **The primary evaluator** — who needs specific, actionable feedback on which scores to revise and why
2. **The governance reviewer** — who needs to know whether to accept the evaluation, accept it conditional on fixes, or reject it for re-scoring

The format separates these concerns: the summary and critical findings tell the governance reviewer what to do; the criterion-by-criterion section tells the evaluator what to fix.

---

## Full Challenger Report Template

```markdown
# EAROS Challenger Report

**Evaluation ID:** [from evaluation record]
**Artifact:** [artifact title from record]
**Primary Evaluator:** [evaluators from record]
**Primary Status:** [status from record]
**Primary Score:** [overall_score from record]
**Challenger Status:** [your determination, or "Concur"]
**Challenger Score:** [your weighted average if different, or "Concur"]
**Review Date:** [today]

---

## Structural Issues

[List schema errors found in Phase 1, each as:]
**[SCHEMA ERROR]** [field path] — [description of issue]

[Or: "No structural issues found."]

---

## Challenge Summary

| | Count |
|---|---|
| Criteria reviewed | [N] |
| Criteria agreed | [N] |
| Criteria challenged | [N] |
| — Over-scored | [N] |
| — Under-scored | [N] |
| — Evidence quality issues | [N] |
| Gate errors | [N] |

**Overall verdict:** [Accept as-is | Accept with noted reservations | Reject — requires re-scoring | Escalate to human reviewer]

---

## Critical Findings

[Challenges that materially affect the evaluation status — list these first]

**[CRITICAL]** [criterion_id]: [description of the critical finding]

> **Impact:** [what changes if this is corrected — e.g., "Gate failure missed; status should be 'reject' not 'conditional_pass'"]

> **Required action:** [what the primary evaluator must do]

[Or: "No critical findings that affect evaluation status."]

---

## Systemic Patterns Detected

[One paragraph per pattern detected, or "No systemic patterns detected."]

**[Pattern name]** — [description of where it appears and what the effect is]

---

## Criterion-by-Criterion Verdicts

| Criterion | Primary Score | Verdict | Challenger Score | Issue Type |
|-----------|---------------|---------|-----------------|------------|
| [ID] | [score] | Agree | — | none |
| [ID] | [score] | Disagree | [score] | [issue type] |

---

## Detailed Challenge Notes

[For each Disagree or Partial verdict:]

### [criterion_id]: [criterion question]

**Primary score:** [score] | **Challenger score:** [score] | **Issue:** [type]

**What the primary evaluation claimed:**
> "[excerpt from the evaluation record's rationale]"

**What the rubric requires at score [primary_score]:**
> "[level descriptor from scoring_guide]"

**What the artifact actually contains:**
> "[your finding from the artifact]"

**Why this is wrong:**
[1–3 sentences citing the specific mismatch between the evidence and the level descriptor]

**Correct score:** [score] — [1 sentence justification citing the level descriptor]

---

## Recommendation

**[Choose one:]**

- **Accept as-is** — All scores are supported by evidence. No gate errors. Challenger concurs with primary evaluation.
- **Accept with noted reservations** — [N] minor scoring discrepancies noted but none affect the evaluation status. Reservations listed above.
- **Reject — requires re-scoring** — [N] criteria require correction. Specifically: [list criterion IDs with critical issues]. Re-evaluation required before governance use.
- **Escalate to human reviewer** — [N] criteria have conflicting evidence or scope ambiguity requiring domain expertise to resolve.
```

---

## Field Guidance

### Challenger Score {#challenger-score}

Compute the challenger overall score only if you have challenged enough criteria to change the weighted average. The formula is the same as the primary evaluation:

```
challenger_score = sum(revised_dimension_score × dimension_weight) / sum(dimension_weight)
```

Where `revised_dimension_score` is the average of your challenger scores for criteria in that dimension (replacing only the criteria you challenged; retaining primary scores for agreed criteria).

If only 1–2 criteria are challenged and the overall score doesn't materially change, write "Concur" for the challenger score.
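
The formula and the replace-only-challenged rule can be sketched as follows. This is a minimal sketch; the input shape is an illustrative assumption, not the evaluation record schema:

```python
def challenger_score(dimensions):
    """Weighted average over dimensions, using challenger scores where present."""
    def dim_score(d):
        # Replace only the challenged criteria; keep primary scores for the rest.
        scores = [c.get("challenger_score", c["primary_score"]) for c in d["criteria"]]
        return sum(scores) / len(scores)

    total_weight = sum(d["weight"] for d in dimensions)
    return sum(dim_score(d) * d["weight"] for d in dimensions) / total_weight

challenger_score([
    {"weight": 2, "criteria": [{"primary_score": 3, "challenger_score": 1},
                               {"primary_score": 3}]},
    {"weight": 1, "criteria": [{"primary_score": 4}]},
])  # → (2.0 × 2 + 4.0 × 1) / 3 ≈ 2.67
```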

### Verdict vs. Recommendation

- **Verdict** (per criterion): agree / disagree / partial — whether the specific score is correct
- **Recommendation** (overall): what to do with the evaluation — accept / accept with reservations / reject / escalate

These are different judgments. You can agree with 90% of scores and still recommend rejection if the 10% include a missed critical gate failure.

### Issue Types

| Issue Type | Meaning |
|------------|---------|
| `over_scored` | Score is higher than the evidence supports |
| `under_scored` | Score is lower than the evidence supports |
| `evidence_unsupported` | The cited evidence does not match the rationale claim |
| `wrong_evidence_class` | Classified `observed` when evidence is actually `inferred` or `external` |
| `gate_missed` | Gate threshold breached but not listed in `gate_failures` |
| `none` | No issue — agree with primary evaluation |

### Critical Findings Section

A finding is "critical" if correcting it would change the evaluation status. Examples:
- A missed `critical` gate failure (status should be `reject` not `conditional_pass`)
- A missed `major` gate failure (status should be `conditional_pass` not `pass`)
- Multiple over-scores that bring the overall average below 3.2 (changes `pass` to `conditional_pass`)
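
Those precedence rules can be sketched as one status check. This is a minimal sketch: the 3.2 pass threshold is taken from the example above, and the function name and input shape are illustrative:

```python
def evaluation_status(overall, gate_failures):
    """Gate failures take precedence over the weighted average."""
    if any(g["severity"] == "critical" for g in gate_failures):
        return "reject"
    if any(g["severity"] == "major" for g in gate_failures):
        return "conditional_pass"
    return "pass" if overall >= 3.2 else "conditional_pass"

evaluation_status(3.6, [{"severity": "critical"}])  # → "reject"
evaluation_status(3.6, [])                          # → "pass"
```

Running the corrected scores through a check like this shows whether a finding crosses the "critical" line.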

Non-critical findings still appear in the criterion detail section but not in Critical Findings.

---

## Examples of Good vs. Bad Challenge Notes

**Good challenge note** (specific, cites level descriptor):
> **What the primary claimed:** "The architecture addresses compliance through GDPR and ISO 27001 requirements."
> **What the rubric requires at score 3:** "Specific controls mapped to specific design elements with named exceptions."
> **What the artifact contains:** Section 6 mentions "GDPR applies" and "ISO 27001 compliant" with no control-to-design mapping found in full document review.
> **Correct score:** 1 — The criterion scoring_guide level 1 states "compliance mentioned without control mapping." The primary score of 3 requires explicit control-to-design mapping, which is absent.

**Bad challenge note** (vague, no level descriptor reference):
> "The compliance section seems insufficient. I would score this lower."

This is not a valid challenge — it doesn't cite the level descriptor, doesn't reference specific evidence in the artifact, and gives no guidance on what to fix.

@@ -0,0 +1,177 @@
---
name: earos-template-fill
description: "Guide an artifact author through writing an EAROS-ready document. Use this skill when someone is writing or improving an architecture artifact and wants help making it pass review. Triggers on \"help me write this architecture\", \"guide me through the template\", \"what should I include\", \"how do I write a good solution architecture\", \"fill in this template\", \"make this artifact EAROS-ready\", \"what does EAROS need from this section\", \"how do I improve this before review\", \"what will this score\", \"will this pass\", \"what's missing from my architecture document\", \"help me write an ADR\", \"what sections do I need\", or any request for writing guidance on an architecture document before assessment. This skill coaches authors; earos-assess evaluates completed artifacts."
---

# EAROS Template Fill Skill

You are an architecture writing coach. Your job is to help authors write architecture artifacts that will score well in EAROS assessment — not by gaming the rubric, but by addressing the real quality concerns the rubric encodes.

**Why this matters:** The most common reason artifacts fail EAROS review is not bad architecture — it is content gaps that prevent assessors from finding the evidence they need. An author who knows the rubric criteria in advance can write to satisfy them explicitly, rather than hoping assessors will infer the right things from well-organised prose.

The rubric is not the enemy. Every criterion it encodes reflects a real quality concern. A risk section that lacks owners isn't just "incomplete" — it means no one is accountable when the risk materialises. A scope section without assumptions isn't just "thin" — it means the reviewer can't tell what the design is contingent on.

---

## Step 0 — Identify Artifact Type and Load Rubric

Read these before giving any guidance:
1. `core/core-meta-rubric.yaml` — the universal criteria every artifact must address
2. The matching profile:
   - Solution architecture → `profiles/solution-architecture.yaml`
   - Reference architecture → `profiles/reference-architecture.yaml`
   - ADR → `profiles/adr.yaml`
   - Capability map → `profiles/capability-map.yaml`
   - Roadmap → `profiles/roadmap.yaml`

If the artifact type is unclear, ask: "What type of architecture document are you writing?"

If the user has a draft, read it — you need to know what's already there before advising what's missing.

Tell the user: "I'm going to guide you through the EAROS criteria for a [artifact type]. I'll flag which sections are gates (failing them prevents a Pass regardless of everything else), what strong evidence looks like, and where most authors lose points."

---

## Step 1 — Completeness Pre-Check

If the user has a draft, run a rapid scan and present this table:

| Section | Present? | Notes |
|---------|----------|-------|
| Title and version | | |
| Named owner/author | | |
| Purpose and scope | | |
| Stakeholder list | | |
| Architecture content (diagrams, views) | | |
| Risks/assumptions/constraints | | |
| Compliance/standards references | | |
| Actions and decisions | | |
| Change history | | |

Identify critical gaps, especially those that map to gate criteria.
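
The rapid scan can be sketched as a keyword check over the draft. This is a rough illustrative sketch: the keyword lists are assumptions mirroring a few rows of the table above, not a normative section list.

```python
REQUIRED_SECTIONS = {
    "Purpose and scope": ["purpose", "scope"],
    "Stakeholder list": ["stakeholder"],
    "Risks/assumptions/constraints": ["risk", "assumption", "constraint"],
    "Change history": ["change history", "revision"],
}

def completeness_precheck(draft):
    """Mark a section present if any of its keywords appears in the draft text."""
    text = draft.lower()
    return {section: any(k in text for k in keywords)
            for section, keywords in REQUIRED_SECTIONS.items()}

completeness_precheck("## Purpose and Scope\n## Risks\n")
# → {'Purpose and scope': True, 'Stakeholder list': False,
#    'Risks/assumptions/constraints': True, 'Change history': False}
```

A keyword hit only fills the "Present?" column; the "Notes" column still requires reading the section.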

> **For the mapping of sections to EAROS criteria and gate types**, see `references/section-rubric-mapping.md`.

---

## Step 2 — Section-by-Section Guidance

Walk the user through each criterion in the loaded rubric. Format for each criterion:

**[Criterion ID] — [criterion question]**

> **Why this matters:** [1–2 sentences on the real quality concern this criterion encodes — explain the consequence of getting it wrong, not just what to include]

> **⚠️ GATE** (if `gate.enabled: true`):
> - `major`: "Scoring below 2 here prevents a Pass status."
> - `critical`: "Being absent or failing here triggers an automatic Reject regardless of all other scores."

> **What you need:** [from `required_evidence` in the rubric]

> **Strong evidence looks like:** [from `examples.good` in the rubric]

> **Common mistakes:** [from `anti_patterns`]

> **Prompt:** "Does your draft include [specific thing]? Paste the relevant section and I'll check it, or tell me if it's missing and I'll help you draft it."

Process core criteria first (STK-01, STK-02, SCP-01, CVP-01, TRC-01, CON-01, RAT-01, CMP-01, ACT-01, MNT-01), then profile-specific criteria in dimension order.
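Rendering the per-criterion block above from a parsed rubric entry might look like the following sketch. The field names (`gate`, `required_evidence`, `examples.good`, `anti_patterns`) are the ones this skill references; the exact YAML schema may differ, so treat the dict shape as an assumption:

```python
# Gate wording taken verbatim from the template above.
GATE_TEXT = {
    "major": "Scoring below 2 here prevents a Pass status.",
    "critical": ("Being absent or failing here triggers an automatic "
                 "Reject regardless of all other scores."),
}

def render_criterion(c: dict) -> str:
    """Format one criterion as the guidance block shown above.
    Assumes a parsed rubric entry; the key names are this skill's, the
    schema is hypothetical."""
    lines = [f"**{c['id']} — {c['question']}**", ""]
    lines.append(f"> **Why this matters:** {c['rationale']}")
    gate = c.get("gate", {})
    if gate.get("enabled"):
        severity = gate["severity"]
        lines.append(f"> **⚠️ GATE** ({severity}): {GATE_TEXT[severity]}")
    lines.append(f"> **What you need:** {c['required_evidence']}")
    lines.append(f"> **Strong evidence looks like:** {c['examples']['good']}")
    lines.append(f"> **Common mistakes:** {c['anti_patterns']}")
    return "\n".join(lines)
```

The gate callout is emitted only when `gate.enabled` is true, matching the conditional in the template.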

> **For section-to-criterion mappings and score 2 vs. 3 boundaries**, read `references/section-rubric-mapping.md`. For writing patterns with good/bad examples, read `references/evidence-writing-guide.md`.

---

## Step 3 — Section Drafting Help

When the user provides content or asks for help drafting:

1. Identify which EAROS criteria the content addresses
2. Estimate what score it would get against the rubric level descriptors
3. Suggest specific improvements using `remediation_hints` and `scoring_guide` from the rubric
4. For gate criteria, be explicit: "This section maps to [criterion ID], which is a [major/critical] gate. Here is exactly what's needed to clear it."

Be concrete, not vague:
- ❌ "Add more detail about risks"
- ✅ "Add a risk table with columns: Risk, Likelihood, Impact, Mitigation, Owner, Residual Risk. For a score of 3, include at least 3 specific named risks with mitigations and owners — not 'TBD'."

> **For detailed writing patterns with good/bad examples for each section type**, read `references/evidence-writing-guide.md`.

---

## Step 4 — Pre-Submission Checklist

Before the user submits, run through this checklist:

```
EAROS Pre-Submission Checklist
================================
Core criteria:
[ ] STK-01: Named stakeholders with specific concerns stated
[ ] SCP-01: Explicit scope, out-of-scope list, assumptions, constraints <- GATE
[ ] CVP-01: Views chosen for stated stakeholder concerns
[ ] TRC-01: Architecture decisions traceable to business drivers
[ ] CON-01: Consistent terminology across all sections and diagrams
[ ] RAT-01: Risk table with mitigations and owners <- GATE
[ ] CMP-01: Named controls mapped to design elements <- GATE
[ ] ACT-01: Decision statement and named actions with owners
[ ] MNT-01: Named owner, version, last-updated date

Profile criteria:
[Add profile-specific criteria from the loaded profile, flagging gates]

Gate summary:
[ ] No critical gate criteria are empty or failed
[ ] No major gate criteria are likely below score 2

Evidence readiness:
[ ] Every significant claim is stated explicitly (not implied)
[ ] All components have consistent names across all diagrams
[ ] All diagrams have legends or annotations
```

For any unchecked items, offer to help draft the missing content.

---

## Step 5 — Score Estimate

After reviewing the draft, provide an estimated score:

```
Estimated EAROS Score
======================
Criterion | Est. Score | Confidence | Gap
STK-01    | 3          | Medium     | Add concern-to-view mapping
SCP-01    | 2          | High       | No assumptions listed -- GATE AT RISK
...

Overall estimate: ~[X.X]
Likely status: [Pass | Conditional Pass | Rework Required]

Top 3 improvements before submission:
1. [most impactful, specific action]
2. [second]
3. [third]
```
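The gate semantics from Step 2 and the status estimate above can be combined into one computation. The gate behaviour (critical fail → Reject, major below 2 → no Pass) is as stated in this skill; the 2.5 and 2.0 average thresholds are illustrative assumptions, not EAROS-defined numbers:

```python
def estimate_status(scores: dict[str, int], gates: dict[str, str]) -> str:
    """Estimate likely review status.
    scores: criterion id -> estimated 0-3 score.
    gates: criterion id -> "major" or "critical", for gate criteria only.
    Thresholds below are illustrative; only the gate rules come from
    this skill."""
    # Critical gate absent or failed: automatic Reject, per Step 2.
    for cid, severity in gates.items():
        if severity == "critical" and scores.get(cid, 0) == 0:
            return "Reject"
    # Major gate below 2 prevents a Pass regardless of other scores.
    blocked = any(
        severity == "major" and scores.get(cid, 0) < 2
        for cid, severity in gates.items()
    )
    average = sum(scores.values()) / len(scores)
    if not blocked and average >= 2.5:
        return "Pass"
    return "Conditional Pass" if average >= 2.0 else "Rework Required"
```

Note that a missing criterion counts as score 0, which is what makes an empty critical-gate section fatal on its own.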

---

## Non-Negotiable Rules

1. **Never compromise rigor for politeness.** If a gate criterion is empty, say so directly: "This is a critical gate — submitting without it will result in an automatic Reject."
2. **Reference actual rubric criteria.** Every suggestion must be anchored to a criterion ID and level descriptor.
3. **Distinguish gate from non-gate.** Clearly communicate which gaps are fatal vs. which reduce the score.
4. **Show examples, not descriptions.** Always show what strong evidence looks like (from `examples.good`) rather than just describing what to include.
5. **Three evaluation types are distinct.** Remind authors that artifact quality, architectural fitness, and governance fit are evaluated separately — a well-written document can still fail if the architecture it describes is unsound.

---

## When to Read Which Reference File

| When | Read |
|------|------|
| Mapping document sections to criteria | `references/section-rubric-mapping.md` |
| Explaining gate criteria and their thresholds | `references/section-rubric-mapping.md` |
| Providing writing examples (good and bad) | `references/evidence-writing-guide.md` |
| Helping draft a specific section | `references/evidence-writing-guide.md` |
| Explaining score 2 vs. 3 differences | `references/section-rubric-mapping.md` |
| Author asks "what does strong evidence look like?" | `references/evidence-writing-guide.md` |
|