@trohde/earos 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (135)
  1. package/README.md +156 -0
  2. package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
  3. package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
  4. package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
  5. package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
  6. package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
  7. package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
  8. package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
  9. package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
  10. package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
  11. package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
  12. package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
  13. package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
  14. package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
  15. package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
  16. package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
  17. package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
  18. package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
  19. package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
  20. package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
  21. package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
  22. package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
  23. package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
  24. package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
  25. package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
  26. package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
  27. package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
  28. package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
  29. package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
  30. package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
  31. package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
  32. package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
  33. package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
  34. package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
  35. package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
  36. package/assets/init/.claude/CLAUDE.md +4 -0
  37. package/assets/init/AGENTS.md +293 -0
  38. package/assets/init/CLAUDE.md +635 -0
  39. package/assets/init/README.md +507 -0
  40. package/assets/init/calibration/gold-set/.gitkeep +0 -0
  41. package/assets/init/calibration/results/.gitkeep +0 -0
  42. package/assets/init/core/core-meta-rubric.yaml +643 -0
  43. package/assets/init/docs/consistency-report.md +325 -0
  44. package/assets/init/docs/getting-started.md +194 -0
  45. package/assets/init/docs/profile-authoring-guide.md +51 -0
  46. package/assets/init/docs/terminology.md +126 -0
  47. package/assets/init/earos.manifest.yaml +104 -0
  48. package/assets/init/evaluations/.gitkeep +0 -0
  49. package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
  50. package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
  51. package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
  52. package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
  53. package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
  54. package/assets/init/overlays/data-governance.yaml +94 -0
  55. package/assets/init/overlays/regulatory.yaml +154 -0
  56. package/assets/init/overlays/security.yaml +92 -0
  57. package/assets/init/profiles/adr.yaml +225 -0
  58. package/assets/init/profiles/capability-map.yaml +223 -0
  59. package/assets/init/profiles/reference-architecture.yaml +426 -0
  60. package/assets/init/profiles/roadmap.yaml +205 -0
  61. package/assets/init/profiles/solution-architecture.yaml +227 -0
  62. package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
  63. package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
  64. package/assets/init/research/reference-architecture-research.md +751 -0
  65. package/assets/init/standard/EAROS.md +1426 -0
  66. package/assets/init/standard/schemas/artifact.schema.json +1295 -0
  67. package/assets/init/standard/schemas/artifact.uischema.json +65 -0
  68. package/assets/init/standard/schemas/evaluation.schema.json +284 -0
  69. package/assets/init/standard/schemas/rubric.schema.json +383 -0
  70. package/assets/init/templates/evaluation-record.template.yaml +58 -0
  71. package/assets/init/templates/new-profile.template.yaml +65 -0
  72. package/bin.js +188 -0
  73. package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
  74. package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
  75. package/dist/assets/arc-CyDBhtDM.js +1 -0
  76. package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
  77. package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
  78. package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
  79. package/dist/assets/channel-CiySTNoJ.js +1 -0
  80. package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
  81. package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
  82. package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
  83. package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
  84. package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
  85. package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
  86. package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
  87. package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
  88. package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
  89. package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
  90. package/dist/assets/clone-ylgRbd3D.js +1 -0
  91. package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
  92. package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
  93. package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
  94. package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
  95. package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
  96. package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
  97. package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
  98. package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
  99. package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
  100. package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
  101. package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
  102. package/dist/assets/graph-DLQn37b-.js +1 -0
  103. package/dist/assets/index-BFFITMT8.js +650 -0
  104. package/dist/assets/index-H7f6VTz1.css +1 -0
  105. package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
  106. package/dist/assets/init-Gi6I4Gst.js +1 -0
  107. package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
  108. package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
  109. package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
  110. package/dist/assets/katex-B1X10hvy.js +261 -0
  111. package/dist/assets/layout-C0dvb42R.js +1 -0
  112. package/dist/assets/linear-j4a8mGj7.js +1 -0
  113. package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
  114. package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
  115. package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
  116. package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
  117. package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
  118. package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
  119. package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
  120. package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
  121. package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
  122. package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
  123. package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
  124. package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
  125. package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
  126. package/dist/index.html +23 -0
  127. package/export-docx.js +1583 -0
  128. package/init.js +353 -0
  129. package/manifest-cli.mjs +207 -0
  130. package/package.json +83 -0
  131. package/schemas/artifact.schema.json +1295 -0
  132. package/schemas/artifact.uischema.json +65 -0
  133. package/schemas/evaluation.schema.json +284 -0
  134. package/schemas/rubric.schema.json +383 -0
  135. package/serve.js +238 -0
`package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md`
@@ -0,0 +1,281 @@
# Scoring Protocol — EAROS Assessment

This file contains the detailed evidence and scoring guidance for EAROS evaluators. Read it before Step 2 (content extraction), and return to it any time scoring feels ambiguous.

---

## The RULERS Protocol — Evidence-Anchored Scoring

RULERS names the core principle: **every score must be anchored to a retrievable unit of evidence from the artifact**. This prevents the single most common failure in architecture assessment — scoring from overall impression rather than from specific evidence.

### Why it matters

An agent (or human) that scores from impression tends to:
- Over-score well-written artifacts (fluent prose feels like good architecture)
- Under-score technical artifacts (dense notation looks like poor communication)
- Produce unreproducible scores (two reviewers get different results on the same artifact)

RULERS scoring is reproducible because it ties every score to a specific excerpt. If two reviewers disagree, they can compare evidence anchors rather than arguing from impressions.

### The three evidence classes

Before assigning any score, classify the evidence you found:

| Class | Definition | Credibility |
|-------|------------|-------------|
| `observed` | Directly supported by a quote or excerpt from the artifact | Highest — the artifact says it |
| `inferred` | Reasonable interpretation not directly stated, but logically implied | Medium — the artifact implies it |
| `external` | Judgment based on a standard, policy, or pattern outside the artifact | Lowest — you brought the knowledge |

**The discipline:** If you cannot find at least `inferred` evidence, you cannot score the criterion above 1. A score of 0 or 1 with `evidence_class: none` is legitimate — it means the artifact does not address the criterion.

### Evidence extraction steps (for each criterion)

1. Read the criterion's `required_evidence` list in the rubric YAML
2. Search the artifact specifically for each piece of required evidence
3. Record:
   - `evidence_anchor`: where in the artifact (section heading, page number, diagram label, appendix name)
   - `excerpt`: direct quote or very close paraphrase — must be retrievable, not invented
   - `evidence_class`: observed / inferred / external / none

---

## How to Use scoring_guide and decision_tree

The rubric YAML for each criterion contains two fields that tell you exactly what each score means.

### scoring_guide

The `scoring_guide` gives one-sentence level descriptors for scores 0–4. These are the authoritative definitions. Examples from the core rubric:

**STK-01 (stakeholder identification):**
```
"0": Absent or contradicted
"1": Implied only
"2": Explicit but incomplete
"3": Explicit and mostly complete
"4": Explicit, complete, and used consistently
```

**SCP-01 (scope and boundaries):**
```
"0": No scope or boundary
"1": Scope is ambiguous
"2": Basic scope exists but is incomplete
"3": Scope and boundaries are clear
"4": Scope and boundaries are clear, tested, and internally consistent
```

Use these as your first reference. If the artifact clearly matches one level, assign that score.

### decision_tree

The `decision_tree` translates the scoring guide into a sequence of observable conditions. Use it when the scoring guide alone doesn't resolve the case. Example from SCP-01:

```
IF no scope section THEN score 0.
IF scope exists but no exclusions listed THEN max score 2.
IF assumptions not stated THEN max score 3.
```

The decision tree is a ceiling, not a floor. It tells you the maximum score given observable conditions.
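To make the ceiling behaviour concrete, here is a minimal Python sketch of the SCP-01 tree above (the boolean flag names are illustrative, not rubric fields):

```python
def scp01_max_score(has_scope_section: bool,
                    lists_exclusions: bool,
                    states_assumptions: bool) -> int:
    """Apply the SCP-01 decision tree as a ceiling on the score.

    Each unmet condition caps the maximum; the actual score is then
    justified against the scoring_guide within that ceiling.
    """
    if not has_scope_section:
        return 0
    if not lists_exclusions:
        return 2
    if not states_assumptions:
        return 3
    return 4  # no ceiling triggered

# Scope section with exclusions but no stated assumptions: ceiling is 3
print(scp01_max_score(True, True, False))  # 3
```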

**Working together:** Match the artifact against the scoring guide first. If the case is ambiguous, walk the decision tree as a tiebreaker.

---

## Examples of GOOD Scoring

A good criterion result is specific, grounded, and honest about evidence quality.

### Example 1 — Observed evidence, high confidence

```yaml
criterion_id: SCP-01
score: 3
evidence_class: observed
confidence: high
evidence_anchor: "Section 1.2 Scope"
excerpt: >
  "In scope: Payments service, Notification service, upstream Banking Core API.
  Out of scope: Authentication (handled by IAM platform — see IAM-2024-001),
  analytics pipeline, reporting layer. Assumptions: Banking Core API versioned
  contract stable for 12 months."
rationale: >
  Scope is explicitly bounded with named in-scope and out-of-scope elements,
  and a key assumption is stated with a consequence. Assumptions are present,
  but no in-scope/out-of-scope boundary diagram or consistency test is
  provided, which would be required for a score of 4.
```

Why this is good:
- The excerpt is a direct quote from a specific section
- The rationale explains both why the score is 3 and why it is not 4
- The evidence class is correctly `observed` (directly stated)

### Example 2 — Inferred evidence, medium confidence

```yaml
criterion_id: TRC-01
score: 2
evidence_class: inferred
confidence: medium
evidence_anchor: "Section 2.1 Drivers and Section 4 Architecture Decisions"
excerpt: >
  "Section 2.1 lists availability and time-to-market as drivers. Section 4
  adopts an event-driven pattern described as 'enabling independent scaling'."
rationale: >
  The connection between the availability driver and the event-driven decision
  is implied by the phrase 'enabling independent scaling', but no explicit
  traceability link is made. The decision does not reference the driver ID
  or requirement number. This is inferred, not observed — the link exists,
  but the artifact doesn't draw it explicitly. Score 2 (partial) is appropriate;
  score 3 would require explicit driver-to-decision links.
```

Why this is good:
- Honestly classified as `inferred` rather than `observed`
- The rationale explains the gap clearly and cites the decision tree logic
- Medium confidence correctly reflects the inferential nature of the evidence

### Example 3 — Absent evidence, scoring 0

```yaml
criterion_id: RAT-01
score: 0
evidence_class: none
confidence: high
evidence_anchor: "Entire document reviewed"
excerpt: "No risks, assumptions, constraints, or trade-offs section found."
rationale: >
  The artifact contains no RAID log, no risk section, and no trade-off discussion.
  The word 'risk' appears once in passing ('we need to consider operational risk')
  but with no specific identification, owner, or mitigation. This does not meet
  the scoring_guide level 1 threshold ('mentioned superficially'). Score 0.
```

Why this is good:
- Confidence is high because the absence is certain
- The one occurrence of "risk" is acknowledged but correctly scored as below level 1
- The evidence anchor notes that the whole document was reviewed

---

## Examples of BAD Scoring

### Bad Example 1 — Impression-based scoring

```yaml
criterion_id: CMP-01
score: 3
evidence_class: observed
confidence: high
evidence_anchor: "Section 5"
excerpt: "The architecture appears to be well-aligned with enterprise standards."
rationale: "The document seems to address compliance comprehensively."
```

Problems:
- "appears to be well-aligned" is the evaluator's impression, not a quote from the artifact
- The excerpt is a vague paraphrase, not specific evidence
- No specific standard or control is named
- `evidence_class: observed` is wrong — nothing is directly quoted
- This scores CMP-01 as 3 without evidence of any control mapping; it should score 1 or 2 at most

### Bad Example 2 — Evidence class mismatch

```yaml
criterion_id: STK-01
score: 4
evidence_class: observed
confidence: high
evidence_anchor: "Section 1"
excerpt: "Section 1 contains a list of stakeholders."
rationale: "Stakeholders are identified and the document is clearly structured for them."
```

Problems:
- Score 4 requires stakeholders to be explicit, complete, and used consistently throughout
- "clearly structured for them" is an inference, not an observation — the evidence class should be `inferred`
- The excerpt doesn't quote which stakeholders are listed or how they're used consistently
- A correct score here is probably 2 or 3, depending on the actual content

### Bad Example 3 — Conflating confidence with score

```yaml
criterion_id: ACT-01
score: 1
evidence_class: observed
confidence: low
rationale: "I'm not confident about this score, so I've scored it lower to be safe."
```

Problems:
- Confidence must not modify the score — this is an explicit rule in EAROS
- If the evidence supports score 3 but you're uncertain about interpretation, score 3 with `confidence: low`
- Low confidence tells the human reviewer to apply independent judgment; it doesn't change the number

---

## Challenge Pass Methodology

The challenge pass is Step 6. It exists because agents (and humans) are systematically biased in specific ways:

**Common over-scoring patterns:**
- Rewarding presence over quality (something is mentioned → score 2, regardless of depth)
- Treating fluent prose as evidence of good architecture
- Accepting assertions without checking for evidence backing
- Scoring the intent rather than the artifact as written

**Common under-scoring patterns:**
- Penalizing technically dense artifacts that are well-reasoned
- Requiring explicit labels for things that are clearly present
- Applying a standard that was not in the rubric

### How to run the challenge pass

For each of your three highest and three lowest scores:

1. **Higher challenge:** "What in the artifact would justify scoring this one level higher? Does that evidence exist?"
   - If yes and you missed it → revise up, flag as `revised: true`
   - If no → score confirmed

2. **Lower challenge:** "What would need to be true for this score to be one level lower? Is that condition actually met?"
   - If yes and you overclaimed → revise down, flag as `revised: true`
   - If no → score confirmed

3. **Evidence class check:** "Is my evidence classification honest?"
   - `observed` should be a direct quote or a highly specific reference
   - If you're paraphrasing liberally → downgrade to `inferred`
   - If you're applying outside knowledge → classify as `external`

Mark any score that changes as `revised: true` in the output. This is not a sign of weakness — it's the protocol working correctly.
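Selecting the challenge candidates can be done mechanically. A sketch, assuming criterion results are held in a plain score map (the dict layout is illustrative, not the EAROS output schema):

```python
def challenge_candidates(results: dict, n: int = 3):
    """Pick the n highest- and n lowest-scored criteria for the challenge pass.

    `results` maps criterion_id -> score (0-4).
    Returns (highest, lowest) as lists of criterion IDs.
    """
    ranked = sorted(results, key=results.get, reverse=True)
    return ranked[:n], ranked[-n:]

# Hypothetical assessment results, not real rubric output:
scores = {"STK-01": 4, "SCP-01": 3, "TRC-01": 2, "RAT-01": 0,
          "CON-01": 1, "CMP-01": 3, "ACT-01": 2}
high, low = challenge_candidates(scores)  # challenge these six first
```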

---

## Cross-Reference Validation

Cross-reference validation (Step 4) finds inconsistencies that reduce the reliability of the artifact and directly affect `CON-01` (the internal consistency criterion in the core rubric).

### What to check

**Component naming:** Does the context diagram call it "Payment Service" while the sequence diagram calls it "PaymentProcessor" and the deployment view shows "payment-svc"? These may or may not be the same thing — the artifact should make this explicit.

**Interface consistency:** Does the API spec show a `POST /payments/initiate` endpoint, but the sequence diagram shows `createPayment()` with a different parameter shape? Either one of them is wrong, or they are different things.

**Scope boundary drift:** Does the context diagram show the mobile app as out of scope, but the data flow section describes data flowing through the mobile app? The scope boundary shifted between sections without notice.

**Narrative-diagram agreement:** Does the text say "all traffic passes through the API gateway" but the deployment diagram shows a direct database connection from a service?
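As an illustration of the naming check, a small Python sketch that flags names appearing in only one view (the view names and contents are hypothetical; in practice they come from parsing the artifact's diagrams and narrative):

```python
# Component names as they appear in each view of the artifact.
views = {
    "context_diagram": {"Payment Service", "Notification Service"},
    "sequence_diagram": {"PaymentProcessor", "Notification Service"},
    "deployment_view": {"payment-svc", "notification-svc"},
}

all_names = set().union(*views.values())
suspects = [
    name for name in sorted(all_names)
    if sum(name in components for components in views.values()) == 1
]
# Each suspect appears in only one view: either naming drift
# (record it as CON-01 evidence) or a genuinely view-specific element.
```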

### How to score inconsistencies

Each inconsistency found is evidence for `CON-01`. Use the `decision_tree` for CON-01:

```
IF direct contradictions between sections THEN score 0.
IF same entity has different names or interfaces in multiple views THEN score 1.
IF minor naming inconsistencies only THEN score 2.
IF terminology and interfaces consistent across main views THEN score 3.
IF glossary present AND entities reconciled AND cross-references verified THEN score 4.
```

Note each specific inconsistency as a separate evidence item in the CON-01 result.
`package/assets/init/.agents/skills/earos-calibrate/SKILL.md`
@@ -0,0 +1,153 @@
---
name: earos-calibrate
description: "Run EAROS calibration exercises to validate rubric reliability before production use. Use this skill whenever someone wants to calibrate a rubric, validate inter-rater reliability, compare scores against gold-standard artifacts, measure scoring consistency, or says \"calibrate this rubric\", \"run calibration\", \"check if the rubric is reliable\", \"compare my scores to the gold set\", \"test this profile against examples\", \"is this rubric ready for production\", \"what is our kappa\", \"measure agreement between reviewers\", \"validate a new profile\", or \"how well does the rubric score consistently\". Calibration is required before any new profile can move from draft to candidate status."
---

# EAROS Calibrate Skill

You are running an EAROS calibration exercise. Calibration validates that a rubric produces consistent, reliable scores across reviewers and artifacts before it enters a governance process.

**Why calibration matters:** A rubric that produces inconsistent scores is not a quality gate — it is noise. Without calibration, two reviewers applying the same rubric will score the same artifact differently, governance decisions will be arbitrary, and the framework loses credibility. Calibration makes the rubric trustworthy by measuring and improving its reproducibility.

**Target reliability metrics:**
- Binary agreement (exact match): > 95%
- Ordinal Cohen's κ: > 0.70 for well-defined criteria; > 0.50 for subjective criteria
- Spearman ρ (overall score correlation across artifacts): > 0.80

**Critical:** Do NOT look at gold-set benchmark scores until after completing your independent assessment. True calibration requires independent scoring first.

---

## Step 0 — Load Calibration Inputs

Read these files:
1. `core/core-meta-rubric.yaml`
2. The profile or overlay being calibrated (ask if not specified; scan `profiles/` and `overlays/`)
3. `calibration/gold-set/` — scan for existing reference artifacts and their benchmark scores
4. `calibration/results/` — scan for prior calibration runs (to understand trends)

Ask the user:
- Which rubric/profile is being calibrated? (if not specified)
- Are there artifacts to calibrate against, or should I use the gold-set?
- Solo calibration (agent vs. gold-set) or multi-evaluator reconciliation?

---

## Step 1 — Artifact Inventory

List the available calibration artifacts. For each:
- Artifact ID, title, type
- Expected quality category: strong (≥3.2), adequate (2.4–3.19), weak (<2.4), or borderline
- Known benchmark scores (if prior calibration exists)
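The quality thresholds above can be expressed directly; a minimal sketch (borderline cases remain a judgment call near the cut-offs, so they are not computed here):

```python
def quality_category(overall: float) -> str:
    """Map an overall 0-4 score to the expected quality category."""
    if overall >= 3.2:
        return "strong"
    if overall >= 2.4:
        return "adequate"
    return "weak"

# quality_category(3.4) -> "strong"; quality_category(2.1) -> "weak"
```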

If no gold-set artifacts exist, stop and tell the user:
> "Calibration requires at least 3 artifacts: 1 strong (should score ≥3.2), 1 weak (should score <2.4), and 1 ambiguous (borderline case). The spread across quality levels is important — calibration against only strong artifacts doesn't test whether the rubric correctly identifies weaknesses. Please provide these artifacts or their paths."

---

## Step 2 — Independent Scoring

For each calibration artifact, run a full EAROS assessment using the `earos-assess` skill protocol:
- Follow the full 8-step DAG
- Score every criterion independently **before** looking at gold-set benchmark scores
- Record evidence anchors, evidence classes, confidence, and rationale for every score

**This step cannot be skipped or abbreviated.** Independent scoring is the entire point of calibration. If you score after seeing the benchmark, you measure nothing.

> **For the full assessment protocol**, see `.claude/skills/earos-assess/SKILL.md`.

---

## Step 3 — Score Comparison

After completing independent scoring for all artifacts, compare against the gold-set:

```yaml
artifact_id: [ID]
criterion_id: [ID]
gold_score: [benchmark]
agent_score: [your score]
delta: [gold - agent]  # positive = agent under-scored; negative = agent over-scored
delta_abs: [abs(delta)]
agreement: exact | within_1 | disagreement  # disagreement = delta_abs >= 2
evidence_quality_match: yes | partial | no
```
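The numeric fields of this record follow mechanically from the two scores; a minimal Python sketch:

```python
def compare(gold: int, agent: int) -> dict:
    """Derive delta, delta_abs and the agreement category for one criterion."""
    delta = gold - agent  # positive = agent under-scored
    d = abs(delta)
    if d == 0:
        agreement = "exact"
    elif d == 1:
        agreement = "within_1"
    else:
        agreement = "disagreement"
    return {"delta": delta, "delta_abs": d, "agreement": agreement}
```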

> **Read `references/calibration-protocol.md`** for the full comparison procedure and for how to handle cases where you believe the gold-set benchmark may itself be wrong.

---

## Step 4 — Agreement Metric Computation

> **Read `references/agreement-metrics.md`** before this step. It contains the formulas, computation steps, and interpretation guidance.

Key metrics to compute:

**Binary agreement:** (exact matches) / (total scored criteria)

**Per-criterion reliability flag:**
- `reliable`: max_delta ≤ 1 across all artifacts
- `moderate`: max_delta = 2 in isolated cases
- `unreliable`: max_delta ≥ 2 systematically
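A sketch of the flag logic, reading "systematically" as more than one artifact with a 2+ point miss (that reading is an assumption; adjust it to your calibration protocol):

```python
def reliability_flag(deltas_abs) -> str:
    """Classify one criterion from its |delta| values across all artifacts."""
    big = sum(d >= 2 for d in deltas_abs)  # count of 2+-point misses
    if big == 0:
        return "reliable"    # never off by more than 1
    if big == 1:
        return "moderate"    # an isolated 2-point miss
    return "unreliable"      # repeated large misses
```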

**Overall Spearman ρ:** rank correlation of overall scores across artifacts

**Verdict:** `pass_for_production` / `borderline` / `not_ready`

---

## Step 5 — Root Cause Analysis

For each `disagreement` (delta ≥ 2) or `unreliable` criterion, investigate:

1. **Ambiguous level descriptor?** — Does the `scoring_guide` clearly distinguish adjacent levels?
2. **Missing decision tree?** — Does the criterion have a `decision_tree`?
3. **Evidence classification issue?** — Is the disagreement about observed vs. inferred evidence?
4. **Anti-pattern match?** — Did the artifact exhibit an anti-pattern that the gold-set scored differently?
5. **Context sensitivity?** — Does this criterion mean something different for different artifact sub-types?

For each root cause, recommend a specific rubric improvement (which field to change, and how).

---

## Step 6 — Calibration Report

Save the report to `calibration/results/[rubric-id]-calibration-[YYYY-MM-DD].yaml`

> **Read `references/calibration-protocol.md#report-format`** for the full YAML and markdown report templates.

---

## Step 7 — Recalibration Triggers

After saving results, check whether recalibration is needed sooner than the standard 6-month cycle. Recalibrate when:
- Profile criteria change materially (version bump)
- A new overlay is introduced
- Agreement drops below targets on any criterion
- New artifact formats appear (new diagramming tools, document formats)
- The agent model changes materially
- Governance expectations change

Tell the user: "Schedule next calibration check: [6 months from today or at next profile revision]."

---

## Non-Negotiable Rules

1. **Score independently first.** Never look at the gold-set before producing your own assessment.
2. **Don't calibrate to pass.** If you systematically disagree with the gold-set, flag it — the gold-set may need review too.
3. **Unreliable criteria must be fixed before production.** κ < 0.50 = not usable in governance.
4. **Calibration is ongoing.** A profile that passes today must be recalibrated after any material change.

---

## When to Read Which Reference File

| When | Read |
|------|------|
| Before Step 3 (always) | `references/calibration-protocol.md` |
| Computing agreement metrics (Step 4) | `references/agreement-metrics.md` |
| Interpreting κ values | `references/agreement-metrics.md#interpretation` |
| Writing the calibration report | `references/calibration-protocol.md#report-format` |
| Investigating disagreements (Step 5) | `references/calibration-protocol.md#root-cause-analysis` |
| Unsure if a gold-set benchmark is correct | `references/calibration-protocol.md#gold-set-disagreement` |
`package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md`
@@ -0,0 +1,188 @@
+ # Agreement Metrics — EAROS Calibration
+
+ This file explains the agreement metrics used in EAROS calibration, how to compute them, and how to interpret the results. Read this before Step 4 (agreement metric computation).
+
+ ---
+
+ ## Why These Metrics?
+
+ EAROS uses three complementary metrics because each answers a different question:
+
+ - **Binary agreement** answers: "How often do we get the exact same score?"
+ - **Cohen's κ (ordinal/weighted)** answers: "Is our agreement better than random chance, accounting for the ordinal scale?"
+ - **Spearman ρ** answers: "Do we agree on which artifacts are better than which others?"
+
+ A rubric that passes all three is reliable. A rubric that passes only one may have a specific problem (e.g., good rank-ordering but systematic score inflation).
+
+ ---
+
+ ## Metric 1 — Binary Agreement (Exact Match Rate)
+
+ **What it measures:** Percentage of criteria where agent score = gold-set score exactly.
+
+ **Formula:**
+ ```
+ binary_agreement = (count of exact matches) / (total scored criteria) × 100%
+ ```
+
+ **Target:** > 95%
+
+ **Computation:**
+ 1. For each criterion in each artifact, compare `agent_score` to `gold_score`
+ 2. Count exact matches (delta = 0)
+ 3. Divide by total scored criteria (exclude N/A criteria from both sets)
+
+ **Example:**
+ ```
+ Criteria scored:   30 (3 artifacts × 10 criteria)
+ Exact matches:     24
+ Binary agreement:  24/30 = 80% ← BELOW TARGET — investigate
+ ```
+
+ **Interpretation:**
+ - Above 95%: Excellent — rubric is highly reproducible
+ - 85–95%: Acceptable — minor rubric refinements may help
+ - Below 85%: Concerning — systematic disagreements; investigate root causes
+
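Under the definitions above, Metric 1 reduces to a few lines. A minimal Python sketch (function and data names are illustrative; `None` stands for an N/A criterion):

```python
# Exact-match rate over paired criterion scores.
# None marks an N/A criterion; N/A entries are excluded from both sets.
def binary_agreement(agent_scores, gold_scores):
    pairs = [(a, g) for a, g in zip(agent_scores, gold_scores)
             if a is not None and g is not None]
    matches = sum(1 for a, g in pairs if a == g)
    return 100.0 * matches / len(pairs)

# Mirrors the worked example: 30 scored criteria, 24 exact matches
agent = [2] * 30
gold = [2] * 24 + [3] * 6
print(binary_agreement(agent, gold))  # 80.0
```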
+ ---
+
+ ## Metric 2 — Ordinal Cohen's κ (Weighted Kappa) {#interpretation}
+
+ **What it measures:** Agreement corrected for chance, using linear weights that give partial credit for near-misses (delta = 1 counts as partial agreement).
+
+ **Why weighted (not simple) kappa:** EAROS uses a 0–4 ordinal scale. Disagreeing by 1 point (scoring 2 vs. 3) is less serious than disagreeing by 2 points (scoring 1 vs. 3). Weighted kappa captures this.
+
+ **Linear weight matrix (5-point scale):**
+ ```
+           Gold 0  Gold 1  Gold 2  Gold 3  Gold 4
+ Agent 0:   1.00    0.75    0.50    0.25    0.00
+ Agent 1:   0.75    1.00    0.75    0.50    0.25
+ Agent 2:   0.50    0.75    1.00    0.75    0.50
+ Agent 3:   0.25    0.50    0.75    1.00    0.75
+ Agent 4:   0.00    0.25    0.50    0.75    1.00
+ ```
+
+ **Computing kappa without a statistics library (simplified):**
+
+ Step 1: Build the confusion matrix (rows = agent scores, columns = gold scores, cells = counts).
+
+ Step 2: Compute observed weighted agreement:
+ ```
+ P_obs = sum(weight[i,j] × n[i,j]) / N
+ where n[i,j] = count at row i, col j; N = total observations
+ ```
+
+ Step 3: Compute expected weighted agreement (under chance, based on the marginals):
+ ```
+ P_exp = sum(weight[i,j] × row_marginal[i] × col_marginal[j]) / N²
+ ```
+
+ Step 4: κ = (P_obs - P_exp) / (1 - P_exp)
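The four steps can be collapsed into one short function. A Python sketch, assuming integer scores on the 0–4 scale and two lists aligned per criterion (names are illustrative):

```python
# Linearly weighted Cohen's kappa, following Steps 1-4 above.
# Undefined when P_exp == 1 (all scores fall in a single category).
def weighted_kappa(agent_scores, gold_scores, k=5):
    N = len(agent_scores)
    # Step 1: confusion matrix (rows = agent, cols = gold)
    n = [[0] * k for _ in range(k)]
    for a, g in zip(agent_scores, gold_scores):
        n[a][g] += 1
    # Linear weights: w[i][j] = 1 - |i - j| / (k - 1)
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    row = [sum(n[i][j] for j in range(k)) for i in range(k)]  # agent marginals
    col = [sum(n[i][j] for i in range(k)) for j in range(k)]  # gold marginals
    # Step 2: observed weighted agreement
    p_obs = sum(w[i][j] * n[i][j] for i in range(k) for j in range(k)) / N
    # Step 3: expected weighted agreement under chance
    p_exp = sum(w[i][j] * row[i] * col[j]
                for i in range(k) for j in range(k)) / N ** 2
    # Step 4
    return (p_obs - p_exp) / (1 - p_exp)
```

Perfect agreement on varied scores gives κ = 1.0; perfectly reversed scores give a negative κ, as expected for worse-than-chance agreement.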
+
+ **Simplified approximation for small samples (3 artifacts):**
+ - All deltas ≤ 1: κ likely > 0.70 (reliable)
+ - Some deltas = 2: κ likely 0.50–0.70 (moderate)
+ - Any delta = 3–4: κ likely < 0.50 (unreliable)
+
+ **Target thresholds:**
+ - κ > 0.70: Substantial agreement — criterion reliable for governance use
+ - κ 0.50–0.70: Moderate — acceptable for subjective criteria; sharpen level descriptors
+ - κ < 0.50: Poor — criterion needs revision before production use
+
+ **EAROS-specific expectations:**
+ - Criteria with clear observable conditions (SCP-01, MNT-01): typically κ > 0.75
+ - Criteria requiring judgment (RAT-01 trade-off quality): typically κ 0.55–0.70
+ - Criteria evaluating depth (CMP-01 compliance treatment): hardest — target κ > 0.50
+
+ ---
+
+ ## Metric 3 — Spearman Rank Correlation (ρ)
+
+ **What it measures:** Whether agent and gold-set agree on the rank ordering of artifacts by quality.
+
+ **Why it matters:** Even if absolute scores differ, a rubric is useful if it reliably distinguishes strong artifacts from weak ones. Good rank correlation means the rubric correctly identifies which artifacts need more work.
+
+ **Formula:**
+ ```
+ ρ = 1 - (6 × Σd²) / (n × (n² - 1))
+ where d = rank difference for each artifact; n = number of artifacts
+ ```
+
+ **Example (3 artifacts, perfect ordering):**
+ ```
+ Artifact   Gold_rank  Agent_rank   d   d²
+ Strong         1          1        0   0
+ Adequate       2          2        0   0
+ Weak           3          3        0   0
+ ρ = 1 - (6×0)/(3×8) = 1.00
+ ```
+
+ **Example with rank reversal:**
+ ```
+ Artifact   Gold_rank  Agent_rank   d   d²
+ Strong         1          2       -1   1
+ Adequate       2          1        1   1
+ Weak           3          3        0   0
+ ρ = 1 - (6×2)/(3×8) = 1 - 0.5 = 0.50 ← Moderate
+ ```
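Both worked examples can be checked with a direct transcription of the formula. A sketch, assuming precomputed ranks with no ties (names are illustrative):

```python
# Spearman rho from rank lists (no ties), per the formula above.
def spearman_rho(gold_ranks, agent_ranks):
    n = len(gold_ranks)
    d2 = sum((g - a) ** 2 for g, a in zip(gold_ranks, agent_ranks))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

print(spearman_rho([1, 2, 3], [1, 2, 3]))  # perfect ordering: 1.0
print(spearman_rho([1, 2, 3], [2, 1, 3]))  # adjacent swap: 0.5
```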
+
+ **Target:** ρ > 0.80
+
+ **Interpretation:**
+ - ρ > 0.80: Good ordering — rubric reliably distinguishes quality levels
+ - ρ 0.60–0.80: Moderate — some rank reversals; investigate the reversed artifacts
+ - ρ < 0.60: Concerning — systematic misclassification of quality levels
+
+ ---
+
+ ## Per-Criterion Reliability Flags
+
+ After computing metrics, flag each criterion:
+
+ ```
+ criterion_id | mean_delta | max_delta | binary_agreement | reliability_flag
+ CRITERION-01 |    0.2     |     1     |       93%        | reliable
+ CRITERION-02 |    0.8     |     2     |       73%        | moderate
+ CRITERION-03 |    1.5     |     3     |       33%        | unreliable
+ ```
+
+ **Flag definitions:**
+ - `reliable`: max_delta ≤ 1 across all artifacts; binary agreement ≥ 90%
+ - `moderate`: max_delta = 2 in some cases; binary agreement 70–90%
+ - `unreliable`: max_delta ≥ 2 systematically; binary agreement < 70%
+
+ Criteria flagged `unreliable` must be revised before production use.
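The flag definitions can be applied mechanically. A sketch, assuming per-artifact deltas (|agent − gold|) for one criterion; interpreting "systematically" as a majority of artifacts is an assumption for illustration, not EAROS policy:

```python
# Assigns a reliability flag from per-artifact deltas for one criterion.
def reliability_flag(deltas):
    agreement = 100.0 * sum(1 for d in deltas if d == 0) / len(deltas)
    big = sum(1 for d in deltas if d >= 2)
    if max(deltas) <= 1 and agreement >= 90:
        return "reliable"
    if big > len(deltas) / 2 and agreement < 70:  # delta >= 2 "systematically"
        return "unreliable"
    return "moderate"
```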
+
+ ---
+
+ ## Overall Calibration Verdict
+
+ | Metric | Result | Target | Status |
+ |--------|--------|--------|--------|
+ | Binary agreement | X% | > 95% | pass/warn/fail |
+ | Criteria with κ > 0.70 | X of N | > 80% | pass/warn/fail |
+ | Criteria with κ < 0.50 | N | 0 | pass/warn/fail |
+ | Spearman ρ | X.XX | > 0.80 | pass/warn/fail |
+
+ **Verdict rules:**
+ - `pass_for_production`: All metrics pass; no unreliable criteria
+ - `borderline`: One metric borderline or 1–2 unreliable criteria; review before production
+ - `not_ready`: Multiple metrics fail or 3+ unreliable criteria; rubric revision required
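The verdict rules can be sketched as a function. The per-metric status strings follow the table above; treating a single `fail` as `borderline` is an assumption for illustration:

```python
# Maps per-metric statuses plus the unreliable-criteria count to a verdict.
def calibration_verdict(metric_statuses, unreliable_count):
    fails = metric_statuses.count("fail")
    warns = metric_statuses.count("warn")
    if fails >= 2 or unreliable_count >= 3:
        return "not_ready"            # multiple failures: revise the rubric
    if fails == 1 or warns >= 1 or 1 <= unreliable_count <= 2:
        return "borderline"           # review before production
    return "pass_for_production"
```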
+
+ ---
+
+ ## What to Do with Results
+
+ ### Pass
+ Profile ready for candidate status. Update `status: draft` → `status: candidate`.
+
+ ### Borderline
+ Review the specific criteria that missed targets:
+ 1. Is the `decision_tree` present, and are its branches clear?
+ 2. Does the `scoring_guide` specifically distinguish between levels 2 and 3?
+ 3. Are `examples.good` and `examples.bad` realistic and quotable?
+
+ Update flagged criteria and re-run calibration on those criteria only.
+
+ ### Not Ready
+ Do not advance to candidate. Revise the rubric per the root-cause analysis in `references/calibration-protocol.md#root-cause-analysis`, then re-run full calibration.