npm - @harness-engineering/cli - Versions diffs - 1.15.0 → 1.16.0 - Mend

@harness-engineering/cli 1.15.0 → 1.16.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (495) hide show

package/dist/agents/skills/claude-code/harness-autopilot/SKILL.md CHANGED Viewed

@@ -33,6 +33,20 @@ Autopilot orchestrates these persona agents — it never reimplements their logi
 **Plans are gated by concern signals.** When no concern signals fire (low complexity, no planner concerns, task count within threshold), plans are auto-approved with a structured report and execution proceeds immediately. When any signal fires, the plan pauses for human review with the standard yes/revise/skip/stop flow. The `--review-plans` session flag forces all plans to pause regardless of signals.
+## Rigor Levels
+The `rigorLevel` is set during INIT via `--fast` or `--thorough` flags and persists for the entire session. Default is `standard`.
+| State          | `fast`                                                                                 | `standard` (default)                                                    | `thorough`                                                                                                                 |
+| -------------- | -------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
+| PLAN           | Pass `rigorLevel: fast` to planner. Planner skips skeleton pass.                       | Default planner behavior.                                               | Pass `rigorLevel: thorough` to planner. Planner always produces skeleton for approval.                                     |
+| APPROVE_PLAN   | Auto-approve all plans regardless of concern signals. Skip human review.               | Default signal-based approval logic.                                    | Force human review of all plans (equivalent to `--review-plans`).                                                          |
+| EXECUTE        | Skip scratchpad — agents keep research in conversation. Checkpoint commits still fire. | Agents use scratchpad for research >500 words. Checkpoint commits fire. | Verbose scratchpad — agents write all research, reasoning, and intermediate output to scratchpad. Checkpoint commits fire. |
+| VERIFY         | Minimal verification — run `harness validate` only. Skip detailed verification agent.  | Default verification pipeline.                                          | Full verification — run verification agent with expanded checks.                                                           |
+| PHASE_COMPLETE | Scratchpad clear is a no-op (nothing written).                                         | Clear scratchpad for completed phase.                                   | Clear scratchpad for completed phase.                                                                                      |
+When `rigorLevel` is `fast`, the APPROVE_PLAN concern signal evaluation is bypassed entirely — plans always auto-approve. When `rigorLevel` is `thorough`, it implicitly sets `reviewPlans: true` for the APPROVE_PLAN gate.
 ## Process
 ### State Machine
@@ -61,7 +75,7 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
    - Create the session directory if it does not exist
 3. **Check for existing state.** Read `{sessionDir}/autopilot-state.json`. If it exists and `currentState` is not `DONE`:
-   - **Schema migration:** If `schemaVersion < 3`, backfill missing fields: set `startingCommit` to the earliest commit in `history` (or current HEAD if no history), set `decisions` to `[]`, set `finalReview` to `{ "status": "pending", "findings": [], "retryCount": 0 }`. If `schemaVersion < 4`, set `reviewPlans` to `false`. Update `schemaVersion` to `4` and save.
+   - **Schema migration:** If `schemaVersion < 3`, backfill missing fields: set `startingCommit` to the earliest commit in `history` (or current HEAD if no history), set `decisions` to `[]`, set `finalReview` to `{ "status": "pending", "findings": [], "retryCount": 0 }`. If `schemaVersion < 4`, set `reviewPlans` to `false`. Update `schemaVersion` to `4` and save. If `schemaVersion < 5`, set `rigorLevel` to `"standard"`. Update `schemaVersion` to `5` and save.
    - Report: "Resuming autopilot from state `{currentState}`, phase {currentPhase}: {phaseName}."
    - Skip steps 4 and 5 (initial state creation and flag parsing) — these only apply to fresh starts.
    - Skip to the recorded `currentState` and continue from there.
@@ -76,11 +90,12 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
    - Create `{sessionDir}/autopilot-state.json`:
      ```json
      {
-       "schemaVersion": 4,
+       "schemaVersion": 5,
        "sessionDir": ".harness/sessions/<slug>",
        "specPath": "<path to spec>",
        "startingCommit": "<git rev-parse HEAD output>",
        "reviewPlans": false,
+       "rigorLevel": "standard",
        "currentState": "ASSESS",
        "currentPhase": 0,
        "phases": [
@@ -106,7 +121,12 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
      }
      ```
-5. **Parse session flags.** Check CLI arguments for `--review-plans`. If present, set `state.reviewPlans: true` in the state file. This flag persists for the entire session — resuming a session preserves the setting from when it was started (the flag is only read on fresh start, not on resume).
+5. **Parse session flags.** Check CLI arguments for session-level flags. These persist for the entire session -- resuming a session preserves the settings from when it was started (flags are only read on fresh start, not on resume).
+   - `--review-plans`: Set `state.reviewPlans: true`.
+   - `--fast`: Set `state.rigorLevel: "fast"`. Reduces rigor across all phases: skip skeleton approval, skip scratchpad, minimal verification.
+   - `--thorough`: Set `state.rigorLevel: "thorough"`. Increases rigor across all phases: require skeleton approval, verbose scratchpad, full verification.
+   - If neither `--fast` nor `--thorough` is passed, `rigorLevel` defaults to `"standard"`.
+   - If both `--fast` and `--thorough` are passed, reject with error: "Cannot use --fast and --thorough together. Choose one."
 6. **Load context via gather_context.** Use the `gather_context` MCP tool to load all working context efficiently:
@@ -176,10 +196,23 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
        Session directory: {sessionDir}
        Session slug: {sessionSlug}
        Phase description: {phase description from spec}
+       Rigor level: {rigorLevel}
        On startup, call gather_context({ session: "{sessionSlug}" }) to load
        session-scoped learnings, state, and validation context.
+       ## Scratchpad (if rigorLevel is not "fast")
+       For bulky research output (spec analysis, codebase exploration notes,
+       dependency analysis — anything >500 words), write to scratchpad instead
+       of keeping in conversation:
+         writeScratchpad({ session: "{sessionSlug}", phase: "{phaseName}", projectPath: "{projectPath}" }, "research-{topic}.md", content)
+       Reference the scratchpad file path in your conversation instead of
+       inlining the content. This keeps the planning context focused on
+       decisions and task structure.
        Follow the harness-planning skill process exactly. Write the plan to
        docs/plans/{date}-{phase-name}-plan.md. Write {sessionDir}/handoff.json when done.
    ```
@@ -215,7 +248,12 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
    - Effective complexity (original + any override)
    - Concerns array from the planning handoff (`{sessionDir}/handoff.json` field `concerns`, default: `[]` if field is absent)
-2. **Evaluate `shouldPauseForReview`.** Check the following signals in order. If **any** signal is true, pause for human review. If **all** are false, auto-approve.
+2. **Rigor-level override:**
+   - If `rigorLevel` is `"fast"`: Skip the signal evaluation entirely. Auto-approve the plan. Record decision as `"auto_approved_plan_fast"`. Transition directly to EXECUTE.
+   - If `rigorLevel` is `"thorough"`: Force `shouldPauseForReview = true` regardless of other signals (equivalent to `--review-plans`).
+   - If `rigorLevel` is `"standard"`: Proceed with normal signal evaluation below.
+3. **Evaluate `shouldPauseForReview`.** Check the following signals in order. If **any** signal is true, pause for human review. If **all** are false, auto-approve.
    | #   | Signal               | Condition                             | Description                                                                                                                                                                                  |
    | --- | -------------------- | ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -225,7 +263,7 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
    | 4   | `plannerConcerns`    | Handoff `concerns` array is non-empty | Planner flagged specific risks or uncertainties                                                                                                                                              |
    | 5   | `taskCount`          | Plan contains > 15 tasks (i.e., 16+)  | Plan is large enough to warrant human review                                                                                                                                                 |
-3. **Build the signal evaluation result** for reporting and recording:
+4. **Build the signal evaluation result** for reporting and recording:
    ```json
    {
@@ -238,7 +276,7 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
    }
    ```
-4. **If auto-approving (no signals fired):**
+5. **If auto-approving (no signals fired):**
    a. **Emit structured auto-approve report:**
@@ -270,7 +308,7 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
    c. **Transition to EXECUTE** — no human interaction needed.
-5. **If pausing for review (one or more signals fired):**
+6. **If pausing for review (one or more signals fired):**
    a. **Emit structured pause report** showing which signal(s) triggered:
@@ -312,7 +350,7 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
    Use the actual decision value: `approved_plan`, `revised_plan`, `skipped_phase`, or `stopped`.
-6. **Update state** with `currentState: "EXECUTE"` (or appropriate state for skip/stop) and save.
+7. **Update state** with `currentState: "EXECUTE"` (or appropriate state for skip/stop) and save.
 ---
@@ -331,16 +369,35 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
        Session directory: {sessionDir}
        Session slug: {sessionSlug}
        State: {sessionDir}/state.json
+       Rigor level: {rigorLevel}
        On startup, call gather_context({ session: "{sessionSlug}" }) to load
        session-scoped learnings, state, and validation context.
+       ## Scratchpad (if rigorLevel is not "fast")
+       For bulky intermediate output (test output analysis, error investigation
+       notes, dependency trees — anything >500 words), write to scratchpad:
+         writeScratchpad({ session: "{sessionSlug}", phase: "{phaseName}", projectPath: "{projectPath}" }, "task-{N}-{topic}.md", content)
+       Reference the scratchpad file path instead of inlining the content.
        Follow the harness-execution skill process exactly.
        Update {sessionDir}/state.json after each task.
        Write {sessionDir}/handoff.json when done or when blocked.
    ```
 2. **When the agent returns, check the outcome:**
+   - **After each checkpoint verification passes**, commit the work:
+     ```
+     commitAtCheckpoint({
+       projectPath: "{projectPath}",
+       session: "{sessionSlug}",
+       checkpointLabel: "Checkpoint {N}: {checkpoint description}"
+     })
+     ```
+     If the commit result shows `committed: false`, no changes existed — continue silently.
    - **All tasks complete:** Transition to VERIFY.
    - **Checkpoint reached:** Surface the checkpoint to the user in the main conversation. Handle the checkpoint type:
      - `[checkpoint:human-verify]` — Show output, ask for confirmation, then resume execution agent.
@@ -357,6 +414,16 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
      - **Attempt 2:** Expand context — read related files, check `learnings.md` for similar failures, re-dispatch with additional context.
      - **Attempt 3:** Full context gather — read test output, imports, plan instructions for ambiguity. Re-dispatch with maximum context.
    - If budget exhausted:
+     - **Recovery commit:** Before stopping, commit any passing work:
+       ```
+       commitAtCheckpoint({
+         projectPath: "{projectPath}",
+         session: "{sessionSlug}",
+         checkpointLabel: "Phase {N}: {name} — recovery at task {taskNumber}",
+         isRecovery: true
+       })
+       ```
+       This preserves all work completed before the failure. The `[autopilot][recovery]` prefix in the commit message distinguishes recovery commits from normal checkpoint commits.
      - **Stop.** Present all 3 attempts with full context to the user.
      - Record failure in `.harness/failures.md`.
      - Ask: "How should we proceed? (fix manually and continue / revise plan / stop)"
@@ -368,7 +435,11 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
 ### VERIFY — Post-Execution Validation
-1. **Dispatch verification agent using the Agent tool:**
+1. **Rigor-level branching:**
+   - If `rigorLevel` is `"fast"`: Skip the verification agent entirely. Run only `harness validate`. If it passes, transition to REVIEW. If it fails, surface to user.
+   - If `rigorLevel` is `"thorough"` or `"standard"`: Dispatch the verification agent as below.
+2. **Dispatch verification agent using the Agent tool:**
    ```
    Agent tool parameters:
@@ -387,14 +458,14 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
        Report pass/fail with findings.
    ```
-2. **When the agent returns:**
+3. **When the agent returns:**
    - **All checks pass:** Transition to REVIEW.
    - **Failures found:** Surface findings to the user. Ask: "Fix these issues before review? (fix / skip verification / stop)"
      - **fix** — Re-enter EXECUTE with targeted fixes (retry budget resets for verification fixes).
      - **skip** — Record skip decision in `decisions` array. Proceed to REVIEW with verification warnings noted.
      - **stop** — Save state and exit.
-3. **Update state** with `currentState: "REVIEW"` and save.
+4. **Update state** with `currentState: "REVIEW"` and save.
 ---
@@ -458,9 +529,11 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
 3. **Mark phase as `complete`** in state.
-4. **Sync roadmap.** If `docs/roadmap.md` exists, call `manage_roadmap` with action `sync` and `apply: true`. This reflects the just-completed phase in the roadmap (e.g., updating the feature from `planned` to `in-progress`). If `manage_roadmap` is unavailable, fall back to direct file manipulation using `syncRoadmap()` from core. Skip silently if no roadmap exists. Do not use `force_sync: true` — the human-always-wins rule applies.
+4. **Clear scratchpad for this phase.** Call `clearScratchpad({ session: sessionSlug, phase: phaseName, projectPath: projectPath })` to delete ephemeral research files for the completed phase. This frees disk space and prevents stale scratchpad data from leaking into future phases.
-5. **Write session summary.** Update the session summary to reflect the completed phase:
+5. **Sync roadmap.** If `docs/roadmap.md` exists, call `manage_roadmap` with action `sync` and `apply: true`. This reflects the just-completed phase in the roadmap (e.g., updating the feature from `planned` to `in-progress`). If `manage_roadmap` is unavailable, fall back to direct file manipulation using `syncRoadmap()` from core. Skip silently if no roadmap exists. Do not use `force_sync: true` — the human-always-wins rule applies.
+6. **Write session summary.** Update the session summary to reflect the completed phase:
    ```json
    writeSessionSummary(projectPath, sessionSlug, {
@@ -476,7 +549,7 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
    })
    ```
-6. **Check for next phase:**
+7. **Check for next phase:**
    - If more phases remain: "Phase {N} complete. Next: Phase {N+1}: {name} (complexity: {level}). Continue? (yes / stop)"
      - **yes** — Increment `currentPhase`, reset `retryBudget`, transition to ASSESS.
      - **stop** — Save state and exit.
@@ -626,6 +699,9 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
 - **Learnings** — `.harness/learnings.md` (global) is appended by both delegated skills and autopilot itself. On DONE, session learnings with generalizable outcomes are promoted to global via `promoteSessionLearnings`. If global count exceeds 30, autopilot suggests running `harness learnings prune`.
 - **Roadmap context** — During INIT, reads `docs/roadmap.md` (if present) for project-level priorities, blockers, and milestone status. Provides broader context for phase execution decisions.
 - **Roadmap sync** — During PHASE_COMPLETE, calls `manage_roadmap` with `sync` and `apply: true` to reflect phase progress. During DONE, calls `manage_roadmap` with `update` to set feature status to `done`. Both skip silently when no roadmap exists. Neither uses `force_sync: true`.
+- **Scratchpad** — Agents write bulky research output (>500 words) to `.harness/sessions/<slug>/scratchpad/<phase>/` via `writeScratchpad()` instead of keeping it in conversation context. Cleared automatically at phase transitions via `clearScratchpad()` in PHASE_COMPLETE. Skipped entirely when `rigorLevel` is `"fast"`.
+- **Checkpoint commits** — After each checkpoint verification passes in EXECUTE, `commitAtCheckpoint()` creates a commit with message `[autopilot] <label>`. On failure with retry budget exhausted, a recovery commit is created with `[autopilot][recovery] <label>`. Skipped silently when no changes exist.
+- **Rigor levels** — `--fast` / `--thorough` flags set `rigorLevel` in state during INIT. Persists for the entire session. Affects PLAN (skeleton skip/require), APPROVE_PLAN (auto-approve/force-review), EXECUTE (scratchpad usage), and VERIFY (minimal/full). See the Rigor Behavior Table for details.
 ## Success Criteria
@@ -638,6 +714,11 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
 - Plans auto-approve when no concern signals fire; plans pause for human review when any signal fires
 - `--review-plans` flag forces human review for all plans in a session
 - Phase completion summary shown between every phase
+- `--fast` skips skeleton approval, skips scratchpad, auto-approves plans, and runs minimal verification
+- `--thorough` requires skeleton approval, uses verbose scratchpad, forces plan review, and runs full verification
+- Scratchpad is cleared automatically at every phase transition (PHASE_COMPLETE)
+- Checkpoint commits fire after every passing checkpoint; recovery commits fire on retry budget exhaustion
+- Rigor level persists across session resume — set once during INIT, never changed mid-session
 ## Examples
@@ -645,6 +726,34 @@ INIT → ASSESS → PLAN → APPROVE_PLAN → EXECUTE → VERIFY → REVIEW →
 **User invokes:** `/harness:autopilot docs/changes/security-scanner/proposal.md`
+**Or with rigor flag:** `/harness:autopilot docs/changes/security-scanner/proposal.md --fast`
+**INIT (with --fast):**
+```
+Read spec — found 3 phases:
+  Phase 1: Core Scanner (complexity: low)
+  Phase 2: Rule Engine (complexity: high)
+  Phase 3: CLI Integration (complexity: low)
+Rigor level: fast
+Created .harness/sessions/changes--security-scanner--proposal/autopilot-state.json. Starting Phase 1.
+```
+**Phase 1 — APPROVE_PLAN (fast mode):**
+```
+Auto-approved Phase 1: Core Scanner (fast mode — signal evaluation skipped)
+```
+**Phase 1 — EXECUTE (checkpoint commit):**
+```
+[harness-task-executor executes 8 tasks]
+Checkpoint 1: types and interfaces — committed (abc1234)
+Checkpoint 2: core implementation — committed (def5678)
+Checkpoint 3: tests and validation — nothing to commit (skipped)
+```
 **INIT:**
 ```

package/dist/agents/skills/claude-code/harness-autopilot/skill.yaml CHANGED Viewed

@@ -26,6 +26,12 @@ cli:
     - name: review-plans
       description: Force human review of all plans (overrides auto-approve)
       required: false
+    - name: fast
+      description: Run with reduced rigor — skip skeleton approval, skip scratchpad, minimal verification
+      required: false
+    - name: thorough
+      description: Run with maximum rigor — require skeleton approval, verbose scratchpad, full verification
+      required: false
 mcp:
   tool: run_skill
   input:

package/dist/agents/skills/claude-code/harness-code-review/SKILL.md CHANGED Viewed

@@ -58,6 +58,20 @@ interface ReviewFinding {
 | `--deep`          | Pass `--deep` to `harness-security-review` for threat modeling in the security fan-out slot |
 | `--no-mechanical` | Skip mechanical checks (useful if already run in CI)                                        |
 | `--ci`            | Enable eligibility gate, non-interactive output                                             |
+| `--fast`          | Reduced rigor: skip learnings integration, fast-tier agents for all fan-out slots           |
+| `--thorough`      | Maximum rigor: always load learnings, full agent roster + meta-judge, learnings in output   |
+### Rigor Levels
+The `rigorLevel` is set via `--fast` or `--thorough` flags (or passed by autopilot). Default is `standard`. Rigor controls learnings integration, agent tier selection, and output verbosity.
+| Phase      | `fast`                                                            | `standard` (default)                                                                               | `thorough`                                                                                                      |
+| ---------- | ----------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
+| 3. CONTEXT | Skip learnings integration entirely. No `filterByRelevance` call. | Load learnings if `.harness/review-learnings.md` exists. Score and filter via `filterByRelevance`. | Always load learnings. Score and filter via `filterByRelevance`. Fail loudly if learnings file is missing.      |
+| 4. FAN-OUT | All agents run at fast tier. Reduced focus areas.                 | Default tier assignments (see Model Tiers table).                                                  | Full agent roster at default tiers + meta-judge pass that cross-validates findings across domains.              |
+| 7. OUTPUT  | Standard output format.                                           | Standard output format.                                                                            | Include a "Learnings Applied" section listing which learnings influenced the review and their relevance scores. |
+When `rigorLevel` is `fast`, the pipeline optimizes for speed: learnings are skipped entirely and all fan-out agents run at fast tier. When `rigorLevel` is `thorough`, the pipeline optimizes for depth: learnings are always scored and included, the full agent roster runs, a meta-judge validates cross-domain findings, and the output includes which learnings were applied.
 ### Model Tiers
@@ -71,19 +85,39 @@ Tiers are abstract labels resolved at runtime from project config. If no config
 ### Review Learnings Calibration
-Before starting the pipeline, check for a project-specific calibration file:
+Before starting the pipeline, check for a project-specific calibration file. Learnings integration is gated by rigor level:
+- **`fast`:** Skip this section entirely. Do not read or score learnings.
+- **`standard`:** Read learnings if the file exists. Score and filter. If the file does not exist, proceed with default focus areas.
+- **`thorough`:** Always read learnings. If `.harness/review-learnings.md` does not exist, log a warning: "No review-learnings.md found -- thorough mode expects calibration data."
 ```bash
 cat .harness/review-learnings.md 2>/dev/null
 ```
-If `.harness/review-learnings.md` exists:
+If `.harness/review-learnings.md` exists (and rigor is not `fast`):
 1. **Read the Useful Findings section.** Prioritize these categories during review — they have historically caught real issues in this project.
 2. **Read the Noise / False Positives section.** De-prioritize or skip these categories — flagging them wastes the author's time and erodes trust in the review process.
 3. **Read the Calibration Notes section.** Apply these project-specific overrides to your review judgment. These represent deliberate team decisions, not oversights.
-If the file does not exist, proceed with default review focus areas. After completing the review, consider suggesting that the team create `.harness/review-learnings.md` if you notice patterns that would benefit from calibration.
+#### Learnings Relevance Scoring
+When learnings are loaded (standard or thorough mode), score them against the diff context before applying:
+1. **Build the diff context string.** Concatenate: changed file paths (one per line) + diff summary (commit message or PR description).
+2. **Score each learning** using `filterByRelevance(learnings, diffContext, 0.7, 1000)` from `packages/core/src/state/learnings-relevance.ts`.
+   - Each learning is scored against the diff context via Jaccard similarity.
+   - Only learnings scoring >= 0.7 are retained.
+   - Results are sorted by score descending.
+   - Results are truncated to fit within the 1000-token budget.
+3. **Apply filtered learnings** to the review focus areas:
+   - Useful Findings entries that pass the filter: boost priority for those categories.
+   - Noise/False Positive entries that pass the filter: actively suppress those patterns.
+   - Calibration Notes entries that pass the filter: apply as overrides.
+4. **If no learnings pass the 0.7 threshold,** proceed with default focus areas. Do not fall back to unscored inclusion.
+If the file does not exist and rigor is `standard`, proceed with default review focus areas. After completing the review, consider suggesting that the team create `.harness/review-learnings.md` if you notice patterns that would benefit from calibration.
 ## Pipeline Phases
@@ -285,6 +319,12 @@ Use commit history to answer:
 **Tier:** mixed (see per-agent tiers below)
 **Purpose:** Run four parallel review subagents, each with domain-scoped context from Phase 3. Each agent produces findings in the `ReviewFinding` schema.
+**Rigor overrides:**
+- **`fast`:** All four agents run at **fast tier** (haiku-class). Focus areas are unchanged but agents operate with reduced reasoning depth.
+- **`standard`:** Default tier assignments as listed per agent below.
+- **`thorough`:** Default tier assignments + a **meta-judge pass** after all agents return. The meta-judge (strong tier) cross-validates findings across domains: confirms findings cited by multiple agents, flags contradictions, and surfaces cross-cutting concerns that individual agents missed.
 #### Compliance Agent (standard tier)
 Reviews adherence to project conventions, standards, and documentation requirements.
@@ -493,6 +533,16 @@ For each issue, provide:
 - **Request Changes** — Critical or important issues must be addressed.
 - **Comment** — Observations only, no blocking issues.
+**Learnings Applied (thorough mode only):** When `rigorLevel` is `thorough`, append a "Learnings Applied" section after the Assessment:
+```
+**Learnings Applied:**
+- [0.85] "Useful Finding: Missing error handling in service layer" — boosted priority for error handling checks
+- [0.72] "Noise: Style-only import ordering" — suppressed import order findings
+```
+Each entry shows the Jaccard relevance score and how the learning influenced the review. This section is omitted in `fast` and `standard` modes.
 **Exit code:** 0 for Approve/Comment, 1 for Request Changes.
 #### Inline GitHub Comments (`--comment` flag)
@@ -679,6 +729,8 @@ Every review finding MUST cite evidence using one of:
 - **`harness cleanup`** — Optional check during Phase 2 for entropy accumulation in changed files.
 - **Graph queries** — Used in Phase 3 (CONTEXT) for dependency-scoped context and in Phase 5 (VALIDATE) for reachability verification. Graceful fallback when no graph exists.
 - **`emit_interaction`** -- Call after review approval to suggest transitioning to merge/PR creation. Only emitted on APPROVE assessment. Uses confirmed transition (waits for user approval).
+- **Rigor levels** — `--fast` / `--thorough` flags control learnings integration and agent tiers. Fast skips learnings and runs all agents at fast tier. Standard includes learnings if available. Thorough always loads learnings, runs a meta-judge pass, and includes a "Learnings Applied" section in output. See the Rigor Levels table for details.
+- **`filterByRelevance`** — Used in the Review Learnings Calibration section (Phase 3) to score learnings against diff context. Threshold 0.7, token budget 1000. From `packages/core/src/state/learnings-relevance.ts`.
 ## Success Criteria
@@ -696,6 +748,10 @@ Every review finding MUST cite evidence using one of:
 - No code merges with failing harness checks
 - Response to feedback (Role C) is verified before implementation
 - Pushback on incorrect feedback is evidence-based
+- When `rigorLevel` is `fast`, learnings integration is skipped and all fan-out agents run at fast tier
+- When `rigorLevel` is `thorough`, learnings are always loaded and scored, a meta-judge validates cross-domain findings, and a "Learnings Applied" section appears in the output
+- When `rigorLevel` is `standard`, learnings are included if `.harness/review-learnings.md` exists, scored via `filterByRelevance` at 0.7 threshold
+- When all learnings score below 0.7 threshold, zero learnings are included (no fallback to unscored inclusion)
 ## Examples
@@ -753,6 +809,44 @@ Every review finding MUST cite evidence using one of:
 - **Never agree performatively.** "Sure, I'll change that" without understanding why is forbidden. Every change must be understood.
 - **Never skip the YAGNI check.** Every suggestion must answer: "Does this serve a current, concrete need?" Speculative improvements are rejected.
+## Red Flags
+### Universal
+These apply to ALL skills. If you catch yourself doing any of these, STOP.
+- **"I believe the codebase does X"** — Stop. Read the code and cite a file:line
+  reference. Belief is not evidence.
+- **"Let me recommend [pattern] for this"** without checking existing patterns — Stop.
+  Search the codebase first. The project may already have a convention.
+- **"While we're here, we should also [unrelated improvement]"** — Stop. Flag the idea
+  but do not expand scope beyond the stated task.
+### Domain-Specific
+- **"The change looks reasonable, approving"** — Stop. Have you read every changed file? Approval without full review is rubber-stamping.
+- **"Let me fix this issue I found"** — Stop. Review identifies issues; it does not fix them. Suggest the fix, do not apply it.
+- **"This is a minor style issue"** — Stop. Is it a style issue or a readability/maintainability concern? Classify accurately before dismissing.
+- **"The author probably meant to..."** — Stop. Do not infer intent. If the code is ambiguous, flag it as a question for the author.
+## Rationalizations to Reject
+### Universal
+These reasoning patterns sound plausible but lead to bad outcomes. Reject them.
+- **"It's probably fine"** — "Probably" is not evidence. Verify before asserting.
+- **"This is best practice"** — Best practice in what context? Cite the source and
+  confirm it applies to this codebase.
+- **"We can fix it later"** — If it is worth flagging, it is worth documenting now
+  with a concrete follow-up plan.
+### Domain-Specific
+- **"The tests pass, so the logic must be correct"** — Tests can be incomplete. Review the logic independently of test results.
+- **"This is how it was done elsewhere in the codebase"** — Existing patterns can be wrong. Evaluate the pattern on its merits, not just its precedent.
+- **"It's just a refactor, low risk"** — Refactors change behavior surfaces. Review them with the same rigor as feature changes.
 ## Escalation
 - **When reviewers disagree:** If two reviewers give contradictory feedback, escalate to the human or tech lead.

package/dist/agents/skills/claude-code/harness-code-review/skill.yaml CHANGED Viewed

@@ -33,6 +33,12 @@ cli:
     - name: --ci
       description: Enable eligibility gate, non-interactive output
       required: false
+    - name: --fast
+      description: Reduced rigor — skip learnings integration, fast-tier agents only
+      required: false
+    - name: --thorough
+      description: Maximum rigor — always load learnings, full agent roster + meta-judge
+      required: false
 mcp:
   tool: run_skill
   input:

package/dist/agents/skills/claude-code/harness-codebase-cleanup/SKILL.md CHANGED Viewed

@@ -26,7 +26,7 @@
 ### Phase 1: CONTEXT -- Build Hotspot Map
-1. **Run hotspot detection** via `harness skill run harness-hotspot-detector` or equivalent git log analysis:
+1. **Run hotspot detection** via git log analysis:
    ```bash
    git log --format=format: --name-only --since="6 months ago" | sort | uniq -c | sort -rn | head -50
    ```
@@ -38,7 +38,6 @@
 1. **Dead code detection** (skip if `--architecture-only`):
    - Run `harness cleanup --type dead-code --json`
-   - Or use the `detect_entropy` MCP tool with `type: 'dead-code'`
    - Captures: dead files, dead exports, unused imports, dead internals, commented-out code blocks, orphaned dependencies
 2. **Architecture detection** (skip if `--dead-code-only`):
@@ -204,8 +203,7 @@ After removing the `legacy-auth` module:
 - **`harness cleanup --type dead-code --json`** -- Dead code detection input
 - **`harness check-deps --json`** -- Architecture violation detection input
-- **`harness skill run harness-hotspot-detector`** -- Hotspot context for safety classification
-- **`detect_entropy` MCP tool with `autoFix: true`** -- Detects entropy and applies safe fixes via the MCP server
+- **`git log` analysis** -- Hotspot context for safety classification (inline command, no skill invocation needed)
 - **`harness validate`** -- Final validation after all fixes
 - **`harness check-deps`** -- Final architecture check after all fixes

package/dist/agents/skills/claude-code/harness-database/SKILL.md CHANGED Viewed

@@ -250,6 +250,58 @@ CREATE POLICY tenant_isolation ON users
 - **Migration files must include rollback logic.** Every `up` function must have a corresponding `down` function. WHERE a migration is irreversible (data loss on rollback), THEN it must be explicitly marked as such with a comment explaining why.
 - **No migrations that lock large tables without warning.** WHERE a migration performs an ALTER TABLE that acquires an ACCESS EXCLUSIVE lock on a table estimated to have more than 10,000 rows, THEN the skill must flag the lock risk and suggest a non-locking alternative.
+## Evidence Requirements
+When this skill makes claims about existing code, architecture, or behavior,
+it MUST cite evidence using one of:
+1. **File reference:** `file:line` format (e.g., `src/auth.ts:42`)
+2. **Code pattern reference:** `file` with description (e.g., `src/utils/hash.ts` —
+   "existing bcrypt wrapper")
+3. **Test/command output:** Inline or referenced output from a test run or CLI command
+4. **Session evidence:** Write to the `evidence` session section via `manage_state`
+**Uncited claims:** Technical assertions without citations MUST be prefixed with
+`[UNVERIFIED]`. Example: `[UNVERIFIED] The auth middleware supports refresh tokens`.
+## Red Flags
+### Universal
+These apply to ALL skills. If you catch yourself doing any of these, STOP.
+- **"I believe the codebase does X"** — Stop. Read the code and cite a file:line
+  reference. Belief is not evidence.
+- **"Let me recommend [pattern] for this"** without checking existing patterns — Stop.
+  Search the codebase first. The project may already have a convention.
+- **"While we're here, we should also [unrelated improvement]"** — Stop. Flag the idea
+  but do not expand scope beyond the stated task.
+### Domain-Specific
+- **"Running this migration in production"** without a rollback plan — Stop. Every migration must have a tested reverse migration before it touches production data.
+- **"Adding an index to speed up this query"** without checking write patterns — Stop. Indexes speed reads but slow writes. Check both access patterns before recommending.
+- **"Dropping this column, it's unused"** — Stop. Verify no application code references it — including ORMs, background jobs, analytics queries, and reporting systems.
+- **"Let's denormalize this for performance"** — Stop. Denormalization decisions are hard to reverse. Cite the specific query performance problem with evidence before recommending.
+## Rationalizations to Reject
+### Universal
+These reasoning patterns sound plausible but lead to bad outcomes. Reject them.
+- **"It's probably fine"** — "Probably" is not evidence. Verify before asserting.
+- **"This is best practice"** — Best practice in what context? Cite the source and
+  confirm it applies to this codebase.
+- **"We can fix it later"** — If it is worth flagging, it is worth documenting now
+  with a concrete follow-up plan.
+### Domain-Specific
+- **"The table is small, we don't need an index"** — Tables grow. Plan for the steady state, not the current row count.
+- **"The ORM handles this for us"** — ORMs generate SQL that may not match your performance expectations. Review the generated queries for correctness and efficiency.
+- **"We can always add a migration later"** — Schema changes in production have operational cost. Design the schema thoughtfully now rather than migrating repeatedly.
 ## Escalation
 - **Production data at risk:** When a migration would delete or overwrite existing data (DROP COLUMN, column type change that truncates), report: "This migration will permanently delete data in column `X`. Provide a data backup confirmation or approve a non-destructive alternative (add new column, backfill, drop old) before proceeding."

package/dist/agents/skills/claude-code/harness-deployment/SKILL.md CHANGED Viewed

@@ -247,6 +247,58 @@ Phase 4: VALIDATE
 - **No deploy without rollback.** Every deployment target must have a documented or automated rollback mechanism. Missing rollback is a blocking warning.
 - **No skipping pipeline lint.** Pipeline configuration must pass syntax validation before recommendations are made.
+## Evidence Requirements
+When this skill makes claims about existing code, architecture, or behavior,
+it MUST cite evidence using one of:
+1. **File reference:** `file:line` format (e.g., `src/auth.ts:42`)
+2. **Code pattern reference:** `file` with description (e.g., `src/utils/hash.ts` —
+   "existing bcrypt wrapper")
+3. **Test/command output:** Inline or referenced output from a test run or CLI command
+4. **Session evidence:** Write to the `evidence` session section via `manage_state`
+**Uncited claims:** Technical assertions without citations MUST be prefixed with
+`[UNVERIFIED]`. Example: `[UNVERIFIED] The auth middleware supports refresh tokens`.
+## Red Flags
+### Universal
+These apply to ALL skills. If you catch yourself doing any of these, STOP.
+- **"I believe the codebase does X"** — Stop. Read the code and cite a file:line
+  reference. Belief is not evidence.
+- **"Let me recommend [pattern] for this"** without checking existing patterns — Stop.
+  Search the codebase first. The project may already have a convention.
+- **"While we're here, we should also [unrelated improvement]"** — Stop. Flag the idea
+  but do not expand scope beyond the stated task.
+### Domain-Specific
+- **"Deploying without a health check endpoint"** — Stop. Without health checks, the orchestrator cannot detect failed deployments. Add health checks before deploying.
+- **"Skipping canary deployment, it's a small change"** — Stop. Small changes cause outages too. Follow the deployment policy regardless of change size.
+- **"Rolling back manually if something goes wrong"** — Stop. Manual rollback under incident pressure fails. Automate rollback before deploying.
+- **"We can update the runbook after the deploy"** — Stop. If the deployment changes operational behavior, update the runbook first. Stale runbooks during incidents cause escalations.
+## Rationalizations to Reject
+### Universal
+These reasoning patterns sound plausible but lead to bad outcomes. Reject them.
+- **"It's probably fine"** — "Probably" is not evidence. Verify before asserting.
+- **"This is best practice"** — Best practice in what context? Cite the source and
+  confirm it applies to this codebase.
+- **"We can fix it later"** — If it is worth flagging, it is worth documenting now
+  with a concrete follow-up plan.
+### Domain-Specific
+- **"It's just a config change, not a code change"** — Config changes cause outages at the same rate as code changes. Deploy them with the same rigor and rollback strategy.
+- **"We tested this in staging"** — Staging is not production. Traffic patterns, data volume, and edge cases differ. Staging success does not guarantee production safety.
+- **"Downtime will be brief"** — Brief is not zero. Quantify the expected impact and communicate it to stakeholders before deploying.
 ## Escalation
 - **When the CI/CD platform is unsupported:** Report which platform was detected and that analysis is limited to general best practices. Recommend the user provide platform-specific documentation for deeper analysis.