npm - lithermes-ai - Versions diffs - 0.5.0 - Mend

lithermes-ai 0.5.0


Files changed (133)

package/assets/lithermes-plugin/skills/lit-plan/SKILL.md ADDED

@@ -0,0 +1,374 @@
+---
+name: lit-plan
+description: Hermes-native planning consultant for /lit-plan — explore-first grounding, interview, approval gate, then a gap-analysis + plan-review pass.
+---
+# LitHermes Planning Consultant
+> **Hermes-native overrides (authoritative — read first).** The only
+> subagent primitive available is the model-facing `delegate_task(tasks:[{goal,
+> context, toolsets?, role?}])` tool — children run in parallel, parent blocks
+> for all. There is **no** `spawn_agent`, no named-agent registry, no
+> per-child model selection, and no `agents/*.toml`. Hermes has **no**
+> model-facing goal tools natively; LitHermes adds durable tools
+> `goal_set` / `goal_add_criterion` / `goal_evidence` / `goal_criterion_status`
+> / `goal_steer` / `goal_checkpoint` / `goal_complete` and binds the native
+> standing `/goal` via the session goal manager. Plan state lives under
+> `.hermes/lithermes/`. The `/lit-plan` command stamps a plan markdown to
+> `plans/<slug>.md` — this skill injects the consultant discipline that turns
+> that template into a genuinely reasoned artifact.
+This skill governs how Hermes behaves when `/lit-plan` is invoked. The plan is
+the **durable artifact** that the subsequent goal loop executes — treat producing
+it with the same rigour you would bring to execution.
+LitHermes intentionally fuses planning with execution: `/lit-plan` bootstraps
+the goal and hands off to the execution loop. Do NOT impose a hard
+"planner never implements" rule here. The skill's job is to ensure that what
+gets handed off is grounded, complete, and approved.
+---
+## Phase 0 — Classify the Request
+Before doing anything else, classify the incoming brief into one of three tiers.
+This determines how much exploration, interviewing, and review you invest.
+| Tier | Signal | Exploration depth | Interview rounds | Review pass |
+|------|--------|-------------------|------------------|-------------|
+| **Trivial** | Single file, no new API surface, no cross-cutting concern | Read 2-4 files, no fan-out | 0-1 quick clarifications | Gap-analysis only (skip plan-review) |
+| **Standard** | Multi-file change, touches existing API, moderate scope | Parallel fan-out across patterns + test infra | 1-2 rounds | Both gap-analysis and plan-review |
+| **Architecture** | New subsystem, DB schema, public API, third-party integration, migration | Full repo survey + external doc fetch | Up to 3 rounds | Both passes; plan-review in strict mode |
+**Default: Standard.** Escalate to Architecture when any one of these is true:
+the change touches 5+ modules, introduces a new persistence layer, crosses a
+service boundary, or the brief uses words like "migrate", "replace", "redesign",
+or "integration".
+Emit the classification as the first line of your planning turn:
+```
+[CLASSIFY] Tier: Standard — multi-file change across auth and session modules.
+```
+---
+## Phase 1 — Explore-First Grounding
+**Rule: discoverable facts → explore first. Genuine preferences and tradeoffs →
+ask the user.** Never open an interview before you have done the reading.
+### 1a. What to discover
+Fan out read-only `delegate_task` children (one batch call, children run in
+parallel) to gather:
+- **Repository patterns**: entry points, module boundaries, naming conventions,
+  existing abstractions the new work must align with.
+- **Test infrastructure**: test runner, test helpers, fixture patterns, how
+  integration tests hit the real surface (the channel scenarios you will need).
+- **Existing implementations**: any prior attempt at this feature or a closely
+  related one. Naming collisions. Dead code that may interfere.
+- **Dependency landscape**: which packages are already present and at which
+  version, so you don't recommend adding what is already there.
+- **External facts** (Architecture tier only): official docs for any
+  dependency, API, or protocol the plan references — fetch and cite SHA-pinned
+  permalinks; never mutate the worktree from a research child.
+Each read-only child's goal/context should be inlined directly — no named agent
+type, no foreign registry. Example batch:
+```
+delegate_task(tasks: [
+  {
+    goal: "Find the test runner config and locate two representative integration tests that exercise the session layer. Return file paths and a 2-sentence pattern summary.",
+    context: "Workspace: <path>. Read-only; do not write files."
+  },
+  {
+    goal: "Locate all existing middleware registration points. Return file:line refs and the registration pattern (decorator vs. config dict vs. imperative).",
+    context: "Workspace: <path>. Read-only; do not write files."
+  }
+])
+```
+### 1b. While children run
+Use your own direct read-only tools concurrently: skim the repo root, `package.json` / `pyproject.toml` / `go.mod`, `README`, any `ARCHITECTURE` doc, and the most recent 5 git log lines. These are fast and cheap; do them yourself rather than burning a child slot.
+### 1c. Harvest
+Collect child results. Consolidate into a single internal **Grounding Summary**:
+```
+## Grounding Summary (internal, not shown to user yet)
+- Test runner: <x>  integration test pattern: <file:line>
+- Middleware registration: <pattern>  canonical example: <file:line>
+- Prior attempt: <file> — status: <incomplete / removed / active>
+- External API: <name>  docs: <url-or-"not fetched — Trivial tier">
+- Open ambiguities that exploration cannot resolve: <list>
+```
+Exploration stops when you have enough to write a first-draft plan OR after two
+waves yield no new material — whichever comes first.
+---
+## Phase 2 — Interview Only the Genuine Unknowns
+Interview questions are for **genuine preferences, tradeoffs, and
+constraints** that exploration cannot settle. They are not for facts you could
+read from the repo.
+**Format each question with a recommended default:**
+```
+Q1. Should the new endpoint be versioned under /v2/ or extend the existing /v1/ router?
+    Recommended default: extend /v1/ — no breaking change needed based on the grep results.
+Q2. Is eventual consistency acceptable for the cache invalidation step, or must it be synchronous?
+    Recommended default: synchronous — the existing pattern in cache.ts:42 uses sync invalidation.
+```
+For Trivial tier: skip interview entirely unless there is exactly one question
+that blocks the plan. For Standard: 1-3 questions maximum. For Architecture:
+up to 5 questions; never more.
+**Wait for the user's reply before proceeding.** Do not draft a plan in parallel
+with an outstanding question.
+---
+## Phase 3 — Approval Gate (Non-Negotiable)
+Before generating or finalising the plan, present all three of the following and
+**explicitly ask for the user's go-ahead**:
+### A. Grounding facts surfaced
+A bulleted list of the non-obvious things exploration found — file paths,
+patterns, prior implementations, dependency versions. Keep it tight; omit
+obvious things the user already knows.
+### B. Remaining ambiguities with recommended defaults
+Any question that the user did not fully resolve in Phase 2, restated with the
+default you will apply if they just say "yes, proceed". If no ambiguities
+remain, say so explicitly.
+### C. Intended approach
+A plain-English paragraph (3-6 sentences) describing what the plan will prescribe:
+which modules change, which new files are created, what the test strategy is,
+and what the manual-QA channel scenarios will look like. No markdown structure
+yet — this is the pitch, not the plan.
+Then close with a literal gate line:
+```
+Ready to generate the plan. Please confirm (or steer) before I finalise.
+```
+**Narrow exception**: if this planning turn was triggered by a `/lit-plan
+--bootstrap` flag or an equivalent start-work invocation that is meant to get
+execution started immediately, you may proceed to Phase 4 without waiting — but
+only when the brief is unambiguous, Trivial tier, and exploration found no
+conflicts. Log the skip as `[APPROVAL_GATE_SKIPPED: bootstrap flag + Trivial + no conflicts]`.
+---
+## Phase 4 — Generate the Plan
+Use the template that `/lit-plan` stamped. Fill every section with real
+content; never leave placeholder text in the final output. The machine-parseable
+shape of the Success Criteria block is fixed — preserve it exactly:
+```
+- [ ] C001 | channel: tmux | test: <path::test_id> | scenario: <user-visible outcome>
+```
+### TL;DR
+One sentence summary. Bullet deliverables. Effort and Risk ratings with a
+one-line driver for risk.
+### Success Criteria
+Declare at minimum 3 criteria — more is fine. Every criterion must be:
+- **Paired with an automated test** written before the implementation (file
+  path + test id, not a description). This is the floor.
+- **Paired with a manual-QA channel scenario** (the ceiling). Name the channel
+  (tmux / http / browser / computer) and state what will be run and what
+  the expected observable outcome is. "Tests pass" alone is never a criterion.
+- **Concrete enough to fail**: a scenario that cannot be falsified is not a
+  criterion.
+Cover the happy path, at least one edge/boundary case, and at least one
+adjacent-surface regression check that names the specific file and function at
+risk.
+### Scope
+Must-have items are the deliverables. Must-NOT-have items are the guardrails:
+explicit exclusions that prevent scope creep, over-engineering, or accidental
+introduction of unrelated changes.
+### Verification Strategy
+Name the test framework. State TDD or tests-after (default: TDD). State where
+evidence artifacts land (`.hermes/lithermes/runs/<run>/evidence/`).
+### Execution Waves
+Target 5-8 todos per wave. Fewer than 3 per wave means you are under-splitting.
+50+ total todos across all waves is fine. Each todo encompasses **both**
+implementation and its test — never split them into separate todos.
+**Dependency matrix**: every todo that depends on another must name its
+dependency explicitly. Anything without a dependency goes in Wave 1 and runs in
+parallel.
+### Per-Todo Contract
+Every single todo must carry all four of:
+1. **References**: `file:line` — the exact pattern or contract this todo must
+   follow. Not a description; a pointer.
+2. **Acceptance**: a verifiable command or assertion — something you can run to
+   determine pass/fail unambiguously.
+3. **QA scenario**: `tool=<tmux|curl|playwright|...> steps=<...>
+   expected=<binary pass/fail> evidence=<path>`. Every user-facing behavior
+   must name a channel; data-only or CLI-only behaviors may name `cli`.
+4. **Commit**: a Conventional Commit message in the form
+   `<type>(<scope>): <imperative>`.
+### Final Verification Wave
+Always four fixed items, always last, always all four:
+- **F1** — Plan compliance audit: every task and acceptance criterion met.
+- **F2** — Code quality / diagnostics clean, idioms match, no dead code.
+- **F3** — Real manual QA: every criterion's channel scenario run fresh, with
+  evidence captured and a cleanup receipt recorded.
+- **F4** — Scope fidelity: nothing extra, nothing Must-NOT-have introduced.
+### Commit Strategy
+Atomic Conventional Commits. Each commit builds and tests green on its own.
+Final commit footer: `Plan: plans/<slug>.md`.
+---
+## Phase 5 — Pre-Finalize Review
+Before declaring the plan ready, run two read-only passes. Both are
+`delegate_task` children with their mandates inlined.
+### Pass A — Pre-Plan Gap Analysis
+```
+delegate_task(tasks: [{
+  goal: "Pre-plan gap-analysis. Read the plan draft below and return a verdict of CLEAR or GAPS-FOUND. Find: (1) internal contradictions between sections, (2) ambiguous or missing constraints that would block execution, (3) execution risks not acknowledged in the risk rating, (4) topology gaps — todos that cannot start because a dependency is missing or circular. For each gap found, cite the plan section and propose a minimal fix. Do not propose new features. Return verdict on its own line as the last line.",
+  context: "<paste plan markdown here>"
+}])
+```
+**If verdict is GAPS-FOUND**: fold the fixes in silently (do not re-open the
+approval gate for minor fixes) unless a fix changes the intended approach
+substantially — in that case, surface the delta to the user first.
+### Pass B — Plan Review
+```
+delegate_task(tasks: [{
+  goal: "Plan review. Read the plan draft below and return a verdict of OKAY, ITERATE, or REJECT. Check: (1) every referenced file or path exists and contains the claimed content — if you cannot verify without a read tool, say so explicitly rather than assuming; (2) every todo is startable — it has a clear trigger, no hidden pre-conditions, and its acceptance criterion is verifiable by a command; (3) every QA scenario is concrete — it names a real tool, real steps, and a binary expected outcome. Approval-biased: when in doubt, approve. ITERATE allows ≤3 fixable issues and ≤2 auto rounds of revision before escalating. REJECT only when a decision is needed from the user.",
+  context: "<paste plan markdown here>"
+}])
+```
+**Verdict handling**:
+| Verdict | Action |
+|---------|--------|
+| OKAY | Proceed. Surface the final plan path. |
+| ITERATE | Apply the ≤3 fixes inline, re-run Pass B (up to 2 auto rounds). If still ITERATE after round 2, surface the remaining issues to the user. |
+| REJECT | Surface the issues to the user and wait for a decision before re-drafting. |
+Both passes run in the same `delegate_task` call (parallel) when the plan is
+Standard or Architecture tier. For Trivial tier: run Pass A only; skip Pass B.
+---
+## Output — Surfacing the Final Plan
+Once every review pass that applies to the tier clears (Pass A alone for Trivial), output:
+1. A one-line summary of what exploration found that materially shaped the plan
+   (so the user knows it was grounded, not guessed).
+2. The path to the written plan: `plans/<slug>.md`.
+3. The open success criteria count and the first wave of todos (so the user can
+   spot-check scope at a glance).
+4. The handoff line — this bridges to the execution loop:
+```
+Plan is ready. Use /lit-loop "<brief>" or /start-work <slug> to begin execution.
+The plan path will appear in each execution commit footer: Plan: plans/<slug>.md
+```
+Do **not** reproduce the full plan in the chat output — it is already written to
+disk. If the user wants to read it, they can open `plans/<slug>.md`.
+---
+## Delegation Rules
+- Fan out read-only exploration as a single `delegate_task` batch (one call,
+  multiple children in the `tasks` array). Never batch a write-capable child
+  with a read-only child.
+- Gap-analysis and plan-review are both read-only; they may run in the same
+  batch.
+- Never serialize children that are independent. Never parallelize children that
+  share a write target or consume each other's output.
+- Inline every child's mandate — do not reference an external prompt file or a
+  named role from a registry.
+- After a `delegate_task` returns, re-read its output rather than trusting its
+  self-report; verify file paths cited by children actually exist before
+  including them in the plan.
+---
+## Constraints
+- Do not generate the plan before the approval gate clears (Phase 3); the only
+  exception is the narrow bootstrap skip defined in that phase.
+- Do not ask interview questions about facts that exploration could have
+  discovered — if you find yourself asking "which test framework do you use?"
+  after Phase 1, you skipped a read.
+- Do not introduce new abstractions, new dependencies, or new architectural
+  patterns unless the brief explicitly requires them.
+- Every QA scenario for user-facing behavior must use one of the four
+  Manual-QA channels (tmux / http / browser / computer); data-only or CLI-only
+  behaviors may name `cli`, as the Per-Todo Contract allows. `--dry-run`,
+  "should respond", and "looks correct" are not channels.
+- The Success Criteria machine-parseable shape is fixed. Do not alter the
+  `C0NN | channel: | test: | scenario:` format.
+- Plan files live under `plans/<slug>.md`. Do not write to any other path.
+- Exploration children are read-only. They must not write to the worktree, run
+  tests, or mutate any state.
+- Do not add a "planner never executes" hard rule. The plan is the artifact
+  that the execution loop picks up; they are two phases of the same workflow,
+  not separate domains.
+---
+## Stop Rules
+- **Normal stop**: plan is on disk, both review passes are OKAY or CLEAR, plan
+  path has been surfaced to the user.
+- **User steers mid-phase**: update the grounding summary, re-run only the
+  affected phases (e.g., if the user changes the scope during Phase 3,
+  re-generate the relevant sections and re-run Phase 5).
+- **REJECT verdict from plan-review after user input**: re-draft the affected
+  sections and re-run Phase 5 from scratch.
+- **Two exploration waves yield nothing new**: stop exploring and proceed with
+  what you have; surface the gap to the user in Phase 3.
+- **Approval gate blocked for more than one exchange**: surface the exact
+  ambiguity that is blocking progress and propose a default; do not spin.
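The fixed Success Criteria line shape that Phase 4 and the Constraints describe (`- [ ] C001 | channel: tmux | test: <path::test_id> | scenario: <...>`) lends itself to a small validating parser. The following is a minimal Python sketch, not LitHermes's actual parser: the names `parse_criterion`, `Criterion`, and `CHANNELS`, the regular expression, and the assumption that a checked box is written `[x]` are all illustrative. It reads lines from the plan file itself, without the leading `+` of the diff, and accepts only the four Manual-QA channels named in the skill.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the four Manual-QA channel names quoted in the skill text.
CHANNELS = {"tmux", "http", "browser", "computer"}

# Assumption: this pattern mirrors the documented line shape; the real
# parser inside LitHermes may be stricter or looser.
CRITERION_RE = re.compile(
    r"^- \[(?P<done>[ xX])\] (?P<id>C\d{3}) \| channel: (?P<channel>\S+)"
    r" \| test: (?P<test>.+?) \| scenario: (?P<scenario>.+)$"
)


class Criterion(NamedTuple):
    id: str
    done: bool
    channel: str
    test: str
    scenario: str


def parse_criterion(line: str) -> Optional[Criterion]:
    """Parse one Success Criteria line; return None if it does not fit the shape."""
    match = CRITERION_RE.match(line.strip())
    if match is None or match["channel"] not in CHANNELS:
        return None
    return Criterion(
        id=match["id"],
        done=match["done"] != " ",
        channel=match["channel"],
        test=match["test"].strip(),
        scenario=match["scenario"].strip(),
    )
```

Returning `None` for an unknown channel keeps entries such as `channel: dry-run` from slipping into a plan unnoticed, which matches the rule that `--dry-run` is not a channel.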

package/assets/lithermes-plugin/skills/litgoal/.gitkeep ADDED

File without changes

package/assets/lithermes-plugin/skills/litgoal/SKILL.md ADDED

@@ -0,0 +1,207 @@
+---
+name: litgoal
+description: Durable repo-native goal runtime with embedded success criteria, evidence ledger, checkpoints, steering, and an evidence-gated completion contract.
+---
+## LitHermes durable runtime (authoritative — overrides any legacy examples below)
+LitHermes ships litgoal as a **real durable runtime**, not prose. State persists under
+`<workspace>/.hermes/lithermes/litgoal/`:
+- `goals.json` — goals, each with `criteria` (id, scenario, qa_channel, test_ref, status,
+  evidence[]), `checkpoints`, `steering`, and `review_blockers`.
+- `ledger.jsonl` — append-only audit trail (`goal_created`, `criterion_added`,
+  `evidence_added`, `criterion_status`, `checkpoint`, `steer`, `review_blocker`,
+  `goal_completed`).
+- `evidence/` — captured artifacts.
+Drive it through the **model-facing goal tools** (preferred) or the CLI. This skill was ported
+from an upstream workflow plugin; any legacy upstream command names or `.<vendor>/litgoal/…`
+state paths that survive in the prose below do NOT exist in Hermes — translate every one to the
+following native surface:
+| Capability | LitHermes tool | LitHermes CLI |
+| --- | --- | --- |
+| create goals from a brief | `goal_set {objective, criteria[]}` | `hermes lithermes goal set --objective "…" --criterion "scenario\|channel\|test_ref"` |
+| add one criterion | `goal_add_criterion {scenario, qa_channel, test_ref}` | `hermes lithermes goal criterion --scenario "…" --qa-channel tmux --test-ref "…"` |
+| record evidence | `goal_evidence {criterion_id, kind, ref, detail}` | `hermes lithermes goal evidence <cid> --kind red\|green\|scenario\|cleanup --ref "<path/test-id>"` |
+| set criterion status | `goal_criterion_status {criterion_id, status}` | `hermes lithermes goal criterion-status <cid> pass` |
+| steer | `goal_steer {directive}` | `hermes lithermes goal steer "<directive>"` |
+| checkpoint | `goal_checkpoint {summary, active_criterion}` | `hermes lithermes goal checkpoint "<summary>"` |
+| record/clear review blockers | (add) `goal_steer` blocker / (clear) resolve-blocker | `hermes lithermes goal blocker "<detail>"` / `resolve-blocker <id>` |
+| status | `goal_status` | `hermes lithermes goal status` |
+| complete | `goal_complete` | `hermes lithermes goal complete` |
+**Evidence-gated completion (the whole point).** `goal_complete` is *refused* until the
+quality gate passes: every criterion must be `pass` AND carry both a `green` (RED→GREEN
+proof) and a `scenario` (manual-QA artifact) evidence entry, AND no `review_blocker` may be
+unresolved. The active goal + the next unmet criteria + the gate status are injected into
+every turn via the plugin's `pre_llm_call` snapshot, so you always see what is still owed.
+Native Hermes goals/subgoals still exist; litgoal **layers** the durable criteria/
+evidence/checkpoint/steering/gate on top — keep one native objective, do not duplicate it.
+The discipline below is correct in spirit. If any legacy upstream command name or foreign
+state path slips through, treat it as a LEGACY illustration only — never run an upstream
+binary and never write under any path other than `.hermes/lithermes/litgoal/`. Translate
+each to the LitHermes goal tool / `hermes lithermes goal` CLI / `.hermes/lithermes/litgoal/`
+surface from the table above.
+## Role
+Expert goal orchestration agent. Plan multi-goal work that survives across turns and sessions.
+Work outcome-first: evidence-bound, atomic decisions, no nested branching prose.
+## Goal
+Deliver every goal in `.hermes/lithermes/litgoal/goals.json` end-to-end.
+Prove EVERY success criterion with captured observable evidence from a real-usage scenario you actually ran (HTTP call / tmux / browser use / computer use — see the Manual-QA channels below).
+TESTS ALONE NEVER PROVE DONE. A green test suite is supporting evidence, not completion proof.
+Audit each pass, fail, block, steering change, and checkpoint in `.hermes/lithermes/litgoal/ledger.jsonl`.
+## Manual-QA channels (PICK ONE PER CRITERION — ACTUALLY RUN IT)
+For every criterion, build a real-usage scenario through ONE of these four channels and run it yourself before recording PASS. The full test suite being green is NEVER verification on its own.
+1. **HTTP call** — hit the live endpoint with `curl -i` (or a Playwright APIRequestContext); capture status line + headers + body.
+2. **tmux** — `tmux new-session -d -s lit-qa-<criterion>`, drive with `send-keys`, dump via `tmux capture-pane -t lit-qa-<criterion> -p -S - -E -`; transcript is the artifact.
+3. **Browser use** — drive the real page via Playwright / puppeteer / Chromium; capture action log + screenshot path.
+4. **Computer use** — OS-level GUI automation (computer-use agent, AppleScript, xdotool, etc.) against the running app; capture action log + screenshot.
+Auxiliary surfaces (pure CLI stdout / DB state diff / parsed config dump) satisfy CLI- or data-shaped criteria but NEVER replace a channel scenario for user-facing behavior. `--dry-run`, printing the command, "should respond", and "looks correct" never count.
+## Artifacts
+- `.hermes/lithermes/litgoal/brief.md`: original brief and durable constraints.
+- `.hermes/lithermes/litgoal/goals.json`: goals with embedded `successCriteria` per goal.
+- `.hermes/lithermes/litgoal/ledger.jsonl`: append-only audit trail.
+- Read artifacts before resuming, steering, or checkpointing.
+- Never invent state outside `.hermes/lithermes/litgoal/` artifacts or `goal_status` / `hermes lithermes goal status`.
+## Bootstrap
+Do all three steps before execution. No edits, goal tools, or checkpointing before bootstrap completes.
+### 1. Create goals from the brief
+Call the durable goal tool (preferred), or the CLI:
+```sh
+# model-facing tool call (not a shell command), shown for reference:
+#   goal_set { "objective": "<brief>", "criteria": [ … ] }
+# CLI equivalents
+hermes lithermes goal set --objective "<brief>"
+hermes lithermes goal set --objective-file <path>
+cat <brief> | hermes lithermes goal set --objective-stdin
+```
+Write state through the goal tool or CLI path. Do not hand-edit state files.
+### 2. Refine success criteria per goal
+Define pass/fail acceptance criteria before launching execution lanes. Include the command, artifact, or manual check that will prove success.
+Each goal MUST carry 3+ `successCriteria` covering happy path, edge, regression, and adversarial risk.
+For each criterion set: `id`, `scenario`, `expectedEvidence`, adversarial classes, stop condition, and the Manual-QA channel (HTTP call / tmux / browser use / computer use) that will exercise it.
+Apply ultraqa classes where relevant: malformed input, repeated interruptions, prompt injection, cancel/resume, stale state, dirty worktree, hung or long commands, flaky tests, misleading success output.
+Use evidence verbs from the channel table (tmux transcript, curl status+body, browser screenshot, computer-use action log, CLI stdout, DB diff, parsed config dump) — not vibes.
+"Tests pass" is supporting signal, NEVER completion proof. Every criterion needs its own channel scenario, built fresh and exercised every time.
+Record manual QA notes when behavior is user-visible.
+Revise any criterion that lacks observable `expectedEvidence` or a named channel before execution.
+### 3. Inspect state
+Call `goal_status` (or `hermes lithermes goal status`).
+Read pending goals, criteria IDs, current ledger head, blockers, and the aggregate objective.
+## Execution Loop
+Loop per goal. Cap at 5 cycles per goal. Cap repeated failures on the same criterion at 3.
+### Acquire Next Goal
+1. Call `goal_status` (or `hermes lithermes goal status`) and read the active objective, including its criteria.
+2. Inspect the bound native `/goal` snapshot injected each turn by the `pre_llm_call` hook.
+3. Apply this table exactly:
+| Active goal state | Action |
+|-----------------|--------|
+| no active goal | Call `goal_set` with the brief payload. |
+| same aggregate objective active | Continue the current litgoal story. |
+| different objective active | STOP. Checkpoint blocked and surface the conflict. |
+4. If retrying failed work, re-target the failed criterion and rerun its cycle (a fresh `green` + `scenario` is required).
+5. Never call `goal_set` a second time for the same aggregate objective; layer criteria onto the existing one with `goal_add_criterion`.
+### Per-Criterion Cycle
+1. PLAN: read `criterion.scenario`, `criterion.expectedEvidence`, prior ledger entries, and safety bounds.
+2. Register atomic todos: `path: <action> for <criterion> - verify by <check>`.
+3. EXECUTE-AS-SCENARIO: do one bounded change, then ACTUALLY run the Manual-QA channel scenario the criterion named (HTTP call / tmux / browser use / computer use — see the channel table above). A green unit suite is NEVER a substitute for running the channel scenario.
+4. CAPTURE: collect the observable artifact path: transcript, stdout, screenshot, assertion, status+body, diff, or parsed dump.
+5. CLEAN (PAIRED, NEVER SKIP): tear down every runtime artifact step 3 spawned BEFORE recording — server PIDs (`kill`, verify `kill -0` fails), `tmux` sessions (`tmux kill-session -t lit-qa-<criterion>`; confirm `tmux ls`), browser / Playwright contexts (`.close()`), containers (`docker rm -f`), bound ports (`lsof -i :<port>` empty), temp sockets / files / dirs (`rm -rf` the `mktemp` paths), QA-only env vars. Embed a one-line cleanup receipt in the evidence string, e.g. `cleanup: killed 12345; tmux kill-session lit-qa-foo; rm -rf /tmp/lit.aB12cD`. Missing receipt → record BLOCKED, not PASS.
+6. RECORD exactly one result with `goal_evidence` + `goal_criterion_status` (tool form), or the CLI form:
+   - PASS: `goal_evidence {criterion_id, kind: "green"|"scenario", ref, detail}` then `goal_criterion_status {criterion_id, status: "pass"}` — evidence detail MUST include the cleanup receipt. CLI: `hermes lithermes goal evidence <cid> --kind scenario --ref "<observable> | <cleanup receipt>"; hermes lithermes goal criterion-status <cid> pass`
+   - FAIL: `goal_evidence {criterion_id, kind: "scenario", ref, detail: "<observable> | <cleanup receipt>"}` + `goal_criterion_status {criterion_id, status: "fail"}` with a diagnosis note. CLI: `hermes lithermes goal criterion-status <cid> fail`
+   - BLOCKED: `goal_evidence {criterion_id, kind: "scenario", ref, detail: "<observable>"}` + `goal_criterion_status {criterion_id, status: "blocked"}` noting the safety/blocker/leftover-state. CLI: `hermes lithermes goal criterion-status <cid> blocked`
+7. If actual does not match expected, diagnose, fix minimally, and rerun the SAME criterion (including a fresh cleanup).
+8. After 3 same-criterion failures, exit the goal with diagnosis.
+9. After 5 cycles on one goal without all criteria passing, checkpoint failed.
+10. Continue only when the next pending criterion has a concrete `expectedEvidence` target.
+### Goal Completion
+1. Confirm every criterion is `pass` with `goal_status` (or `hermes lithermes goal status`).
+2. Re-read the native `/goal` snapshot injected this turn for a fresh view.
+3. Checkpoint: `goal_checkpoint {summary: "<criteria evidence summary>", active_criterion}` (CLI: `hermes lithermes goal checkpoint "<summary>"`).
+4. If blocked or failed, record the blocker with `goal_steer`/`hermes lithermes goal blocker "<detail>"` and include diagnosis evidence before checkpointing.
+5. If this is the final goal, run the final quality gate first, then call `goal_complete` (the gate is enforced — see below).
+## Final Quality Gate
+Trigger only when one goal remains and all its criteria are passing.
+1. Run targeted verification for changed behavior.
+2. Run the `ai-slop-remover` skill on changed files. If no relevant edits exist, record a passed no-op cleaner report.
+3. Rerun verification after cleanup.
+4. Run the `review-work` skill (it fans out the review lanes via `delegate_task`).
+5. Clean review means every lane returns `verdict == "PASS"` (goal, qa, code-quality, security, context).
+6. If review is non-clean, record the blockers: `hermes lithermes goal blocker "<review findings>"` (or `goal_steer` with the review evidence). `goal_complete` stays refused while any `review_blocker` is unresolved.
+7. If clean, capture the gate summary as a checkpoint, then call `goal_complete`:
+```sh
+hermes lithermes goal checkpoint "<e2e evidence + manual QA notes>"
+hermes lithermes goal complete   # or the goal_complete tool
+```
+Gate summary to record in the checkpoint detail:
+```json
+{
+  "aiSlopRemover": { "status": "passed", "evidence": "cleaner report" },
+  "verification": { "status": "passed", "commands": ["npm test"], "evidence": "post-cleaner verification" },
+  "review": { "lanes": ["goal", "qa", "code-quality", "security", "context"], "allPass": true, "evidence": "review synthesis" },
+  "criteriaCoverage": { "totalCriteria": N, "passCount": N, "adversarialClassesCovered": ["malformed_input", "..."] }
+}
+```
+## Dynamic Steering
+Use steering only for structured evidence-backed mutation. Reject natural-language steering requests.
+| Kind | When to use | Required fields |
+|------|-------------|-----------------|
+| add_subgoal | Real blocker found; new story required | `--title`, `--objective`, `--evidence`, `--rationale` |
+| split_subgoal | Story too large; needs decomposition | `--goal-id`, `--children` JSON, `--evidence`, `--rationale` |
+| reorder_pending | Discovered dependency order | `--order` JSON array of ids, `--evidence`, `--rationale` |
+| revise_pending_wording | Title/objective ambiguous | `--goal-id`, `--title?`, `--objective?`, `--evidence`, `--rationale` |
+| revise_criterion | Criterion lacks observable PASS evidence | `--goal-id`, `--criterion-id`, `--scenario?`, `--expected-evidence?`, `--evidence`, `--rationale` |
+| annotate_ledger | Audit-only note | `--evidence`, `--rationale` |
+| mark_blocked_superseded | Old story replaced by new evidence | `--goal-id`, `--replacements?`, `--evidence`, `--rationale` |
+Tool form: `goal_steer {kind, <kind-specific-fields>, evidence, rationale}`.
+CLI form: `hermes lithermes goal steer --kind <kind> [<kind-specific-fields>] --evidence "<...>" --rationale "<...>"`.
+Structured prompt directives accepted: `LITHERMES_GOAL_STEER: { ... }`, `lithermes.goal.steer: {...}`, `hermes lithermes goal steer: {...}`.
+## Constraints
+1. NEVER mutate the bound native objective mid-aggregate; only finalize with `goal_complete` after the quality gate passes.
+2. NEVER call `goal_set` a second time for the same aggregate objective; layer with `goal_add_criterion` instead. If a different objective is active, stop and surface the conflict.
+3. NEVER set `criterion.status == "pass"` without captured observable evidence recorded via `goal_evidence`.
+4. NEVER bypass the criteria gate: `goal_complete` is refused until every criterion is `pass` with both a `green` and a `scenario` evidence entry.
+5. Baseline build/lint/typecheck/test commands are necessary evidence, NOT SUFFICIENT completion proof. Criteria coverage with observable evidence is the gate.
+6. Treat `.hermes/lithermes/litgoal/ledger.jsonl` as the durable audit trail; checkpoint after every success or failure.
+7. Keep one native objective for the whole aggregate; litgoal layers durable criteria/evidence on top — do not duplicate it per story.
+8. Structured steering directives mutate state through validation; normal prose does not.
+9. Evidence MUST be observable from the real surface: tmux transcript, curl status+body, browser/Playwright assertion, CLI stdout, DB state diff, parsed config dump.
+10. Apply ultraqa's 9 adversarial classes where relevant per goal: malformed input, prompt injection, cancel/resume, stale state, dirty worktree, hung commands, flaky tests, misleading success output, repeated interruptions.
+11. After completing an aggregate litgoal run, clear the bound objective with `/goal clear` before starting another in the same session.
+12. The CLI and goal tools write durable state under `.hermes/lithermes/litgoal/`; the `pre_llm_call` hook injects the active goal + unmet criteria + gate status into every turn.
+13. NEVER set `pass` while a QA-spawned process, `tmux` session, browser context, bound port, container, or temp file / dir is still alive. The evidence detail MUST include the cleanup receipt. Leftover runtime state = BLOCKED, not PASS.
+## Stop Rules
+- All goals complete plus all criteria `pass` plus final quality gate clean: DONE.
+- 3x same criterion failure: checkpoint failed, surface diagnosis.
+- 5 cycles on one goal without all-pass: checkpoint failed, surface.
+- Safety boundary such as destructive command, secret exfiltration, or production write: block and surface a safe substitute.
+- The bound native objective reports a different active goal: checkpoint blocker, stop, surface.
+- Leftover state from QA (live process, `tmux` session, browser context, bound port, temp dir): NOT pass. Clean up, append the receipt, then continue.
+- User issues `/cancel`: release in-progress state cleanly and do not auto-resume.
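The evidence-gated completion rule stated at the top of this skill (every criterion `pass`, each carrying both a `green` and a `scenario` evidence entry, and no unresolved `review_blocker`) can be expressed as a pure check over one goal record. The following is a minimal Python sketch under stated assumptions, not the LitHermes implementation: the function name `gate_failures` is hypothetical, the keys `criteria`, `status`, `evidence`, and `review_blockers` follow the `goals.json` description in the skill, and the `kind`, `id`, and `resolved` keys inside evidence and blocker entries are guesses at a shape the diff does not spell out.

```python
from typing import Any, Dict, List

# The two evidence kinds the skill says every criterion must carry.
REQUIRED_KINDS = ("green", "scenario")


def gate_failures(goal: Dict[str, Any]) -> List[str]:
    """Return the reasons completion would be refused; an empty list means the gate passes."""
    reasons: List[str] = []
    for crit in goal.get("criteria", []):
        cid = crit.get("id", "<unknown>")
        if crit.get("status") != "pass":
            reasons.append(f"{cid}: status is {crit.get('status')!r}, not 'pass'")
        # Assumption: each evidence entry is a dict with a "kind" key.
        kinds = {ev.get("kind") for ev in crit.get("evidence", [])}
        for kind in REQUIRED_KINDS:
            if kind not in kinds:
                reasons.append(f"{cid}: missing {kind} evidence")
    # Assumption: blockers carry a boolean "resolved" flag.
    for blocker in goal.get("review_blockers", []):
        if not blocker.get("resolved", False):
            reasons.append(f"unresolved review blocker: {blocker.get('id', '<unknown>')}")
    return reasons


example_goal = {
    "criteria": [
        {
            "id": "C001",
            "status": "pass",
            "evidence": [{"kind": "green", "ref": "tests/test_a.py::test_ok"}],
        }
    ],
    "review_blockers": [],
}
print(gate_failures(example_goal))  # ['C001: missing scenario evidence']
```

Returning a list of reasons rather than a boolean mirrors the skill's intent that the agent always sees what is still owed before `goal_complete` can succeed.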