npm - @uzysjung/agent-harness - Versions diffs - 26.86.0 → 26.88.0 - Mend

@uzysjung/agent-harness 26.86.0 → 26.88.0

Files changed (15)

package/README.ko.md +2 -2
package/README.md +13 -0
package/dist/{chunk-EKLV22W3.js → chunk-QHYH6P32.js} +73 -1
package/dist/chunk-QHYH6P32.js.map +1 -0
package/dist/index.js +155 -11
package/dist/index.js.map +1 -1
package/dist/trust-tier-drift.js +1 -1
package/package.json +1 -1
package/templates/skills/asis-tobe-decision/SKILL.md +161 -0
package/templates/skills/compaction-handoff/SKILL.md +178 -0
package/templates/skills/gap-analysis-e2e/SKILL.md +241 -0
package/templates/skills/multi-persona-review/SKILL.md +211 -0
package/templates/skills/northstar-roadmap/SKILL.md +176 -0
package/templates/skills/ultracode-service-audit/SKILL.md +224 -0
package/dist/chunk-EKLV22W3.js.map +0 -1

package/templates/skills/multi-persona-review/SKILL.md ADDED

@@ -0,0 +1,211 @@
+---
+name: multi-persona-review
+description: >-
+  A panel-review skill that critiques ONE artifact (launch post, README, doc, markdown, plan,
+  design) via 3-5 disjoint user-perspective personas running in parallel, then synthesizes deduped,
+  severity-ranked improvement points (P0/P1/P2). Use when the user says "작성글을 사용자 관점의
+  페르소나를 여러명 만들어서 (손넷 모델정도로) 피드백 받아바", "다면 리뷰 해볼까", "페르소나로 리뷰",
+  "여러 관점으로 피드백", or in English "multi-persona review", "review this from different user
+  perspectives", "get persona feedback on this post/README/doc", "panel review this artifact".
+  Lighter than a full service audit — point it at ONE artifact, not a whole codebase. NOT for a
+  whole-codebase multi-dimension audit (use ultracode-service-audit) or a single-axis
+  gap-vs-benchmark loop (use gap-analysis-e2e).
+---
+# Multi-Persona Review (다면페르소나 워크플로우 리뷰)
+Run a small panel of realistic target-user personas over one artifact, independently and in
+parallel, then synthesize their findings into a deduped, prioritized fix list. This is how the
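+user actually works: "작성글을 사용자 관점의 페르소나를 여러명 만들어서 손넷 모델정도로 피드백 받아바"
+("build several user-perspective personas for this draft and get feedback at about Sonnet-model
+tier") and "이부분도 다면 리뷰 해볼까?" ("shall we run a multi-perspective review on this part
+too?"). In practice that means 4-5 Sonnet-tier personas across 1-2 passes over a launch post,
+yielding P0-P2 prioritized fixes.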
+## When to use
+- A draft is "done" but you want blind spots an author is fatigue-blind to: launch post, README,
+  PRD/plan, doc, marketing copy, a design.
+- The user names personas or "다면 리뷰" / "여러 관점" / "multi-persona" / "panel review".
+- You want **reproducible, severity-ranked** feedback, not one reviewer's gut reaction.
+Do **not** use this for whole-codebase quality work — that's `ultracode-service-audit`. This skill
+is deliberately lighter: one artifact, one panel, one synthesis. For surfacing missing user
+journeys end-to-end, this feeds the UX lens of `gap-analysis-e2e`.
+## Why a panel beats one reviewer (the evidence)
+The whole method rests on one empirical fact: **independent reviewers find largely
+non-overlapping problems.**
+- **Heuristic Evaluation (Nielsen & Molich) + the 3-5 evaluator rule** — a single evaluator
+  catches only ~35% of usability issues; aggregating independent evaluators raises coverage to
+  roughly 75% at five, with sharp diminishing returns beyond. The value comes from *low overlap between
+  perspectives*, not any one reviewer being thorough. Some of the hardest issues are found by an
+  evaluator who otherwise finds few. Each judges against the *same explicit checklist* so reviews
+  stay comparable and dedupable.
+  https://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/theory-heuristic-evaluations/
+- **Panel of LLM evaluators (PoLL)** — a panel of several smaller, *disjoint* judges beats one
+  large judge, shows less self-preference bias, and costs ~7x less. This is the cost-tier reason
+  the user runs the persona panel at Sonnet tier and reserves the main model for orchestration and
+  synthesis. https://arxiv.org/abs/2404.18796
+- **"Nine Judges, Two Effective Votes"** — panels help *only to the extent members fail
+  independently*. A 9-judge panel carried only ~2 independent votes' worth of information because
+  the models made the same mistakes on the same items. The bottleneck is **correlated reviewers,
+  not panel size or aggregation math** — so persona design must maximize genuine viewpoint
+  diversity, not nominal count. https://arxiv.org/abs/2605.29800
+- **LLM-as-persona-reviewer vs human experts (GPT-4o study)** — persona review finds many real
+  issues but also emits false positives humans wouldn't flag, and misses issues needing embodied
+  experience. Recommended posture: a **hybrid** where personas generate candidate findings that a
+  human validates — never a replacement for human judgment. https://arxiv.org/pdf/2506.16345
+- **RICE prioritization (Intercom)** — (Reach × Impact × Confidence) / Effort turns rough guesses
+  into one comparable score, down-weighting low-confidence/high-effort items and countering the
+  reviewer's bias toward what they'd personally use. A lightweight analog gives a *defensible,
+  reproducible* map from findings to P0/P1/P2.
+  https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/
+## Core workflow
+### 1. Frame the artifact (orchestrator, main model)
+Capture three things the personas will all share:
+- **Goal** — what is this artifact trying to achieve? (e.g. "get a developer to `npx` install in
+  under 2 minutes and star the repo")
+- **Audience** — who is the real target reader?
+- **Rubric** — the shared checklist every persona scores against, so findings are comparable and
+  dedupable. Default rubric (adapt to the artifact): *clarity of value prop · first-action
+  friction · credibility/trust signals · scannability · accuracy/honesty · accessibility ·
+  call-to-action*. Without a shared rubric, red-team reviews decay into proofreading and generic
+  opinions, and findings stop being comparable across personas.
+### 2. Design 3-5 genuinely disjoint personas
+Cap the panel at five — coverage flattens beyond that, and extra personas mostly inflate tokens
+and false confidence (the "Nine Judges" trap). Engineer **diversity, not count**: pick personas
+with disjoint goals, contexts, and *failure-fears* so their blind spots don't correlate. A strong
+default spread:
+| Persona | Lens / what they fear |
+|---|---|
+| Skeptical newcomer | Doesn't know the domain; fears wasting time on hype. Tests "do I get it in 10s?" |
+| Time-pressured expert | Knows the domain; fears fluff between them and the command. Tests scannability + first action. |
+| Accessibility-dependent user | Screen reader / low vision / non-native reader. Tests structure, alt text, plain language. |
+| Hostile/adversarial reader | Looks for overclaims, vague benefits, anything to dismiss. Tests honesty + credibility. |
+| Adjacent-tool migrant *(optional 5th)* | Already uses a competitor. Tests differentiation + "why switch?". |
+Swap personas to fit the artifact (e.g. for a PRD: implementing engineer, on-call SRE, PM,
+security reviewer). The test is always: would these two personas make the *same* mistake? If yes,
+they're not independent — replace one.
+### 3. Review in parallel, independently (Sonnet-tier panel)
+Spawn one sub-agent per persona via the **Task tool** (or the harness's sub-agent mechanism). Each
+one gets the artifact + goal + audience + the *same* rubric, and **must not see the other personas'
+output** — independence is the precondition that makes aggregation add information. Anchoring on a
+peer collapses the panel toward one effective vote.
+Prefer pinning the persona sub-agents to a cheaper tier (Sonnet) — see the cost-tier note. But this
+degrades gracefully: if the harness can't pin sub-agents to a specific model, just run the panel on
+the default sub-agent model and note in the step-6 coverage caveat that the panel ran at the
+orchestrator tier. The tier is an economy, not a hard prerequisite.
+Each persona returns findings as **strengths / weaknesses / specific recommendations**. Require
+every finding to be specific and actionable: **quote the offending passage and propose a concrete
+fix.** Ban vague "needs work" notes — that's the classic red-team failure mode (briefing +
+structured findings + independence are the load-bearing parts, not the critical attitude).
+https://loopio.com/blog/red-team-review/
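
The independence requirement in this step can be sketched in Python. This is a minimal illustration, not the harness's actual sub-agent API: `Brief`, `review_as`, and `run_panel` are hypothetical names, and `review_as` is a stub standing in for whatever call spawns one persona sub-agent. The point it shows is that every persona receives the same brief and nothing from its peers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    """Shared, read-only context handed to every persona (hypothetical shape)."""
    artifact: str
    goal: str
    audience: str
    rubric: tuple[str, ...]


def review_as(persona: str, brief: Brief) -> dict:
    """Stub for one sub-agent call. It sees only its own persona and the brief."""
    return {
        "persona": persona,
        "findings": [f"{persona}: checked {len(brief.rubric)} rubric items"],
    }


def run_panel(personas: list[str], brief: Brief) -> list[dict]:
    """Run every persona in parallel; no reviewer receives another's output."""
    if not 3 <= len(personas) <= 5:
        raise ValueError("panel size should be 3-5 personas")
    with ThreadPoolExecutor(max_workers=len(personas)) as pool:
        # pool.map preserves input order, so results line up with `personas`.
        return list(pool.map(lambda p: review_as(p, brief), personas))
```

Because `Brief` is frozen and results are only collected after all reviews return, there is no channel through which one persona could anchor on another.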
+### 4. Synthesize: dedupe, but preserve minority findings (orchestrator, main model)
+Collapse overlapping findings into one entry, noting *how many personas raised it* (frequency is a
+prioritization signal). **But never drop a single-persona finding** — heuristic-evaluation data
+says the hardest, most valuable issues are often raised by only one reviewer. Majority-vote /
+consensus filtering would silently discard exactly those. Keep them, tagged as single-source.
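
A small sketch of dedupe-without-dropping, under one stated assumption: the orchestrator has already assigned a shared `issue_key` to findings it judges to be the same problem (that judgment is the hard part and is not automated here). The function name `synthesize` and the tuple layout are illustrative, not part of the skill.

```python
from collections import defaultdict


def synthesize(raw_findings):
    """Merge findings that share an issue key; keep every key, even single-source ones.

    raw_findings: iterable of (persona, issue_key, note) tuples.
    """
    grouped = defaultdict(lambda: {"personas": [], "notes": []})
    for persona, issue_key, note in raw_findings:
        entry = grouped[issue_key]
        if persona not in entry["personas"]:
            entry["personas"].append(persona)
        entry["notes"].append(note)

    return [
        {
            "issue": issue_key,
            "personas": entry["personas"],
            "frequency": len(entry["personas"]),
            # Tagged, never filtered: single-source findings stay in the list.
            "single_source": len(entry["personas"]) == 1,
            "notes": entry["notes"],
        }
        for issue_key, entry in grouped.items()
    ]
```

Note that there is no frequency threshold anywhere in the function; a majority-vote filter would be the one line that silently discards the minority findings.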
+### 5. Prioritize with a transparent rule → P0/P1/P2
+Map each finding to a bucket with a **reproducible** rule, not by gut feel or by which persona
+phrased it loudest. Use a RICE-style or **severity × frequency** score:
+- **P0** — blocks the artifact's goal for many readers (e.g. value prop unreadable in first
+  screen; a false claim). High impact × high confidence, any effort.
+- **P1** — meaningfully hurts conversion/trust but has a workaround.
+- **P2** — polish, edge-reader, or low-confidence/high-effort items.
+Show the score inputs so the ranking is auditable.
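
One possible reproducible rule, sketched in Python. The weights and the severity-to-bucket mapping below are assumptions chosen for illustration (the skill does not fix numeric thresholds); they happen to reproduce the buckets in the worked example's table. Severity decides the bucket and severity × frequency orders findings within a bucket.

```python
# Hypothetical weights and mapping; adjust per artifact and record the choice.
SEVERITY_WEIGHT = {"high": 3, "med": 2, "low": 1}
BUCKET_BY_SEVERITY = {"high": "P0", "med": "P1", "low": "P2"}


def prioritize(findings):
    """Attach an auditable score and bucket to each finding, then sort.

    findings: list of dicts with "id", "severity" ("high"/"med"/"low"), "frequency".
    """
    ranked = []
    for finding in findings:
        score = SEVERITY_WEIGHT[finding["severity"]] * finding["frequency"]
        bucket = BUCKET_BY_SEVERITY[finding["severity"]]
        ranked.append({**finding, "score": score, "bucket": bucket})
    # Buckets sort as P0 < P1 < P2; higher score first inside a bucket.
    ranked.sort(key=lambda f: (f["bucket"], -f["score"]))
    return ranked
```

Keeping `score` and `severity` on each returned row is what makes the ranking auditable: a reader can recompute any bucket from the inputs shown.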
+### 6. Triage as candidates, state coverage honestly
+Present the list as **candidate findings needing a validation pass**, not gospel. Flag likely
+false positives and note where real-user confirmation is warranted before committing fixes — LLM
+personas both miss embodied issues and invent non-issues. End with an honest coverage caveat: a
+panel never finds every issue and offers no systematic fix generation (Nielsen's own caveat).
+Claiming exhaustiveness here would be a no-false-ship violation.
+**Second pass (the "1-2 passes"):** run the same panel again *after fixes land* to confirm the P0s
+are actually closed and that the edits didn't introduce new issues. One pass to find, one to verify
+— a third rarely pays off.
+## Worked example (Input → Output)
+**Input:** Trigger — "이 런치 포스트 다면 리뷰 해볼까? 손넷으로 페르소나 4명." Artifact: a launch
+post for an npm installer CLI. Goal: "reader runs `npx ... init` and stars the repo." Audience:
+indie devs scanning a feed.
+**Panel (parallel, Sonnet tier):** skeptical newcomer · time-pressured expert ·
+accessibility-dependent reader · hostile reader.
+**Raw findings (excerpt):**
+- Newcomer: "Paragraph 1 says 'context-engineered harness' — I don't know what that buys me.
+  Quote: *'A context-engineered harness for agentic CLIs.'* Fix: lead with the outcome — *'Install
+  vetted plugins, skills, and rules across 4 AI CLIs in one command.'*"
+- Expert: "The install command is below three paragraphs of philosophy. Fix: move `npx` line to
+  the first screen." *(also raised by newcomer → frequency 2)*
+- Accessibility: "Demo is a GIF with no text fallback; the actual command only appears in the GIF.
+  Fix: put the command in a code block as text."
+- Hostile: "'Works everywhere' — claims 4 CLIs but only shows Claude. Fix: either show all four or
+  soften to 'Claude today, others in progress.'" *(single-source, kept)*
+**Synthesized + prioritized output:**
+| ID | Finding (deduped) | Personas | Sev × Freq | Bucket |
+|---|---|---|---|---|
+| F1 | Install command buried below the fold / inside GIF only | expert, newcomer, a11y | high × 3 | **P0** |
+| F2 | Value prop is jargon, not outcome, in first screen | newcomer | high × 1 | **P0** |
+| F3 | "Works everywhere" overclaims vs. evidence shown | hostile | med × 1 | **P1** |
+| F4 | Demo GIF has no text alternative | a11y | med × 1 | **P1** |
+**Caveat returned to user:** candidate findings from a 4-persona Sonnet panel; F3 (overclaim) is
+worth confirming against what the post can actually demo before rewording. Not exhaustive — a real
+indie-dev read may surface more.
+This mirrors the user's real run (memory: `persona-feedback-improvements`, P0-before-publish gate).
+## Cost-tier note
+Run the **persona panel at a cheaper tier (Sonnet)** — PoLL shows a disjoint panel of smaller
+judges beats one big judge at a fraction of the cost. Reserve the **main/orchestrator model** for
+framing the rubric and synthesizing (steps 1, 4-6), where reasoning quality pays off most.
+## Pitfalls to avoid
+- **False diversity** — personas that share the model's default assumptions give far fewer than N
+  views. Design for disjoint fears; if two would make the same mistake, replace one.
+- **Scaling count to fix quality** — past ~5 personas you mostly buy tokens and noise. Fix
+  independence, not size.
+- **Consensus filtering** — dropping single-persona findings discards the rare, hard issues that
+  are the whole point.
+- **Anchoring** — letting personas see each other's output before judging collapses the panel.
+- **Opaque P0/P1/P2** — ranking by vibe or loudest wording is unauditable. Show the score.
+- **Over-claiming coverage** — report it as candidate findings, never "found everything."
+## Cross-references
+- `ultracode-service-audit` — full multi-dimensional audit of a whole service/codebase; this skill
+  is the lighter, single-artifact UX lens.
+- `gap-analysis-e2e` — this skill feeds its UX/user-journey lens.
+- `critique` — design-specific persona critique with anti-pattern detection; reach for it when the
+  artifact is a UI rather than prose/markdown.
+> This SKILL.md is complete and self-contained — everything needed to run a panel is above. If the
+> method ever needs deeper appendices (full default rubrics per artifact type, persona prompt
+> templates, a RICE scoring worksheet), a `reference/` file alongside this SKILL.md is the place to
+> add them. That's a future-extension option, not a missing dependency.

package/templates/skills/northstar-roadmap/SKILL.md ADDED

@@ -0,0 +1,176 @@
+---
+name: northstar-roadmap
+description: >-
+  Read the project's NORTH_STAR / vision doc, measure current state against the goal, then
+  propose a forward direction plus prioritized feature proposals — persisted as a durable
+  roadmap in docs/plans + memory so the plan survives /compact and new sessions. Use when
+  the user asks where the project should go next or wants a backlog grounded in the vision.
+  Fires on the user's real phrasings: "앞으로 어떤 방향으로 개선·발전시킬지 고민해봐",
+  "NORTH.md / NORTH_STAR 보고 나아갈 방향 + 기능 제안", "나아갈 방향 + 기능제안 (수용 → 계획 수립하고 메모리에 기록)",
+  "북극성 정렬 로드맵", as well as the English equivalents: "what direction should we take next",
+  "propose a roadmap / feature backlog from the north star", "plan the next milestones and save it
+  to memory". Not for detecting bugs or auditing current quality (see gap-analysis-e2e /
+  ultracode-service-audit) — this skill DIRECTS forward planning.
+---
+# North-Star Roadmap (북극성 정렬 로드맵 + 기능 제안)
+Turn a vision document into a forward direction and a ranked feature backlog, then write it
+somewhere durable. The point is alignment, not idea generation: every proposal must trace
+upward to the north-star, and the result must outlive the conversation that produced it.
+## When to use
+Reach for this skill when the user steps back from day-to-day work and asks where the project
+should head — typically with one of these (their actual phrasings):
+- "앞으로 어떤 방향으로 개선·발전시킬지 고민해봐"
+- "NORTH.md / NORTH_STAR 보고 나아갈 방향 + 기능 제안"
+- "(제안) 수용 → 계획 수립하고 메모리에 기록"
+- English: "what direction next", "propose a roadmap from the north star", "save the plan to memory"
+Do **not** use it to find what's broken right now. Detecting defects, gaps, or quality regressions
+is the job of the sibling skills below; this skill consumes their findings and points forward.
+## Why these steps (the frameworks underneath)
+The workflow chains five established product-strategy methods so the output is defensible rather
+than vibes. Reason with each — don't just cite it:
+- **North Star Framework** (Amplitude) — a single North Star Metric is the destination; 3–5
+  directly-influenceable *Inputs* are the levers. You assess "current vs goal" against the inputs
+  (leading indicators teams can move), not lagging vanity numbers.
+  https://amplitude.com/books/north-star/about-north-star-framework
+- **Working Backwards / PR-FAQ** (Amazon) — for a major proposal, sketch the future end-state first
+  (a one-line "press release" of the value the user gets), then derive the features. This forces
+  clarity and stops "we can build X because we know how" reasoning.
+  https://workingbackwards.com/concepts/working-backwards-pr-faq-process/
+- **OKR lineage, not cascade** (Gothelf) — every roadmap item must have a *parent* it supports in
+  the north-star. Items invented bottom-up that don't ladder up get cut. This is the core alignment test.
+  https://jeffgothelf.com/blog/aligning-not-cascading-okrs-with-an-okr-lineage/
+- **RICE / ICE scoring** (Intercom) — rank proposals by `(Reach × Impact × Confidence) / Effort`
+  (RICE), or `Impact × Confidence × Ease` (ICE) when data is thin. Confidence is where you honestly
+  discount exciting-but-unproven ideas. Scores are *inputs to a decision, not the verdict* — log
+  every strategic override.
+  https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/ ·
+  https://agileseekers.com/blog/feature-prioritization-using-rice-and-ice-models-in-product-roadmaps
+- **Theme-based Now / Next / Later** — organize the roadmap by outcome themes and horizons, not
+  dated feature promises, so it ages gracefully and the why/what stays above the how/when.
+## Core workflow
+### 1. READ the north-star and restate it as Metric + Inputs
+Read the project's vision doc (here: `docs/NORTH_STAR.md` — it already defines the North Star
+Statement, the NSM, and measured Inputs). Restate the goal as **one North Star Metric + 3–5
+influenceable Inputs**. If the doc already has them, lift them; if it only has a prose vision,
+derive a candidate set and show it for confirmation.
+Sanity-check the metric before trusting it:
+- Is it a **leading** indicator of value, or a lagging one (raw revenue, registered users, page
+  views)? Lagging metrics are "what's done is done" — you can't steer by them.
+- Is it **gameable**? "If you can move it directly without delivering value, it's not a good
+  north-star." Flag it instead of silently planning against a corrupt target.
+> In this repo the literal north-star is **GitHub stars** (per memory and the service-audit
+> roadmap), with the NORTH_STAR NSM (HITO ≤ 3/feature, low re-clarification) as the *value* the
+> stars are supposed to reward. Plan toward stars **via** the value inputs, not by gaming the count.
+### 2. ASSESS current state against each Input — expose the gap
+For each Input, state where the project is today vs target, using real evidence (existing plans,
+audit output, metrics, code state). The deliverable is the **gap**: the distance between now and
+the north-star, per lever. Be honest about unknowns — an unmeasured input is a gap too.
+### 3. PROPOSE direction + features by working backwards
+First name the **forward direction** in a sentence or two — the theme(s) that close the biggest
+gaps. Then, for each significant proposal:
+- Write a one-line mini-PR (the future state: who gets what value once it ships).
+- List the concrete feature(s) that realize it.
+- State its **parent** — which Input / north-star pillar it supports. **No parent → cut it.**
+  This is the alignment gate that prevents bottom-up feature churn.
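
The alignment gate can be expressed as a simple filter. This is a sketch with hypothetical names (`alignment_gate`, the `parent` field); in practice the parent link is a judgment the author states, and the code only enforces that one exists and names a real Input.

```python
def alignment_gate(proposals, inputs):
    """Split proposals into those with a valid parent Input and those to cut.

    proposals: list of dicts with "name" and an optional "parent".
    inputs: set of Input names taken from the north-star doc.
    """
    kept, cut = [], []
    for proposal in proposals:
        # A missing parent, or one that is not a known Input, fails the gate.
        if proposal.get("parent") in inputs:
            kept.append(proposal)
        else:
            cut.append(proposal)
    return kept, cut
```

Returning the cut list rather than discarding it keeps the decision visible, so a cut proposal can be revisited if an Input is later added.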
+### 4. PRIORITIZE with RICE (or ICE) and record overrides
+Score each proposal. Use RICE when you have reach/effort signal; ICE for a thin-data first pass.
+Write the numbers down so the ranking is auditable. Then apply judgment: dependencies, strategic
+table-stakes, and north-star fit may override a score — **log WHY** for every override (the same
+honesty the repo's `no-false-ship` and ADR/Decision-Log rules demand). Treat the score as a
+decision aid, never an autopilot.
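
A minimal sketch of scoring plus logged overrides, assuming each proposal carries rough Reach, Impact, Confidence, and Effort estimates. `Proposal`, `ice`, and `rank` are illustrative names; the only formulas used are the RICE and ICE definitions quoted earlier in this file.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    name: str
    reach: float
    impact: float
    confidence: float  # 0.0-1.0; discount unproven ideas here
    effort: float      # must be positive

    @property
    def rice(self) -> float:
        return self.reach * self.impact * self.confidence / self.effort


def ice(impact: float, confidence: float, ease: float) -> float:
    """Thin-data alternative: Impact x Confidence x Ease."""
    return impact * confidence * ease


def rank(proposals, overrides=()):
    """Order by RICE, then apply explicit overrides, each with a written reason.

    overrides: iterable of (name, new_index, reason) tuples.
    Returns (ordered proposals, override log).
    """
    order = sorted(proposals, key=lambda p: p.rice, reverse=True)
    log = []
    for name, new_index, reason in overrides:
        if not reason.strip():
            raise ValueError(f"override for {name} needs a written reason")
        item = next(p for p in order if p.name == name)
        order.remove(item)
        order.insert(new_index, item)
        log.append(f"{name} moved to position {new_index + 1}: {reason}")
    return order, log
```

Refusing an override without a reason is the code-level version of "log WHY": the score stays a decision aid, and every departure from it leaves a trace.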
+### 5. PERSIST as a Now / Next / Later roadmap in docs + memory
+Write the result to a durable artifact so it survives `/compact` and new sessions:
+- **Update the existing roadmap SSOT** under `docs/plans/` (here: `service-audit-roadmap.md`) in
+  place, organized by outcome themes in **Now / Next / Later** horizons — not dated commitments.
+  Prefer one living doc over spawning a dated parallel (e.g. `northstar-roadmap-2026-06.md`): two
+  roadmaps drift. Only create a separate dated horizon doc if the user explicitly asks for one.
+- A short anchor line in `memory/MEMORY.md` pointing at that doc, so it's re-referenced at session
+  start as a Persistent Anchor (per CLAUDE.md "SPEC/PRD는 매 세션 시작 시 재참조", i.e.
+  "re-reference the SPEC/PRD at the start of every session").
+- If a real architectural decision was made (e.g. dropping a direction), record an ADR under
+  `docs/decisions/`, numbered as the next free integer (check the highest existing `ADR-NNN`
+  first), per `change-management.md`.
+Then report back: the direction, the ranked table, and the file paths written. Mark it a **living
+document** — both the metric and the priorities are expected to be revised as strategy evolves.
+## Worked example
+**Input (user):** `/northstar-roadmap NORTH_STAR 보고 나아갈 방향 + 기능 제안 → 계획 세워서 메모리에 기록`
+**Output (abridged):**
+> **1. Metric + Inputs** (from `docs/NORTH_STAR.md`)
+> North Star Metric: GitHub stars as the adoption signal that the harness delivers real value.
+> Inputs: (a) HITO ≤ 3 / feature, (b) Promise = Implementation (advertised assets 100% work),
+> (c) first-run install success, (d) vetted+security-audited asset count, (e) 4-CLI parity.
+> *Leading-indicator check:* stars alone are gameable, so we plan against inputs (a)–(e), the value
+> the stars reward — not the raw count.
+>
+> **2. Gap** — (a) HITO unmeasured in fresh installs · (b) closed after no-false-ship work · (c)
+> install-matrix green but no post-publish telemetry · (e) plugins don't yet reach non-Claude CLIs.
+>
+> **3. Direction** — "Make the *first 60 seconds* prove the value before asking for a star."
+> Proposals: P1 first-run telemetry (parent: Input c); P2 plugin→codex/opencode equivalence
+> (parent: Input e); P3 public, continuously-tested compat+security matrix artifact (parent: Input d).
+> Each carries a one-line mini-PR.
+>
+> **4. RICE** (`R × I × C / E`): P1 (5 × 2 × 0.8 / 2 = 4.0) · P3 (4 × 3 × 0.7 / 3 = 2.8) · P2 (2 × 3 × 0.6 / 5 ≈ 0.7).
+> *Override:* P3 promoted above its score — it's the repo's defensibility wedge (table-stakes for
+> a 0-star cold start). Logged here and in the ADR.
+>
+> **5. Persisted** — updated the SSOT `docs/plans/service-audit-roadmap.md` in place (Now: P1 ·
+> Next: P3 · Later: P2), added the anchor to `memory/MEMORY.md`, opened the next free ADR
+> `docs/decisions/ADR-0NN-p3-override.md` for the strategic override.
+## Pitfalls to design against
+- **Vanity / lagging / gameable north-star** — the two failure modes to refuse: picking a metric
+  you can't influence (revenue, raw users) or one you can move without delivering value.
+- **False precision in RICE/ICE** — subjective Reach/Impact/Effort treated as exact truth. Confidence
+  exists to discount shaky estimates; skipping it yields authoritative-looking wrong rankings.
+- **Score on autopilot** — shipping the top-RICE item while ignoring dependencies or strategic fit.
+- **Dated feature-list roadmap** — timeline promises rot; outcome themes in Now/Next/Later age better.
+- **Bottom-up idea dump** — proposals that don't ladder up to an Input. The alignment gate (step 3)
+  is the cure.
+- **Plan that doesn't persist** — a great assessment that lives only in the chat and is lost at
+  `/compact`. The artifact in step 5 is the whole point.
+## Cross-references (siblings — do not duplicate)
+- **gap-analysis-e2e** — *detects* north-star gaps end-to-end. This skill consumes those gaps as the
+  evidence in step 2.
+- **ultracode-service-audit** — produces a multi-dimension audit and roadmap of *current* problems.
+  This skill takes that roadmap as input and points it forward.
+- **strategic-compact** / project ADR + plan-SSOT conventions — the persistence mechanism (step 5)
+  reuses them rather than reinventing.
+> Audit and gap skills answer "what's wrong now?". This skill answers "where do we go, and in what
+> order?" — and makes the answer durable.
+## Reference (progressive disclosure)
+This SKILL.md is the operating summary. If deeper method detail is ever needed — full RICE worked
+calculations, a PR-FAQ template, or a roadmap-doc skeleton — add a `reference.md` beside this file
+and link it here. Keep SKILL.md lean; the user dislikes verbose notepad docs.

package/templates/skills/ultracode-service-audit/SKILL.md ADDED

@@ -0,0 +1,224 @@
+---
+name: ultracode-service-audit
+description: >-
+  Run a multi-agent, adversarially-verified full-service audit across 7 dimensions
+  (code / UX / scalability / planning+north-star / security / promotion / extensible),
+  separating findings into confirmed / unverified / rejected and producing a
+  priority-ranked, M-numbered milestone roadmap (as many milestones as the findings warrant).
+  Use when the user says
+  "ultracode 전체 서비스 점검", "전체 서비스를 점검하자", "코드·UX·확장성·기획·북극성지표·보안·홍보 문제점을 파악하고 우선순위에 따라 개선",
+  "다차원 서비스 감사", or in English "audit the whole service / full multi-dimensional service audit /
+  find code, UX, scalability, planning, security, and marketing problems and prioritize fixes".
+  The heavyweight superset audit — orchestrate it as a Workflow with fan-out finders and an adversarial verify pass.
+  NOT for a single-artifact prose/README review (use multi-persona-review) or a single-axis
+  gap-vs-benchmark loop (use gap-analysis-e2e) — those are the lighter siblings.
+---
+# Ultracode Service Audit
+The heavyweight, multi-agent audit of an *entire* service across many dimensions at once.
+The fan-out is orchestrated as a **Workflow / multi-agent run, and it can be large** — the real
+run drove many agents in parallel, not a 7-agent minimum. "ultracode" implies that heavyweight
+parallelism: a finder (often several) per dimension plus a separate squad of verifiers. Where a
+single skill inspects one axis (UX, or code, or strategy), this one fans out finder agents per
+dimension, then runs a **separate adversarial verification pass** so that only findings that
+survive cross-examination are reported as real. The output is one priority-ranked roadmap where
+every item is dimension-tagged, evidence-graded, and traceable to the product's North Star.
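+This is the skill behind the user's real request (turn 94):
+> "ultracode 현재까지 개발한 내용을 기준으로 전체 서비스를 점검하자. 코드상 문제, UX 상 문제,
+> 확장성 문제, 기획 및 북극성지표, 보안상, 홍보상의 문제점을 파악하고 각각의 개선점 ...
+> 우선순위에 따라 개선하자"
+>
+> (English: "ultracode: let's audit the whole service based on what has been developed so far.
+> Identify the problems in code, UX, scalability, planning and the north-star metric, security,
+> and promotion, and the improvements for each ... then fix them in priority order.")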
+That run produced **확정 29 / 미검증 0 / 기각 8** — the confirmed/unverified/rejected split
+is not decoration, it is the whole point. The **미검증 0 was *that run's* outcome, not a
+guarantee the bucket goes unused**: the 미검증 bucket is load-bearing and stays in the report
+the moment any finding lands with no verifier votes. A finding nobody could verify never gets
+reported as fact.
+## When to use
+- The user wants a **whole-service health check**, not one narrow review — "전체 서비스를 점검하자",
+  "다차원 감사", "audit everything before launch / before we promote".
+- You have the **Workflow / ultracode multi-agent capability** available (this skill assumes you
+  can fan out independent agents and re-aggregate). Without it, fall back to running the
+  dimensions sequentially yourself — but step 3 forbids a finder from grading itself, so
+  sequential mode **cannot produce a true 확정**. In that mode, never emit 확정 verdicts: label
+  every finding 미검증 or evidence-backed-only (a failing test / exposed secret / reproduced
+  crash counts; an opinion does not), and state plainly in the report that no independent
+  verification ran. That keeps the no-false-ship invariant honest when fan-out is absent.
+- You need an output that is **prioritized and trustworthy** — every claim graded, weak claims
+  sunk in the ranking, nothing over-claimed.
+If the user only wants one axis, use the focused sibling instead (see Cross-references). This
+skill is the superset; don't reach for it when a scalpel will do.
+## The seven dimensions (extensible)
+Each dimension gets a **named, enumerated rubric** before any auditing starts, so findings are
+checked against explicit criteria rather than vibes. This is the discipline behind heuristic
+evaluation (Nielsen / NN/g): a violation of a named rule is a *candidate* defect, justified
+against context — not an automatic one.
+| # | Dimension | Rubric to hand the finder agent |
+|---|-----------|----------------------------------|
+| 1 | **Code** | correctness/logic, security (injection, authz, exposed secrets), readability, tests-that-fail-when-logic-breaks, design/architecture fit (SonarSource multi-axis review) |
+| 2 | **UX** | Nielsen's 10 usability heuristics; rate severity by impact, not by rule-match count |
+| 3 | **Scalability** | data-model limits, hot paths, statefulness, single points of failure, cost-per-unit growth |
+| 4 | **Planning + North Star** | the one North Star metric + its Inputs (Amplitude); does each finding move the metric or an Input? |
+| 5 | **Security** | secrets exposure, authz boundaries, dependency CVEs, input trust, data egress |
+| 6 | **Promotion / Marketing** | Working-Backwards: take the product's *implied PR/FAQ* (its promised value) and audit whether the built service + its messaging actually deliver — surface over-claim / false-ship gaps |
+| 7 | **+ Extensible** | add a dimension by giving it (a) a named rubric and (b) its own independent verifier. Nothing else changes |
+The framework set is load-bearing, not ornamental:
+- **Heuristic Evaluation / Nielsen's 10** — named criteria per dimension, independent evaluators,
+  severity by impact. <https://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/>
+- **Multi-axis code review** — distinct correctness/security/tests/design axes beat one
+  "looks good". <https://www.sonarsource.com/resources/library/code-review/>
+- **North Star Framework** — anchors dimension 4; filters strategically immaterial noise.
+  <https://amplitude.com/books/north-star/about-north-star-framework>
+- **Working Backwards (PR/FAQ)** — the inverse-test for promotion + planning: promised value vs
+  delivered. <https://workingbackwards.com/concepts/working-backwards-pr-faq-process/>
+## Core workflow
+Orchestrate this as a **Workflow**: fan-out → adversarial verify → synthesize.
+### 1. Scope and set North Star (pre-flight)
+Read the service's SPEC/PRD/NORTH_STAR and recent state. Name the North Star metric and its
+Inputs explicitly — they are the strategic anchor every finding will be tested against. If you
+can't state the North Star, stop and ask; auditing dimensions in isolation with no anchor just
+generates busywork.
+### 2. Fan-out: independent finder(s) per dimension
+Spawn at least one finder agent for each dimension with **its own rubric** (table above) — and
+spawn several per dimension where the surface is large. This is the heavyweight step: a real run
+fans out to **many agents in parallel**, not a fixed seven. Run them
+**independently** — NN/g's finding is that independent passes catch issues a single pass misses
+(3 evaluators ≈ 60% of issues; one agent per dimension is not enough on its own, which is why
+step 3 exists). Each finder returns candidate findings with: dimension tag, the rubric item
+violated, the evidence it actually observed, and a proposed severity.
+### 3. Adversarial verify pass (the load-bearing step)
+This is **a distinct second pass**, not the finders grading themselves. Re-order the reviewer
+agents and give them *diverse* prompts/roles, then task them with peer-reviewing every round-one
+assertion. This follows Multi-Agent Verification (MAV, whose BoN-MAV algorithm pairs best-of-n
+sampling with multiple verifiers): reliability scales at test time by running multiple
+independent verifiers and accepting only what survives their cross-checking.
+<https://arxiv.org/pdf/2502.20379>
+Decision rule per finding:
+- **Confirmed (확정)** — survives verification, or carries irrefutable evidence (a failing test,
+  an exposed secret, a reproduced crash). Verifier consensus, not one voice.
+- **Unverified (미검증)** — **0 adversarial-verify votes** and no hard evidence. Kept in a
+  separate bucket. **Never reported as fact.** This is the no-false-ship invariant
+  (`.claude/rules/no-false-ship.md`).
+- **Rejected (기각)** — majority of verifiers refute it (rubric-match without real defect, wrong
+  reasoning, already handled). A majority refutation *kills* the finding.
+> Engineer verifier diversity deliberately. MAV names the failure mode that breaks the whole
+> ensemble: **correlated-verifier collapse** — if every reviewer shares the same model, prompt,
+> and blind spot, the adversarial pass rubber-stamps wrong findings and hands you false
+> confidence. Vary roles, ordering, and prompts so verifiers don't share blind spots. If you
+> cannot achieve independence, say so in the report and downgrade your confidence accordingly.
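The decision rule above can be sketched as a small classifier. This is an illustration, not the skill's actual implementation: `Verdict`, `Votes`, and `classify` are hypothetical names. Two points are assumptions the text does not settle: hard evidence is checked first, and a split with no majority either way is labeled `CONTESTED` so it can be escalated under the loop cap in step 4.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "확정"
    UNVERIFIED = "미검증"
    REJECTED = "기각"
    CONTESTED = "contested"  # assumption: not one of the source's three verdicts


@dataclass(frozen=True)
class Votes:
    confirm: int = 0
    refute: int = 0


def classify(votes: Votes, hard_evidence: bool = False) -> Verdict:
    total = votes.confirm + votes.refute
    if hard_evidence:
        # Failing test, exposed secret, reproduced crash (assumed to win outright).
        return Verdict.CONFIRMED
    if total == 0:
        # No verifier votes and no hard evidence: never reported as fact.
        return Verdict.UNVERIFIED
    if votes.refute * 2 > total:
        # A majority refutation kills the finding.
        return Verdict.REJECTED
    if votes.confirm * 2 > total:
        # Verifier consensus, not one voice.
        return Verdict.CONFIRMED
    # Tie: assumed to go to the user rather than being forced into a bucket.
    return Verdict.CONTESTED
```

For example, `classify(Votes(confirm=2, refute=1))` yields `Verdict.CONFIRMED`, while `classify(Votes())` yields `Verdict.UNVERIFIED`.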
+### 4. Cap the loops
+Verification cost scales with verifier count and debate rounds. Set a hard ceiling on
+revision/debate iterations (this is a `gates-taxonomy` Revision gate — iteration cap mandatory)
+and **escalate to the user rather than loop forever** on a contested finding. Unbounded debate
+buys diminishing returns at runaway token cost.
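A hedged sketch of what a capped loop might look like. `MAX_ROUNDS`, `debate`, and the `"escalate-to-user"` sentinel are hypothetical; the source requires a ceiling but does not say what number to use.

```python
from typing import Callable, Optional

MAX_ROUNDS = 3  # assumed ceiling; the source mandates a cap, not this value


def debate(
    run_round: Callable[[int], Optional[str]], max_rounds: int = MAX_ROUNDS
) -> str:
    """Run at most max_rounds debate rounds, then escalate.

    run_round(n) returns a verdict string once verifiers settle the finding,
    or None while it is still contested.
    """
    for round_number in range(1, max_rounds + 1):
        verdict = run_round(round_number)
        if verdict is not None:
            return verdict
    # Cap reached: hand the contested finding to the user instead of looping.
    return "escalate-to-user"
```

The loop bound is fixed before the first round runs, so token cost stays bounded by `max_rounds` however contested the finding is.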
+### 5. Score surviving findings with RICE
+Rank confirmed (and any carried-forward unverified) findings by RICE score:
+**(Reach × Impact × Confidence) / Effort**, i.e. total impact per time worked.
+<https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/>
+The **Confidence multiplier is where the verification tier pays off** — map it directly:
+| Verdict | RICE Confidence |
+|---------|-----------------|
+| Confirmed + hard evidence | 100% |
+| Confirmed by verifier consensus | 80% |
+| Unverified (carried, not dropped) | 50% or lower |
+This makes weakly-evidenced findings **sink in the ranking automatically** — you carry them
+honestly instead of either pretending they're certain or silently deleting them. Rank by raw
+severity or gut feel and high-impact-but-unproven items jump the queue; the Confidence term
+exists precisely to stop that.
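To make the arithmetic concrete, here is a small sketch that applies the Confidence mapping from the table above. The function name, the dictionary keys, and the reach/impact/effort numbers are all hypothetical; 50% is used for unverified findings as the upper bound the table allows.

```python
# Confidence values taken from the verdict table; the keys are illustrative.
CONFIDENCE = {
    "confirmed_hard_evidence": 1.0,
    "confirmed_consensus": 0.8,
    "unverified": 0.5,  # "50% or lower": the upper bound is used here
}


def rice(reach: float, impact: float, verdict: str, effort: float) -> float:
    """RICE score = (Reach * Impact * Confidence) / Effort."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * CONFIDENCE[verdict] / effort


# Hypothetical inputs: the marketing gap has the larger raw reach * impact,
# but its unverified status halves the score.
code_bug = rice(reach=100, impact=2, verdict="confirmed_hard_evidence", effort=2)
marketing_gap = rice(reach=150, impact=2, verdict="unverified", effort=2)
print(code_bug, marketing_gap)  # 100.0 75.0
```

With these made-up numbers the confirmed bug ranks first even though the unverified item would win on raw reach times impact, which is the sinking effect described above.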
+### 6. Synthesize the M-numbered roadmap
+Cluster surviving findings (affinity-style), trace each to the North Star or an Input (demote the
+strategically immaterial), and emit a milestone roadmap with **as many milestones as the
+findings warrant** — M1, M2, … however far the work runs (the real run landed at M4). There is
+no fixed milestone count; severity and clustering decide it. Every roadmap item carries:
+**dimension tag · verdict · RICE score · North-Star linkage · evidence pointer.** Items that
+don't move the metric or an Input are flagged as nice-to-have, not milestone-blocking.
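One possible shape for a roadmap item, sketched under assumptions: `RoadmapItem` and `split_roadmap` are hypothetical names, and assigning items to M1, M2, and so on is deliberately left out because the text says severity and clustering decide it. The sketch covers only the five required fields and the nice-to-have split.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoadmapItem:
    title: str
    dimension: str                  # dimension tag
    verdict: str                    # e.g. "확정" or "미검증"
    rice: float                     # RICE score
    north_star_link: Optional[str]  # metric or Input it moves, None if neither
    evidence: str                   # evidence pointer


def split_roadmap(
    items: list[RoadmapItem],
) -> tuple[list[RoadmapItem], list[RoadmapItem]]:
    """Return (milestone candidates ranked by RICE, nice-to-have items)."""
    linked = [item for item in items if item.north_star_link]
    nice_to_have = [item for item in items if not item.north_star_link]
    candidates = sorted(linked, key=lambda item: item.rice, reverse=True)
    return candidates, nice_to_have
```

Keeping `north_star_link` as an explicit field means an item with no linkage cannot quietly become milestone-blocking; it falls into the second list.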
+## No-false-ship evidence matrix
+Because this audit *itself* can over-claim, report each dimension's verification the same way the
+repo's `no-false-ship` rule demands for shipped features — per-path evidence, unverified shown as
+unverified, never one path's evidence reused for another:
+```
+| Dimension   | Finder evidence            | Verifier outcome        | Verdict   |
+|-------------|----------------------------|-------------------------|-----------|
+| Code        | failing test repro'd       | 3/3 verifiers confirm   | 확정      |
+| UX          | heuristic #4 violation     | 2/3 confirm, context ok | 확정      |
+| Security    | suspected authz gap        | 0 verifier votes        | 미검증    |
+| Promotion   | README claim vs built      | majority refute         | 기각      |
+```
+A row with no verifier votes stays "미검증" in the final report. Hiding it and declaring "audit
+complete" is exactly the false-ship failure this skill exists to prevent.
+## Worked example (Input → Output)
+**Input** (user, verbatim trigger):
+> "ultracode 전체 서비스를 점검하자 — 코드·UX·확장성·기획·북극성지표·보안·홍보 문제점 파악하고
+> 우선순위에 따라 개선하자."
+>
+> (English: "Let's audit the whole ultracode service: identify the problems in code, UX,
+> scalability, planning, North Star metric, security, and promotion, then fix them in priority
+> order.")
+**Process:**
+1. Pre-flight: North Star = "weekly successful first-install completions"; Inputs = wizard
+   completion rate, CLI flag coverage, install success rate.
+2. Fan-out: 7 finders, each with its rubric. ~40 raw candidate findings.
+3. Adversarial verify (re-ordered, diverse verifiers): 29 survive, 8 refuted, several land in
+   미검증 with 0 votes and stay there.
+4. Loop cap hit on one contested scalability claim → escalated to user, not debated to death.
+5. RICE: a confirmed-with-failing-test code bug (Confidence 100%) outranks a plausible-but-
+   unverified marketing gap (Confidence 50%) even though the marketing gap *felt* bigger.
+6. Synthesize.
+**Output** (abridged):
+```
+Service Audit — 확정 29 / 미검증 (carried) N / 기각 8
+M1 (now):   [Code·확정·RICE 9.6]  install crash on --with-* flag (failing test attached)
+            [Security·확정·RICE 8.1] secret in committed config — moves Input "install success"
+M2:         [UX·확정·RICE 6.4]   wizard step skips a category — moves Input "wizard completion"
+M3:         [Scale·확정·RICE 4.2] category list hardcoded in 2 places → derive
+M4:         [Promotion·확정·RICE 3.5] README over-claims a feature (Working-Backwards gap)
+Parked:     [Security·미검증·conf 50%] suspected authz gap — needs reproduction before action
+Rejected:   8 findings (rubric-match without real defect / already handled)
+            (the run stopped at M4 — milestone count follows the findings, it is not a fixed five)
+```
+Every M-item: dimension-tagged, verdict-graded, North-Star-linked, RICE-ranked.
+## Cross-references (don't duplicate — hand off)
+- **UX dimension** can spawn the multi-persona UX review skill (`multi-persona-review`) for
+  deeper persona-based heuristic inspection instead of a single UX finder.
+- **Gap findings** (built vs promised, missing E2E coverage) hand off to `gap-analysis-e2e`
+  rather than being re-derived here.
+- **The roadmap output** feeds `northstar-roadmap`, which owns milestone sequencing and
+  North-Star input modeling in depth.
+- For repo discipline this skill enforces: `.claude/rules/no-false-ship.md` (evidence matrix,
+  confirmed/unverified/rejected) and `.claude/rules/gates-taxonomy.md` (Revision-loop cap,
+  Escalation on contested findings).
+## Progressive disclosure
+This SKILL.md is the operating manual. If per-dimension rubrics need to grow (e.g. a full
+Nielsen severity scale, or a language-specific code-review checklist), put them in a
+`reference/` file beside this one and link it here — keep this file lean. The extensibility
+contract stays: a new dimension = a named rubric + its own independent verifier, nothing else.