npm - @uzysjung/agent-harness - Versions diffs - 26.85.0 → 26.87.0 - Mend

@uzysjung/agent-harness 26.85.0 → 26.87.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15) hide show

package/README.ko.md +2 -2
package/README.md +30 -6
package/dist/{chunk-SPZ63E7P.js → chunk-QHYH6P32.js} +115 -1
package/dist/chunk-QHYH6P32.js.map +1 -0
package/dist/index.js +139 -12
package/dist/index.js.map +1 -1
package/dist/trust-tier-drift.js +1 -1
package/package.json +2 -2
package/templates/skills/asis-tobe-decision/SKILL.md +161 -0
package/templates/skills/compaction-handoff/SKILL.md +178 -0
package/templates/skills/gap-analysis-e2e/SKILL.md +239 -0
package/templates/skills/multi-persona-review/SKILL.md +209 -0
package/templates/skills/northstar-roadmap/SKILL.md +176 -0
package/templates/skills/ultracode-service-audit/SKILL.md +222 -0
package/dist/chunk-SPZ63E7P.js.map +0 -1

package/dist/trust-tier-drift.js CHANGED Viewed

@@ -4,7 +4,7 @@ import {
   EXTERNAL_ASSETS,
   TRUST_TIER,
   init_esm_shims
-} from "./chunk-SPZ63E7P.js";
+} from "./chunk-QHYH6P32.js";
 // src/trust-tier-drift.ts
 init_esm_shims();

package/package.json CHANGED Viewed

@@ -1,7 +1,7 @@
 {
   "name": "@uzysjung/agent-harness",
-  "version": "26.85.0",
-  "description": "Curate vetted AI-coding workflows by your tech stack — install only what you need, across Claude Code, Codex, OpenCode & Antigravity",
+  "version": "26.87.0",
+  "description": "Curate vetted AI-coding skills & plugins by your tech stack — install only what you need, across Claude Code, Codex, OpenCode & Antigravity",
   "type": "module",
   "publishConfig": {
     "access": "public"

package/templates/skills/asis-tobe-decision/SKILL.md ADDED Viewed

@@ -0,0 +1,161 @@
+---
+name: asis-tobe-decision
+description: >-
+  Present a decision or confirmation request in the user's four-part format:
+  전후맥락 (context) → 추천 + 이유 (recommendation) → UI/UX 형태 (a scannable
+  table/option-list) → ASIS→TOBE contrast, led by the recommendation so the user
+  can say yes fast. Fire at genuine A-or-B / approval / explicit-ASIS-TOBE moments.
+  Triggers on the user's verbatim phrases "ASIS TOBE로 설명", "ASIS-TOBE로 알려줘",
+  "화면으로 ASIS TOBE로 설명", "의사결정 / 컨펌 요청", "이거 진행할까요?", and the
+  softer "다음 진행할 것들 알려줘"; English equivalents: "present this as ASIS/TOBE",
+  "give me the as-is to-be", "should I do A or B", "ask for my approval", "lay out
+  the options". Do NOT fire for pure information with no decision, or trivial
+  reversible actions you would just do.
+---
+# ASIS→TOBE Decision Presentation
+A presentation **format**, not a heavy process. It fires at the moment you would
+otherwise ask "should I do this?" or "A or B?" and turns that question into a
+one-screen artifact the user can approve, question, or push back on in a single
+pass.
+This codifies the user's standing rule in `.claude/CLAUDE.md`:
+> **의사결정 및 컨펌 요청 시**
+> 1. 이해가 쉽도록 전후 맥락과 함께 상세하게 설명
+> 2. 추천하는 제안과 그 이유를 설명
+> 3. UI/UX 형태로 이해할 수 있도록 설명
+> 4. ASIS TOBE 형태로 설명
+The rule exists but does not reliably self-trigger — the user has had to demand it
+mid-task ("구체적으로 화면으로 ASIS TOBE로 설명", "다음 진행할 것들 ASIS-TOBE로
+알려줘"). This skill makes it an **active default** so the user never has to ask.
+## When to use
+Fire whenever you are about to:
+- ask for approval before doing something ("이거 이렇게 진행할까요?")
+- offer the user a choice between two or more approaches
+- propose a change to architecture, config, scope, or plan
+- recommend one option over others
+Softer/secondary trigger: reporting "next steps" the user must sign off on ("다음
+진행할 것들"). Use judgment — render it in this format when a real yes/no is needed.
+Do **not** fire for pure information with no decision in it, or for trivial
+reversible actions you'd just do. This is for moments where a human "yes / no /
+which one" is genuinely needed.
+## The four slots (= the four CLAUDE.md items)
+The four slots map 1:1 to the user's rule. Cover all four, but **lead with the
+recommendation** — readers decide off the conclusion, so put it up top even though
+it is item 2 in the list. Everything after it is there to justify and let the user
+react to one specific cell.
+```
+추천 + 이유   (item 2)  ← lead here: the recommendation and the explicit ask
+전후맥락       (item 1)  ← context: the forces that make this decision necessary now
+UI/UX 형태     (item 3)  ← one scannable table / option-list, never prose
+ASIS→TOBE     (item 4)  ← current → proposed contrast, gap made concrete
+```
+### 추천 + 이유 (item 2) — lead with this
+State the recommendation as a **concrete commitment in active voice**: "I'll switch
+X to Y" — not "we could consider maybe looking at Y." Hedged language signals low
+confidence and pushes the user back into doing the analysis themselves. Give the
+short **why** (a line or two), then the **explicit ask**: "Approve A, or pick B?" A
+proposal with no actual ask leaves the loop open and guarantees another round.
+### 전후맥락 (item 1) — context
+The **forces at play** in plain language — the technical, product, or constraint
+pressure that makes this decision necessary now. Keep it to what the user needs to
+judge the call; without it a reader either blindly accepts or blindly rejects.
+### UI/UX 형태 (item 3) — show it, don't narrate it
+Render the choice as **one scannable table or option list, never a wall of prose** —
+this is the "UI/UX 형태로 이해할 수 있도록" item. Aligned columns the user can scan
+vertically; pre-answer the obvious objection inline ("왜 B가 아닌가") so they don't
+have to ask. When it's "which of these?", an option-comparison table fits; when it's
+"move from current to proposed?", use the ASIS→TOBE shape below.
+### ASIS→TOBE (item 4) — make the gap concrete
+Columns: *항목 / ASIS (현재) / TOBE (제안) / Gap*. **Quantify the gap** with a metric
+or cost — an unquantified gap ("느림 → 빨라짐") is rhetoric, not a basis for
+deciding. List trade-offs honestly, including the downside of the recommended
+option; hidden downsides surface later as distrust (a `no-false-ship` violation).
+You can close with a one-line tail of what happens on approval ("승인 시 → …") — keep
+it to the ask, not a separate required slot.
+## Worked example
+**Input** (what you're tempted to send): *"The compat checker re-validates every
+asset on each run and it's gotten slow. I could cache results. Want me to?"*
+**Output** (the format):
+> **추천 + 이유:** `gen-compatibility` 의 검증 결과를 콘텐츠 해시 기준으로 캐싱하도록
+> 바꾸겠습니다 — 변경분만 재검증해 시간을 가장 크게 줄이고 CI 코어 변동에도 안정적이라.
+> **승인해 주세요 — 아니면 B(캐시 없이 병렬화만)도 가능합니다.**
+>
+> **전후맥락:** 자산이 52개로 늘면서 매 실행마다 전 자산 재검증 → CI `gen-compat`
+> 스텝이 느려짐. 자산 내용은 거의 안 바뀌는데 매번 풀 재검증하는 게 병목.
+>
+> | 항목 | ASIS (현재) | TOBE (제안: 해시 캐싱) | Gap |
+> |------|-------------|------------------------|-----|
+> | 전체 검증 시간 | ~140s (52자산 전수) | ~12s (변경분만) | **-128s / 실행** |
+> | 캐시 미스 시 | — | 풀 재검증 폴백 | 정확성 손실 0 |
+> | 추가 코드 | 없음 | 캐시 read/write + 무효화 (~40줄) | 유지보수 +α |
+> | 트레이드오프 | — | 해시 충돌 시 stale 위험 | 무효화 키 = 파일내용 해시로 차단 |
+>
+> *왜 B(병렬화만)가 아닌가:* 병렬화는 시간을 ~3배만 줄이고(약 45s) 코어 수에 의존 —
+> 캐싱이 변경분만 검증해 더 크게(11배) 줄이고 CI 코어 변동에도 안정적이라 A 추천.
+>
+> 승인 시 → feature 브랜치에서 캐시 레이어 + 무효화 테스트 추가 → 로컬 `npm run ci`
+> 통과 → PR.
+Notice: the user can say "go" after the first two lines, or push back on exactly one
+cell of the table. No essay, no buried recommendation, no hidden downside.
+## Common failure modes to avoid
+- **Hedged recommendation** ("고려해볼 수 있습니다") — forces the user to do the
+  analysis. State a commitment.
+- **Recommendation buried under option analysis** — lead with the lead option.
+- **Context but no ask** — leaves the loop open. Always end the 추천 with the ask.
+- **Prose instead of a table** — skips the "UI/UX 형태" item. Show it scannable.
+- **Unquantified ASIS→TOBE gap** — rhetoric, not a decision basis. Attach a
+  metric or cost.
+- **Suppressed downsides** to make the proposal look cleaner — surfaces later as
+  distrust (and a `no-false-ship` violation). List trade-offs.
+## 왜 통하나 (근거, 선택적)
+Light support if you want to defend the format — not part of the body:
+- **Lead with the recommendation**: BLUF (Bottom Line Up Front) — readers decide off
+  the conclusion, so the ask goes first even though it's item 2.
+- **Keep supporting reasons few (~3), one scannable table**: executive-decision-slide
+  and working-memory (~7±2) guidance — more dilutes and pushes the reader back into
+  analysis-mode.
+- **Quantify the gap, list all consequences**: As-Is/To-Be gap analysis + ADR
+  (Nygard) — an unquantified or one-sided gap isn't a basis for deciding.
+- **Pre-answer the objection inline**: Amazon Working Backwards PR/FAQ — collapses
+  the back-and-forth into one round.
+- **Contested call? score on explicit criteria**: RICE (Intercom) — a small scoring
+  table is more defensible than prose advocacy.
+## Related skills
+A cross-cutting **presentation discipline**, not a workflow. Sibling skills that
+produce findings or choices should render them in this format:
+- `gap-analysis-e2e` — its gap output maps directly onto the ASIS→TOBE table.
+- `ultracode-service-audit` — present audit findings and remediation choices as
+  ASIS→TOBE.

package/templates/skills/compaction-handoff/SKILL.md ADDED Viewed

@@ -0,0 +1,178 @@
+---
+name: compaction-handoff
+description: >-
+  Execute a structured handoff right before a context compaction so no state is lost — persist
+  durable facts to memory, take an atomic git snapshot (clean tree + open-PR check), and emit a
+  fixed-field resume anchor (current state / verified / what's left / next action) plus a suggested
+  custom /compact summary line. Use when the user says "컴팩션 준비해줘", "컴팩션하고 이어서
+  진행할 수 있게 준비해줘", "핸드오프 준비해줘", or in English "prepare for compaction",
+  "get ready to compact and continue", "hand off before /compact", "checkpoint before compacting".
+  Also fire proactively when context is nearing the window limit and an auto-compact is imminent.
+---
+# Compaction Handoff Protocol
+A context window is about to be summarized and reinitialized. The model is stateless, so **only
+what you write out survives** — the new window starts from a lossy summary, not the live history.
+This skill runs the handoff deliberately so the resumed session can pick up without re-deriving
+lost state.
+> Sibling skill `strategic-compact` decides **WHEN** to compact (phase boundaries, token pressure).
+> This skill is the **HOW**: it executes the handoff at that moment. Run `strategic-compact`'s
+> decision first if you're unsure whether to compact at all; run this when the answer is "yes".
+## When to use
+The user works in Korean and triggers this repeatedly. Treat any of these as a fire signal:
+- **"컴팩션 준비해줘"** / **"컴팩션하고 이어서 진행할 수 있게 준비해줘"** / **"핸드오프 준비해줘"**
+- English equivalents: "prepare for compaction", "get ready to compact and continue", "hand off".
+- **Proactively, before the cliff.** Don't wait for auto-compact at 100%. Auto-compaction fires
+  near the window limit; a common practice is to trigger the handoff earlier (around ~80% of the
+  window) so there's still budget to write a clean anchor — a rushed handoff at the cliff is exactly
+  where load-bearing context gets dropped.
+Why proactive matters: auto-compact optimizes for *generic* continuity, not for THIS task's
+load-bearing facts (an open PR, a chosen-but-unwritten decision). If you skip the handoff because
+the task "looks finished," you risk silent information loss — the resumed agent assumes work
+shipped that did not.
+## The model: three legs of a checkpoint
+Treat the handoff as a deliberate checkpoint, not a passive summary. Three legs, each grounded in
+an established practice:
+1. **Persist durable facts to external memory** *(before compacting, never after)*.
+   The window will be wiped; structured note-taking exists precisely because the summary alone
+   can't be trusted to carry everything. Route load-bearing *decisions* to ADRs (`docs/decisions/`)
+   so the *why* survives every future compaction, not just this one.
+2. **Take an atomic, reconstructible snapshot** — git clean (or a `savepoint` commit) plus an
+   **open-PR check**. A dirty/half-committed tree is a corrupt checkpoint; an open PR is itself
+   critical state that belongs in "what's left." This is the Checkpoint **Atomicity** and **State
+   Completeness** principles applied to the working tree.
+3. **Emit a fixed-field resume anchor** — current state / verified / what's left / next action —
+   framed *working backwards* from the next concrete step so a freshly-compacted agent re-orients
+   instantly instead of replaying history forward.
+### Preserve-list, not a prose blob
+Use an explicit preserve-list rather than a free-form paragraph (Anthropic's stated preserve/discard
+split):
+| Preserve (high-value) | Discard (low-value) |
+|---|---|
+| Architectural decisions + their *why* (link ADR) | Redundant tool outputs, raw logs |
+| Unresolved bugs / blockers | Intermediate reasoning already acted on |
+| Processed-vs-remaining boundary | File contents you can re-read from disk |
+| Open PRs / unpushed branches | Tool-call counts and history |
+| Verified-vs-merely-claimed evidence | Restated CLAUDE.md / rules (already loaded) |
+## Workflow
+Run these in order. Each writes durable state *before* the window is touched.
+**1. Memory — persist durable facts.**
+Update the auto-memory (`MEMORY.md`) and any session-summary entry with the preserve-list items.
+For a load-bearing decision (architecture, dependency, data model, breaking change), write or update
+an ADR in `docs/decisions/` — this is the one place "the why behind the constraint" survives lossy
+compaction. Don't rely on the `/compact` summary to carry a decision; it strips provenance.
+**2. Git — atomic snapshot + open-PR check.**
+Make the working tree reflect a consistent state. The `gh pr list` step is **not optional** — it is
+the git-policy **Session Cleanup** gate, which is mandatory before any `/clear` or `/compact`. Run it
+every handoff:
+```bash
+git status --short                 # is the tree clean?
+# IF tree is dirty AND the work is worth keeping:
+git add -A && git commit -m "chore: savepoint before compaction handoff"
+gh pr list --state open            # list open PRs — each open PR is what's-left state to surface, not necessarily an anomaly
+git branch --show-current          # note unpushed branch state
+```
+This folds the git-policy **Session Cleanup** gate into the handoff. An open PR is not a
+loose end to hide — surface it in the anchor's "what's left" with its number, CI status, and
+mergeability, and let the user decide (no auto-merge).
+**3. Resume anchor — four fixed fields.**
+Emit the anchor. Keep "verified" distinct from "done": *done* is a claim, *verified* encodes the
+evidence (test PASS output, exit 0, a merged PR). The resumed agent re-trusts only what's verified.
+Make **next action** mandatory and singular — it's the entry point the resumed session executes
+first, so the anchor is self-serve (Recovery Automation), not a note to re-interpret.
+**4. Suggested /compact line.**
+Hand the user a one-line custom summary to pass to `/compact` (e.g.
+`/compact Resume: finish ADR-022 flag removal; PR #160 merged, branches drift test failing`). This
+is a pointer to the durable anchor, not a replacement for it.
+## Handoff report format
+Present a compact status table, then the anchor:
+```
+| Check          | Result                                  | Status |
+|----------------|-----------------------------------------|--------|
+| Memory         | MEMORY.md + ADR-022 updated             | ✓      |
+| Git            | tree clean, savepoint a1b2c3d           | ✓      |
+| Open PRs       | gh pr list → 0 open                      | ✓      |
+| Files          | resume anchor written, plan SSOT synced | ✓      |
+```
+## Worked example
+**Input** (user, at ~78% context):
+> 컴팩션하고 이어서 진행할 수 있게 준비해줘
+**Output:**
+Handoff checks:
+| Check    | Result                                          | Status |
+|----------|-------------------------------------------------|--------|
+| Memory   | MEMORY.md code-quality-cycle entry refreshed    | ✓      |
+| Git      | tree clean (no savepoint needed)                | ✓      |
+| Open PRs | `gh pr list` → **1 open: #165** CI pass, mergeable | ⚠ surfaced |
+| Files    | resume anchor below; persona-feedback SSOT synced | ✓      |
+**Resume anchor**
+- **Current state:** Persona P0 fixes in progress on branch `fix/persona-p0`. 2 of 4 fixes landed.
+- **Verified:** `npm run ci` exit 0 on current HEAD (typecheck + lint + coverage branches 88.4% + build). Docker scenario for `--with <id>` exit 0.
+- **What's left:** Fixes 3 & 4 (wizard copy + COMPATIBILITY honesty). **Open PR #165 is not merged** — publishing is blocked until P0 complete (per persona-feedback-improvements.md). Branch `fix/persona-p0` unpushed for the last 2 commits.
+- **Next action:** Implement fix 3 (wizard understanding-category copy), then re-run `npm run ci`.
+**Suggested /compact line:**
+`/compact Resume fix/persona-p0: do fix 3 (wizard copy) next, then ci; PR #165 open & blocks publish until all 4 P0 done`
+Notice what the anchor does deliberately: it leads with the **next action** (working-backwards), keeps
+the open PR visible in "what's left" instead of assuming it shipped, and separates the *verified* CI
+evidence from the *unverified* remaining fixes.
+## Pitfalls this guards against
+- **Lossy summary-stripping** — a headline-only handoff looks complete but is unrecoverable because
+  the evidence and reasoning chains are gone. Fix: preserve-list + ADR provenance, not a prose blob.
+- **Lost-in-the-Middle over-stuffing** — dumping full history into the new window doesn't help;
+  buried facts follow a U-shaped accuracy curve. The anchor stays short and structured.
+- **Telephone-game decay** — fidelity erodes across successive compactions. Routing load-bearing
+  facts to stable memory/ADRs breaks the degradation chain a chain of summaries can't.
+- **Non-atomic snapshot** — handing off with a half-committed tree reconstructs an inconsistent
+  state. The git leg forces a clean or savepoint-committed tree.
+- **Incomplete state save** — omitting the processed-vs-remaining boundary forces duplicate work or
+  silent re-execution of irreversible actions. The "what's left" field is mandatory.
+## References
+- Anthropic — *Effective context engineering for AI agents* (Compaction; Structured Note-Taking /
+  Agentic Memory): https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
+- Hendricks.ai — Checkpoint patterns (State Completeness, Atomicity, Recovery Automation):
+  https://hendricks.ai/insights/checkpoint-patterns-long-running-ai-agent-tasks
+- XTrace — AI agent context handoff (Decisions / Artifacts / Preferences / Timeline; decision
+  provenance): https://xtrace.ai/blog/ai-agent-context-handoff
+- Architectural Decision Records (ADR) — append-only single-decision records; route load-bearing
+  decisions here so the *why* survives compaction.
+- Amazon *Working Backwards* — lead the anchor with the end state and the single next action.
+## Related skills
+- **strategic-compact** — decides WHEN to compact (phase boundaries, token pressure). Pair with it:
+  it answers "compact now?", this skill executes the handoff when the answer is yes.
+- **git-policy Session Cleanup** — the open-PR check (leg 2) is the same gate; this skill folds it
+  into the pre-compaction moment.

package/templates/skills/gap-analysis-e2e/SKILL.md ADDED Viewed

@@ -0,0 +1,239 @@
+---
+name: gap-analysis-e2e
+description: >-
+  Two-mode chained gap analysis. DETECT scans the current service end-to-end
+  through three lenses — north-star alignment, correctness (bugs), and
+  user-perspective (UX) — and enumerates concrete, severity-ranked gaps. Then
+  BENCHMARK researches how reference/benchmark services actually solved each
+  high-ranked gap and PROPOSES a closing approach. Use when the user says any of:
+  "북극성 기준으로 부족한 점", "사용자 관점에서 부족한 점", "다른 벤치마크 서비스는 이 부분을
+  어떻게 해결했는지", "갭분석", "레퍼런스 서비스랑 비교해서 부족한 점 찾아줘" — or the
+  English equivalents: "gap analysis", "what are we missing vs the
+  ideal/north-star", "benchmark against reference services". Fires for both
+  Korean and English phrasing.
+---
+# Gap Analysis E2E (reverse + competitive)
+A targeted loop, not a sweep. You take the service as it is today, find where it
+falls short of the ideal **and** where it is simply broken or awkward, then for
+each real shortfall you go look at how a benchmark service solved that exact
+problem before proposing a fix. The user's own framing:
+> "현재 서비스 개발 상태에서 북극성 기준으로 부족하거나, 버그가 있거나, 사용자 관점에서 부족한 점을
+> 인지하면 다른 벤치마크 서비스는 이부분을 어떻게 해결했는지를 확인해 가는 거지."
+So the work fuses two moves that are usually done separately: **reverse-gap**
+(distance from the north-star / ideal) and **competitive benchmark resolution**
+(how others closed the same gap). DETECT finds gaps; BENCHMARK closes them. They
+chain: detect → for each gap, benchmark → propose fix.
+## When to use
+- You have a working-ish service and want to know, concretely, where it's behind
+  its own north star, where it has bugs, and where the UX disappoints.
+- You've found a weak spot and want "다른 유사 깃허브 프로젝트 / 레퍼런스 SaaS 는 이걸
+  어떻게 했지?" before inventing a fix.
+- You want a ranked, auditable list of gap → benchmark evidence → proposed close,
+  not a vague "we could improve X."
+Not for: directing the roadmap forward (that's `northstar-roadmap` — it DIRECTS;
+this DETECTS gaps against the same north star). Not a full N-dimension audit
+(that's `ultracode-service-audit`; this is a narrower gap-to-benchmark loop).
+---
+## MODE 1 — DETECT
+Run three **independent** passes, then consolidate. The usability and gap-analysis
+literature is unanimous that one undifferentiated pass systematically under-finds:
+heuristic evaluation works precisely because several evaluators inspect separately
+and you aggregate (Nielsen & Molich). Blend the lenses into one sweep and you
+will miss large categories of gap. So scan each lens on its own terms, then merge.
+A gap is only valid if it is a **concrete delta between two describable states** —
+the observable current state and a specific ideal state. "It feels unpolished" is
+an opinion; "the onboarding has no empty-state for zero projects, the ideal is a
+guided first-run" is a gap. (Gap Analysis: Current → Future State.)
+### Lens A — North-star alignment (the reverse / planning lens)
+For the ideal state, use a **Working-Backwards** artifact: write (or read, if it
+exists in `docs/NORTH_STAR.md`) the one-paragraph press release of the finished,
+ideal product, then reason backward. The gap is the distance between today's
+product and that press release (Amazon PR-FAQ). Then make it testable with the
+**North Star Framework**: is each surface tied to a north-star *input* metric? Two
+gap shapes fall out automatically:
+- an input lever that should move the north star but doesn't, and
+- product surface area that contributes to **no** input (candidate for removal).
+Where this repo's north star lives: `docs/NORTH_STAR.md` and `CLAUDE.md`
+("설치 서비스 = installer + curator"). Judge surfaces against *that*, not taste.
+No `docs/NORTH_STAR.md`? Don't skip Lens A — write the one-paragraph
+Working-Backwards press release *inline* from the README / `CLAUDE.md` vision
+first, then score against it. The ideal state is the anchor; an absent file is no
+excuse to drop the planning lens.
+### Lens B — Correctness (the bug lens)
+Inspect for things that are simply wrong: broken flows, crashes, mismatched
+advertised-vs-actual behavior, drift between docs and code. In this repo the
+`no-false-ship` rule names the exact failure family — a `--with-*` flag that's
+advertised but unregistered, a `--version` that lies, a category missing from the
+wizard. Treat each as a correctness gap with a reproduction, not a hunch.
+### Lens C — User-perspective (the UX lens)
+Judge the interface against **named criteria**, not vibes — Nielsen's 10
+heuristics (visibility of system status, match to the real world, error
+prevention, recognition over recall, etc.) so each finding traces to a principle
+and is reproducible. For the heavy UX pass, hand this lens to the
+**`multi-persona-review`** skill (independent persona evaluators) rather than
+duplicating its machinery here. Remember the limits: heuristic inspection finds
+roughly half of what real user testing finds and produces false positives — it's
+a cheap first filter, not ground truth.
+### Score every gap before you spend benchmark effort
+Never present an unranked gap list — the benchmark research in Mode 2 is the
+expensive part, so it must run only on gaps that matter. Tag each gap with:
+- **Severity 0–4** (Nielsen): roughly frequency × impact × persistence. 0 = not
+  really a problem, 4 = catastrophe, must fix before release.
+- **Opportunity (optional, ODI)**: `Importance + max(Importance − Satisfaction, 0)`
+  (importance weighted twice; Ulwick). High-importance/low-satisfaction =
+  under-served, prime target. Low-importance/high-satisfaction = **over-served** —
+  flag it for *removal/simplification*, not addition. Surfacing over-served areas
+  is the structural antidote to feature bloat; a good scan proposes cuts too.
+DETECT is fully usable on **severity 0–4 alone**. ODI needs real importance and
+satisfaction data; for a solo/tooling repo without it, *skip* the Opp. column
+rather than inventing importance/satisfaction numbers — fabricated inputs launder a
+guess as data. Reach for ODI only when you genuinely have user-sourced signal.
+Keep the numbers as a prioritization aid, not proof — self-reported importance and
+made-up severity launder a guess as data if you over-trust them.
+**DETECT output** — one table:
+| # | Lens | Gap (current → ideal delta) | Severity 0–4 | Opp. | Notes / repro |
+|---|------|------------------------------|--------------|------|---------------|
+Scale the rigor to severity: a 4 earns the full reverse-from-ideal write-up; a 1
+gets a one-line pre-flight note. Don't run the heavy PR-FAQ ritual on every tiny
+gap — that's analysis paralysis.
+---
+## MODE 2 — BENCHMARK (runs only on high-ranked gaps)
+For each gap worth closing, work like a **competitive teardown**: take apart how a
+reference service *actually* solves that exact problem and document the **verified
+mechanism** — the real flow, states, and copy you observed — not the assumed
+implementation. This mirrors `no-false-ship`: claim only what you inspected. If you
+couldn't verify how they do it, **say so** ("could not inspect — inferred") rather
+than fabricating a plausible-sounding mechanism. Fictional evidence is the named
+failure mode of both Working-Backwards and this skill.
+Sources, in order of trust: the running reference product / its repo (first-hand),
+then docs, then write-ups. For "다른 유사 깃허브 프로젝트 보고 수정", read their actual
+code path, not their README claims.
+Then **PROPOSE** the closing approach in **jobs-to-be-done** terms — what job does
+the user need done — and consciously resist the **feature-parity trap**. Copying a
+competitor's feature list is a catch-up trap that breeds bloat (Zune out-featured
+the iPod and lost; customers wanted the job done, not the features). For each gap,
+decide explicitly: does closing it defend table-stakes, or does a *differentiated*
+approach make the competitor's solution irrelevant? Propose accordingly.
+Record each proposed fix **ADR-style** — rationale + the rejected benchmark
+alternative — so the whole chain is auditable. (This repo already has an
+`architecture-decision-record` convention and `docs/decisions/`.)
+**BENCHMARK output** — per high-ranked gap:
+```
+Gap #N (sev X): <one line>
+  Benchmark:   <service> — VERIFIED how they solve it: <real flow/state/copy>
+               [or: COULD NOT INSPECT — inferred, treat as hypothesis]
+  Job:         <the customer job this gap blocks>
+  Proposed:    <closing approach in JTBD terms — differentiate, don't mirror>
+  Rejected:    <the benchmark's exact approach, and why not, if diverging>
+```
+---
+## The chain, in order
+1. **Define states.** Current (observable) + ideal (Working-Backwards press
+   release, anchored to `docs/NORTH_STAR.md`). A gap is the delta between them.
+2. **DETECT** — three independent passes (north-star / correctness / UX via
+   `multi-persona-review`), each against named criteria.
+3. **Consolidate & score** — merge into one table; severity 0–4 + optional ODI
+   opportunity; tag over-served items for removal.
+4. **BENCHMARK** — only the high-ranked gaps; verified teardown of how a reference
+   service solves each; mark anything unverified.
+5. **PROPOSE** — closing approach in JTBD terms, differentiate over parity-match,
+   recorded ADR-style with the rejected alternative.
+---
+## Worked example (Input → Output)
+**Input:** "이 설치 서비스 갭분석 해줘 — 북극성 대비 부족한 점이랑 버그랑 UX, 그리고 다른
+벤치마크는 어떻게 했는지."
+**DETECT (consolidated, abridged):**
+| # | Lens | Gap (current → ideal) | Sev | Opp | Notes |
+|---|------|------------------------|-----|-----|-------|
+| 1 | North-star | Wizard lists assets but never explains *why* each is vetted; north star is "이해하고 선택", so an unexplained list under-serves the core job | 3 | 14 | no provenance/★ shown at select time |
+| 2 | Correctness | `--with-foo` advertised in README but crashes (flag unregistered) | 4 | — | repro: `install --with-foo` → CAC throw |
+| 3 | UX | First run gives no "what happens next" status (Nielsen: visibility of system status) | 3 | 11 | via multi-persona-review |
+| 4 | North-star (over-served) | Three near-duplicate verbose `--help` walls; low importance, high satisfaction | 1 | 2 | candidate for **removal** |
+**BENCHMARK (gap #1, high-ranked):**
+```
+Gap #1 (sev 3): wizard shows assets with no "why vetted" at decision time
+  Benchmark:   VS Code Marketplace — VERIFIED: each extension card shows
+               install count + verified-publisher badge + star rating inline
+               in the pick list, so the trust signal sits at the moment of choice.
+  Job:         "I need to trust this asset enough to install it, right here."
+  Proposed:    Inline a one-line provenance (source repo + ★ + 'vetted: <date>')
+               on each wizard row — surface the trust signal at decision time.
+               Differentiator: we curate, so add a one-line *curator reason*,
+               which a raw marketplace can't.
+  Rejected:    Marketplace's full detail-page-per-extension — too heavy for a
+               terminal wizard; defers the decision instead of supporting it.
+```
+The gap #1 proposal lands as an ADR — e.g. `docs/decisions/ADR-0NN-wizard-provenance.md`
+recording the inline-provenance decision and the rejected full-detail-page
+alternative — so step 5's "record ADR-style" is concrete, not just advice.
+Gap #2 (correctness, sev 4) skips benchmarking — it's a bug, fix directly and add
+the drift guard `no-false-ship` requires. Gap #4 proposes deletion, not a
+benchmark. That selective routing is the point: spend research only where it pays.
+---
+## Cross-references (don't duplicate)
+- **`multi-persona-review`** — owns the UX lens (Lens C). Invoke it; don't
+  re-implement persona evaluation here.
+- **`northstar-roadmap`** — same north star, opposite direction: it *directs* the
+  roadmap forward; this *detects* gaps against it.
+- **`ultracode-service-audit`** — the full N-dimension sweep. This skill is the
+  narrower, faster gap → benchmark loop when you don't need the whole audit.
+- **`architecture-decision-record`** — record each proposed fix as an ADR.
+## Notes on rigor (where deeper detail would live)
+If a future version needs the full scoring rubrics (the complete Nielsen 10-item
+checklist text, the ODI questionnaire wording) or per-domain benchmark source
+lists, the option is to summarize here and split the long-form into a sibling
+`reference.md` — no such file exists yet, and this SKILL.md is self-sufficient
+without it. Keep SKILL.md the practical map, not the encyclopedia.