@uzysjung/agent-harness 26.86.0 → 26.88.0

@@ -4,7 +4,7 @@ import {
4
4
  EXTERNAL_ASSETS,
5
5
  TRUST_TIER,
6
6
  init_esm_shims
7
- } from "./chunk-EKLV22W3.js";
7
+ } from "./chunk-QHYH6P32.js";
8
8
 
9
9
  // src/trust-tier-drift.ts
10
10
  init_esm_shims();
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@uzysjung/agent-harness",
3
- "version": "26.86.0",
3
+ "version": "26.88.0",
4
4
  "description": "Curate vetted AI-coding skills & plugins by your tech stack — install only what you need, across Claude Code, Codex, OpenCode & Antigravity",
5
5
  "type": "module",
6
6
  "publishConfig": {
@@ -0,0 +1,161 @@
1
+ ---
2
+ name: asis-tobe-decision
3
+ description: >-
4
+ Present a decision or confirmation request in the user's four-part format:
5
+ 전후맥락 (context) → 추천 + 이유 (recommendation) → UI/UX 형태 (a scannable
6
+ table/option-list) → ASIS→TOBE contrast, led by the recommendation so the user
7
+ can say yes fast. Fire at genuine A-or-B / approval / explicit-ASIS-TOBE moments.
8
+ Triggers on the user's verbatim phrases "ASIS TOBE로 설명", "ASIS-TOBE로 알려줘",
9
+ "화면으로 ASIS TOBE로 설명", "의사결정 / 컨펌 요청", "이거 진행할까요?", and the
10
+ softer "다음 진행할 것들 알려줘"; English equivalents: "present this as ASIS/TOBE",
11
+ "give me the as-is to-be", "should I do A or B", "ask for my approval", "lay out
12
+ the options". Do NOT fire for pure information with no decision, or trivial
13
+ reversible actions you would just do.
14
+ ---
15
+
16
+ # ASIS→TOBE Decision Presentation
17
+
18
+ A presentation **format**, not a heavy process. It fires at the moment you would
19
+ otherwise ask "should I do this?" or "A or B?" and turns that question into a
20
+ one-screen artifact the user can approve, question, or push back on in a single
21
+ pass.
22
+
23
+ This codifies the user's standing rule in `.claude/CLAUDE.md`:
24
+
25
+ > **의사결정 및 컨펌 요청 시**
26
+ > 1. 이해가 쉽도록 전후 맥락과 함께 상세하게 설명
27
+ > 2. 추천하는 제안과 그 이유를 설명
28
+ > 3. UI/UX 형태로 이해할 수 있도록 설명
29
+ > 4. ASIS TOBE 형태로 설명
30
+
31
+ The rule exists but does not reliably self-trigger — the user has had to demand it
32
+ mid-task ("구체적으로 화면으로 ASIS TOBE로 설명", "다음 진행할 것들 ASIS-TOBE로
33
+ 알려줘"). This skill makes it an **active default** so the user never has to ask.
34
+
35
+ ## When to use
36
+
37
+ Fire whenever you are about to:
38
+
39
+ - ask for approval before doing something ("이거 이렇게 진행할까요?")
40
+ - offer the user a choice between two or more approaches
41
+ - propose a change to architecture, config, scope, or plan
42
+ - recommend one option over others
43
+
44
+ Softer/secondary trigger: reporting "next steps" the user must sign off on ("다음
45
+ 진행할 것들"). Use judgment — render it in this format when a real yes/no is needed.
46
+
47
+ Do **not** fire for pure information with no decision in it, or for trivial
48
+ reversible actions you'd just do. This is for moments where a human "yes / no /
49
+ which one" is genuinely needed.
50
+
51
+ ## The four slots (= the four CLAUDE.md items)
52
+
53
+ The four slots map 1:1 to the user's rule. Cover all four, but **lead with the
54
+ recommendation** — readers decide off the conclusion, so put it up top even though
55
+ it is item 2 in the list. Everything after it is there to justify it and to let the user
56
+ react to one specific cell.
57
+
58
+ ```
59
+ 추천 + 이유 (item 2) ← lead here: the recommendation and the explicit ask
60
+ 전후맥락 (item 1) ← context: the forces that make this decision necessary now
61
+ UI/UX 형태 (item 3) ← one scannable table / option-list, never prose
62
+ ASIS→TOBE (item 4) ← current → proposed contrast, gap made concrete
63
+ ```
64
+
65
+ ### 추천 + 이유 (item 2) — lead with this
66
+
67
+ State the recommendation as a **concrete commitment in active voice**: "I'll switch
68
+ X to Y" — not "we could consider maybe looking at Y." Hedged language signals low
69
+ confidence and pushes the user back into doing the analysis themselves. Give the
70
+ short **why** (a line or two), then the **explicit ask**: "Approve A, or pick B?" A
71
+ proposal with no actual ask leaves the loop open and guarantees another round.
72
+
73
+ ### 전후맥락 (item 1) — context
74
+
75
+ The **forces at play** in plain language — the technical, product, or constraint
76
+ pressure that makes this decision necessary now. Keep it to what the user needs to
77
+ judge the call; without it a reader either blindly accepts or blindly rejects.
78
+
79
+ ### UI/UX 형태 (item 3) — show it, don't narrate it
80
+
81
+ Render the choice as **one scannable table or option list, never a wall of prose** —
82
+ this is the "UI/UX 형태로 이해할 수 있도록" item. Aligned columns the user can scan
83
+ vertically; pre-answer the obvious objection inline ("왜 B가 아닌가") so they don't
84
+ have to ask. When it's "which of these?", an option-comparison table fits; when it's
85
+ "move from current to proposed?", use the ASIS→TOBE shape below.
86
+
87
+ ### ASIS→TOBE (item 4) — make the gap concrete
88
+
89
+ Columns: *항목 / ASIS (현재) / TOBE (제안) / Gap*. **Quantify the gap** with a metric
90
+ or cost — an unquantified gap ("느림 → 빨라짐") is rhetoric, not a basis for
91
+ deciding. List trade-offs honestly, including the downside of the recommended
92
+ option; hidden downsides surface later as distrust (a `no-false-ship` violation).
93
+ You can close with a one-line tail of what happens on approval ("승인 시 → …") — keep
94
+ it to the ask, not a separate required slot.
95
+
96
+ ## Worked example
97
+
98
+ **Input** (what you're tempted to send): *"The compat checker re-validates every
99
+ asset on each run and it's gotten slow. I could cache results. Want me to?"*
100
+
101
+ **Output** (the format):
102
+
103
+ > **추천 + 이유:** `gen-compatibility` 의 검증 결과를 콘텐츠 해시 기준으로 캐싱하도록
104
+ > 바꾸겠습니다 — 변경분만 재검증해 시간을 가장 크게 줄이고 CI 코어 변동에도 안정적이라.
105
+ > **승인해 주세요 — 아니면 B(캐시 없이 병렬화만)도 가능합니다.**
106
+ >
107
+ > **전후맥락:** 자산이 52개로 늘면서 매 실행마다 전 자산 재검증 → CI `gen-compat`
108
+ > 스텝이 느려짐. 자산 내용은 거의 안 바뀌는데 매번 풀 재검증하는 게 병목.
109
+ >
110
+ > | 항목 | ASIS (현재) | TOBE (제안: 해시 캐싱) | Gap |
111
+ > |------|-------------|------------------------|-----|
112
+ > | 전체 검증 시간 | ~140s (52자산 전수) | ~12s (변경분만) | **-128s / 실행** |
113
+ > | 캐시 미스 시 | — | 풀 재검증 폴백 | 정확성 손실 0 |
114
+ > | 추가 코드 | 없음 | 캐시 read/write + 무효화 (~40줄) | 유지보수 +α |
115
+ > | 트레이드오프 | — | 해시 충돌 시 stale 위험 | 무효화 키 = 파일내용 해시로 차단 |
116
+ >
117
+ > *왜 B(병렬화만)가 아닌가:* 병렬화는 시간을 ~3배만 줄이고(약 45s) 코어 수에 의존 —
118
+ > 캐싱이 변경분만 검증해 더 크게(11배) 줄이고 CI 코어 변동에도 안정적이라 A 추천.
119
+ >
120
+ > 승인 시 → feature 브랜치에서 캐시 레이어 + 무효화 테스트 추가 → 로컬 `npm run ci`
121
+ > 통과 → PR.
122
+
123
+ Notice: the user can say "go" after the first two lines, or push back on exactly one
124
+ cell of the table. No essay, no buried recommendation, no hidden downside.
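The Gap column in the table above is plain arithmetic, and computing it rather than eyeballing it keeps the ASIS→TOBE contrast checkable. A minimal TypeScript sketch, assuming the ~140s and ~12s figures from the worked example; the `quantifyGap` helper and the `GapRow` field names are hypothetical and not part of the package:

```typescript
// Hypothetical helper: quantify an ASIS -> TOBE gap for one table row.
interface GapRow {
  item: string;
  asis: number; // current value, e.g. seconds per run
  tobe: number; // proposed value, in the same unit
  unit: string;
}

function quantifyGap(row: GapRow): { delta: number; factor: number; cell: string } {
  const delta = row.tobe - row.asis; // negative means a reduction
  const factor = Math.round((row.asis / row.tobe) * 10) / 10; // one decimal place
  const sign = delta > 0 ? "+" : "";
  return { delta, factor, cell: `${sign}${delta}${row.unit} (${factor}x)` };
}

// Figures taken from the "전체 검증 시간" row of the worked example.
const validation = quantifyGap({ item: "full validation time", asis: 140, tobe: 12, unit: "s" });
```

With these inputs the delta matches the table's -128s per run, and the factor is roughly the 11x quoted in the objection line.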
125
+
126
+ ## Common failure modes to avoid
127
+
128
+ - **Hedged recommendation** ("고려해볼 수 있습니다") — forces the user to do the
129
+ analysis. State a commitment.
130
+ - **Recommendation buried under option analysis**: lead with the recommended option.
131
+ - **Context but no ask** — leaves the loop open. Always end the 추천 with the ask.
132
+ - **Prose instead of a table** — skips the "UI/UX 형태" item. Show it scannable.
133
+ - **Unquantified ASIS→TOBE gap** — rhetoric, not a decision basis. Attach a
134
+ metric or cost.
135
+ - **Suppressed downsides** to make the proposal look cleaner — surfaces later as
136
+ distrust (and a `no-false-ship` violation). List trade-offs.
137
+
138
+ ## Why it works (rationale, optional)
139
+
140
+ Light support if you want to defend the format — not part of the body:
141
+
142
+ - **Lead with the recommendation**: BLUF (Bottom Line Up Front) — readers decide off
143
+ the conclusion, so the ask goes first even though it's item 2.
144
+ - **Keep supporting reasons few (~3), one scannable table**: executive-decision-slide
145
+ and working-memory (~7±2) guidance — more dilutes and pushes the reader back into
146
+ analysis-mode.
147
+ - **Quantify the gap, list all consequences**: As-Is/To-Be gap analysis + ADR
148
+ (Nygard) — an unquantified or one-sided gap isn't a basis for deciding.
149
+ - **Pre-answer the objection inline**: Amazon Working Backwards PR/FAQ — collapses
150
+ the back-and-forth into one round.
151
+ - **Contested call? score on explicit criteria**: RICE (Intercom) — a small scoring
152
+ table is more defensible than prose advocacy.
153
+
154
+ ## Related skills
155
+
156
+ A cross-cutting **presentation discipline**, not a workflow. Sibling skills that
157
+ produce findings or choices should render them in this format:
158
+
159
+ - `gap-analysis-e2e` — its gap output maps directly onto the ASIS→TOBE table.
160
+ - `ultracode-service-audit` — present audit findings and remediation choices as
161
+ ASIS→TOBE.
@@ -0,0 +1,178 @@
1
+ ---
2
+ name: compaction-handoff
3
+ description: >-
4
+ Execute a structured handoff right before a context compaction so no state is lost — persist
5
+ durable facts to memory, take an atomic git snapshot (clean tree + open-PR check), and emit a
6
+ fixed-field resume anchor (current state / verified / what's left / next action) plus a suggested
7
+ custom /compact summary line. Use when the user says "컴팩션 준비해줘", "컴팩션하고 이어서
8
+ 진행할 수 있게 준비해줘", "핸드오프 준비해줘", or in English "prepare for compaction",
9
+ "get ready to compact and continue", "hand off before /compact", "checkpoint before compacting".
10
+ Also fire proactively when context is nearing the window limit and an auto-compact is imminent.
11
+ ---
12
+
13
+ # Compaction Handoff Protocol
14
+
15
+ A context window is about to be summarized and reinitialized. The model is stateless, so **only
16
+ what you write out survives** — the new window starts from a lossy summary, not the live history.
17
+ This skill runs the handoff deliberately so the resumed session can pick up without re-deriving
18
+ lost state.
19
+
20
+ > Sibling skill `strategic-compact` decides **WHEN** to compact (phase boundaries, token pressure).
21
+ > This skill is the **HOW**: it executes the handoff at that moment. Run `strategic-compact`'s
22
+ > decision first if you're unsure whether to compact at all; run this when the answer is "yes".
23
+
24
+ ## When to use
25
+
26
+ The user works in Korean and triggers this repeatedly. Treat any of these as a fire signal:
27
+
28
+ - **"컴팩션 준비해줘"** / **"컴팩션하고 이어서 진행할 수 있게 준비해줘"** / **"핸드오프 준비해줘"**
29
+ - English equivalents: "prepare for compaction", "get ready to compact and continue", "hand off".
30
+ - **Proactively, before the cliff.** Don't wait for auto-compact at 100%. Auto-compaction fires
31
+ near the window limit; a common practice is to trigger the handoff earlier (around ~80% of the
32
+ window) so there's still budget to write a clean anchor — a rushed handoff at the cliff is exactly
33
+ where load-bearing context gets dropped.
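The proactive trigger can be reduced to a single threshold check. A minimal TypeScript sketch, assuming the caller can supply token counts; the `shouldStartHandoff` name is hypothetical, and the 0.8 default only mirrors the "around ~80%" practice described above rather than any documented setting:

```typescript
// Hypothetical check: start the handoff once usage crosses a fraction of the window.
// Assumption: the 0.8 default reflects the ~80% rule of thumb in the text.
function shouldStartHandoff(
  usedTokens: number,
  windowTokens: number,
  threshold: number = 0.8,
): boolean {
  if (windowTokens <= 0) {
    throw new RangeError("windowTokens must be positive");
  }
  return usedTokens / windowTokens >= threshold;
}

const nearCliff = shouldStartHandoff(170000, 200000); // 85% of the window
const midTask = shouldStartHandoff(100000, 200000); // 50% of the window
```

A lower threshold buys more budget for writing the anchor at the cost of compacting more often; the right value depends on how large the handoff itself tends to be.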
34
+
35
+ Why proactive matters: auto-compact optimizes for *generic* continuity, not for THIS task's
36
+ load-bearing facts (an open PR, a chosen-but-unwritten decision). If you skip the handoff because
37
+ the task "looks finished," you risk silent information loss — the resumed agent assumes work
38
+ shipped that did not.
39
+
40
+ ## The model: three legs of a checkpoint
41
+
42
+ Treat the handoff as a deliberate checkpoint, not a passive summary. Three legs, each grounded in
43
+ an established practice:
44
+
45
+ 1. **Persist durable facts to external memory** *(before compacting, never after)*.
46
+ The window will be wiped; structured note-taking exists precisely because the summary alone
47
+ can't be trusted to carry everything. Route load-bearing *decisions* to ADRs (`docs/decisions/`)
48
+ so the *why* survives every future compaction, not just this one.
49
+ 2. **Take an atomic, reconstructible snapshot** — git clean (or a `savepoint` commit) plus an
50
+ **open-PR check**. A dirty/half-committed tree is a corrupt checkpoint; an open PR is itself
51
+ critical state that belongs in "what's left." These are the Checkpoint **Atomicity** and **State
52
+ Completeness** principles applied to the working tree.
53
+ 3. **Emit a fixed-field resume anchor** — current state / verified / what's left / next action —
54
+ framed *working backwards* from the next concrete step so a freshly-compacted agent re-orients
55
+ instantly instead of replaying history forward.
56
+
57
+ ### Preserve-list, not a prose blob
58
+
59
+ Use an explicit preserve-list rather than a free-form paragraph (Anthropic's stated preserve/discard
60
+ split):
61
+
62
+ | Preserve (high-value) | Discard (low-value) |
63
+ |---|---|
64
+ | Architectural decisions + their *why* (link ADR) | Redundant tool outputs, raw logs |
65
+ | Unresolved bugs / blockers | Intermediate reasoning already acted on |
66
+ | Processed-vs-remaining boundary | File contents you can re-read from disk |
67
+ | Open PRs / unpushed branches | Tool-call counts and history |
68
+ | Verified-vs-merely-claimed evidence | Restated CLAUDE.md / rules (already loaded) |
69
+
70
+ ## Workflow
71
+
72
+ Run these in order. Each writes durable state *before* the window is touched.
73
+
74
+ **1. Memory — persist durable facts.**
75
+ Update the auto-memory (`MEMORY.md`) and any session-summary entry with the preserve-list items.
76
+ For a load-bearing decision (architecture, dependency, data model, breaking change), write or update
77
+ an ADR in `docs/decisions/` — this is the one place "the why behind the constraint" survives lossy
78
+ compaction. Don't rely on the `/compact` summary to carry a decision; it strips provenance.
79
+
80
+ **2. Git — atomic snapshot + open-PR check.**
81
+ Make the working tree reflect a consistent state. The `gh pr list` step is **not optional** — it is
82
+ the git-policy **Session Cleanup** gate, which is mandatory before any `/clear` or `/compact`. Run it
83
+ every handoff:
84
+ ```bash
85
+ git status --short # is the tree clean?
86
+ # IF tree is dirty AND the work is worth keeping:
87
+ git add -A && git commit -m "chore: savepoint before compaction handoff"
88
+ gh pr list --state open # list open PRs — each open PR is what's-left state to surface, not necessarily an anomaly
89
+ git branch --show-current # note unpushed branch state
90
+ ```
91
+ An open PR is not a
92
+ loose end to hide — surface it in the anchor's "what's left" with its number, CI status, and
93
+ mergeability, and let the user decide (no auto-merge).
94
+
95
+ **3. Resume anchor — four fixed fields.**
96
+ Emit the anchor. Keep "verified" distinct from "done": *done* is a claim, *verified* encodes the
97
+ evidence (test PASS output, exit 0, a merged PR). The resumed agent re-trusts only what's verified.
98
+ Make **next action** mandatory and singular — it's the entry point the resumed session executes
99
+ first, so the anchor is self-serve (Recovery Automation), not a note to re-interpret.
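The four fixed fields can be treated as a small typed record, which makes the "next action is mandatory" rule enforceable instead of advisory. A minimal TypeScript sketch; the `ResumeAnchor` shape, the `renderAnchor` function, and the sample values are all hypothetical illustrations, not code from the package:

```typescript
// Hypothetical shape for the four fixed fields of a resume anchor.
interface ResumeAnchor {
  currentState: string;
  verified: string[]; // evidence-backed items only, e.g. "npm run ci exit 0"
  whatsLeft: string[]; // remaining work, including open PRs and unpushed branches
  nextAction: string; // the single entry point for the resumed session
}

function renderAnchor(anchor: ResumeAnchor): string {
  // Refuse to emit an anchor without a next action.
  if (anchor.nextAction.trim() === "") {
    throw new Error("next action is mandatory");
  }
  return [
    `- **Current state:** ${anchor.currentState}`,
    `- **Verified:** ${anchor.verified.join("; ")}`,
    `- **What's left:** ${anchor.whatsLeft.join("; ")}`,
    `- **Next action:** ${anchor.nextAction}`,
  ].join("\n");
}

// Sample values are made up for illustration.
const anchorText = renderAnchor({
  currentState: "2 of 4 fixes landed on the feature branch",
  verified: ["npm run ci exit 0 on current HEAD"],
  whatsLeft: ["fixes 3 and 4", "one open PR, not merged"],
  nextAction: "Implement fix 3",
});
```

Keeping `verified` as its own array, separate from `whatsLeft`, mirrors the done-versus-verified distinction: nothing lands in `verified` without evidence attached.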
100
+
101
+ **4. Suggested /compact line.**
102
+ Hand the user a one-line custom summary to pass to `/compact` (e.g.
103
+ `/compact Resume: finish ADR-022 flag removal; PR #160 merged, branches drift test failing`). This
104
+ is a pointer to the durable anchor, not a replacement for it.
105
+
106
+ ## Handoff report format
107
+
108
+ Present a compact status table, then the anchor:
109
+
110
+ ```
111
+ | Check | Result | Status |
112
+ |----------------|-----------------------------------------|--------|
113
+ | Memory | MEMORY.md + ADR-022 updated | ✓ |
114
+ | Git | tree clean, savepoint a1b2c3d | ✓ |
115
+ | Open PRs | gh pr list → 0 open | ✓ |
116
+ | Files | resume anchor written, plan SSOT synced | ✓ |
117
+ ```
118
+
119
+ ## Worked example
120
+
121
+ **Input** (user, at ~78% context):
122
+ > 컴팩션하고 이어서 진행할 수 있게 준비해줘
123
+
124
+ **Output:**
125
+
126
+ Handoff checks:
127
+
128
+ | Check | Result | Status |
129
+ |----------|-------------------------------------------------|--------|
130
+ | Memory | MEMORY.md code-quality-cycle entry refreshed | ✓ |
131
+ | Git | tree clean (no savepoint needed) | ✓ |
132
+ | Open PRs | `gh pr list` → **1 open: #165** CI pass, mergeable | ⚠ surfaced |
133
+ | Files | resume anchor below; persona-feedback SSOT synced | ✓ |
134
+
135
+ **Resume anchor**
136
+ - **Current state:** Persona P0 fixes in progress on branch `fix/persona-p0`. 2 of 4 fixes landed.
137
+ - **Verified:** `npm run ci` exit 0 on current HEAD (typecheck + lint + coverage branches 88.4% + build). Docker scenario for `--with <id>` exit 0.
138
+ - **What's left:** Fixes 3 & 4 (wizard copy + COMPATIBILITY honesty). **Open PR #165 is not merged** — publishing is blocked until P0 complete (per persona-feedback-improvements.md). Branch `fix/persona-p0` unpushed for the last 2 commits.
139
+ - **Next action:** Implement fix 3 (wizard understanding-category copy), then re-run `npm run ci`.
140
+
141
+ **Suggested /compact line:**
142
+ `/compact Resume fix/persona-p0: do fix 3 (wizard copy) next, then ci; PR #165 open & blocks publish until all 4 P0 done`
143
+
144
+ Notice what the handoff does deliberately: the suggested `/compact` line leads with the **next action**
145
+ (working-backwards), while the anchor keeps the open PR visible in "what's left" instead of assuming it
146
+ shipped and separates the *verified* CI evidence from the *unverified* remaining fixes.
147
+
148
+ ## Pitfalls this guards against
149
+
150
+ - **Lossy summary-stripping** — a headline-only handoff looks complete but is unrecoverable because
151
+ the evidence and reasoning chains are gone. Fix: preserve-list + ADR provenance, not a prose blob.
152
+ - **Lost-in-the-Middle over-stuffing** — dumping full history into the new window doesn't help;
153
+ buried facts follow a U-shaped accuracy curve. The anchor stays short and structured.
154
+ - **Telephone-game decay** — fidelity erodes across successive compactions. Routing load-bearing
155
+ facts to stable memory/ADRs breaks that degradation, which a chain of summaries alone cannot.
156
+ - **Non-atomic snapshot** — handing off with a half-committed tree reconstructs an inconsistent
157
+ state. The git leg forces a clean or savepoint-committed tree.
158
+ - **Incomplete state save** — omitting the processed-vs-remaining boundary forces duplicate work or
159
+ silent re-execution of irreversible actions. The "what's left" field is mandatory.
160
+
161
+ ## References
162
+
163
+ - Anthropic — *Effective context engineering for AI agents* (Compaction; Structured Note-Taking /
164
+ Agentic Memory): https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
165
+ - Hendricks.ai — Checkpoint patterns (State Completeness, Atomicity, Recovery Automation):
166
+ https://hendricks.ai/insights/checkpoint-patterns-long-running-ai-agent-tasks
167
+ - XTrace — AI agent context handoff (Decisions / Artifacts / Preferences / Timeline; decision
168
+ provenance): https://xtrace.ai/blog/ai-agent-context-handoff
169
+ - Architectural Decision Records (ADR) — append-only single-decision records; route load-bearing
170
+ decisions here so the *why* survives compaction.
171
+ - Amazon *Working Backwards* — lead the anchor with the end state and the single next action.
172
+
173
+ ## Related skills
174
+
175
+ - **strategic-compact** — decides WHEN to compact (phase boundaries, token pressure). Pair with it:
176
+ it answers "compact now?", this skill executes the handoff when the answer is yes.
177
+ - **git-policy Session Cleanup** — the open-PR check (leg 2) is the same gate; this skill folds it
178
+ into the pre-compaction moment.
@@ -0,0 +1,241 @@
1
+ ---
2
+ name: gap-analysis-e2e
3
+ description: >-
4
+ Two-mode chained gap analysis. DETECT scans the current service end-to-end
5
+ through three lenses — north-star alignment, correctness (bugs), and
6
+ user-perspective (UX) — and enumerates concrete, severity-ranked gaps. Then
7
+ BENCHMARK researches how reference/benchmark services actually solved each
8
+ high-ranked gap and PROPOSES a closing approach. Use when the user says any of:
9
+ "북극성 기준으로 부족한 점", "사용자 관점에서 부족한 점", "다른 벤치마크 서비스는 이 부분을
10
+ 어떻게 해결했는지", "갭분석", "레퍼런스 서비스랑 비교해서 부족한 점 찾아줘" — or the
11
+ English equivalents: "gap analysis", "what are we missing vs the
12
+ ideal/north-star", "benchmark against reference services". Fires for both
13
+ Korean and English phrasing. NOT for a whole-codebase multi-dimension audit
14
+ (use ultracode-service-audit) or single-artifact prose review (use
15
+ multi-persona-review) — this is the narrower gap-vs-benchmark loop.
16
+ ---
17
+
18
+ # Gap Analysis E2E (reverse + competitive)
19
+
20
+ A targeted loop, not a sweep. You take the service as it is today, find where it
21
+ falls short of the ideal **and** where it is simply broken or awkward, then for
22
+ each real shortfall you go look at how a benchmark service solved that exact
23
+ problem before proposing a fix. The user's own framing:
24
+
25
+ > "현재 서비스 개발 상태에서 북극성 기준으로 부족하거나, 버그가 있거나, 사용자 관점에서 부족한 점을
26
+ > 인지하면 다른 벤치마크 서비스는 이부분을 어떻게 해결했는지를 확인해 가는 거지."
27
+
28
+ So the work fuses two moves that are usually done separately: **reverse-gap**
29
+ (distance from the north-star / ideal) and **competitive benchmark resolution**
30
+ (how others closed the same gap). DETECT finds gaps; BENCHMARK closes them. They
31
+ chain: detect → for each gap, benchmark → propose fix.
32
+
33
+ ## When to use
34
+
35
+ - You have a working-ish service and want to know, concretely, where it's behind
36
+ its own north star, where it has bugs, and where the UX disappoints.
37
+ - You've found a weak spot and want "다른 유사 깃허브 프로젝트 / 레퍼런스 SaaS 는 이걸
38
+ 어떻게 했지?" before inventing a fix.
39
+ - You want a ranked, auditable list of gap → benchmark evidence → proposed close,
40
+ not a vague "we could improve X."
41
+
42
+ Not for: directing the roadmap forward (that's `northstar-roadmap` — it DIRECTS;
43
+ this DETECTS gaps against the same north star). Not a full N-dimension audit
44
+ (that's `ultracode-service-audit`; this is a narrower gap-to-benchmark loop).
45
+
46
+ ---
47
+
48
+ ## MODE 1 — DETECT
49
+
50
+ Run three **independent** passes, then consolidate. The usability and gap-analysis
51
+ literature consistently reports that one undifferentiated pass systematically under-finds:
52
+ heuristic evaluation works precisely because several evaluators inspect separately
53
+ and you aggregate (Nielsen & Molich). Blend the lenses into one sweep and you
54
+ will miss large categories of gap. So scan each lens on its own terms, then merge.
55
+
56
+ A gap is only valid if it is a **concrete delta between two describable states** —
57
+ the observable current state and a specific ideal state. "It feels unpolished" is
58
+ an opinion; "the onboarding has no empty-state for zero projects, the ideal is a
59
+ guided first-run" is a gap. (Gap Analysis: Current → Future State.)
60
+
61
+ ### Lens A — North-star alignment (the reverse / planning lens)
62
+
63
+ For the ideal state, use a **Working-Backwards** artifact: write (or read, if it
64
+ exists in `docs/NORTH_STAR.md`) the one-paragraph press release of the finished,
65
+ ideal product, then reason backward. The gap is the distance between today's
66
+ product and that press release (Amazon PR-FAQ). Then make it testable with the
67
+ **North Star Framework**: is each surface tied to a north-star *input* metric? Two
68
+ gap shapes fall out automatically:
69
+
70
+ - an input lever that should move the north star but doesn't, and
71
+ - product surface area that contributes to **no** input (candidate for removal).
72
+
73
+ Where this repo's north star lives: `docs/NORTH_STAR.md` and `CLAUDE.md`
74
+ ("설치 서비스 = installer + curator"). Judge surfaces against *that*, not taste.
75
+
76
+ No `docs/NORTH_STAR.md`? Don't skip Lens A — write the one-paragraph
77
+ Working-Backwards press release *inline* from the README / `CLAUDE.md` vision
78
+ first, then score against it. The ideal state is the anchor; an absent file is no
79
+ excuse to drop the planning lens.
80
+
81
+ ### Lens B — Correctness (the bug lens)
82
+
83
+ Inspect for things that are simply wrong: broken flows, crashes, mismatched
84
+ advertised-vs-actual behavior, drift between docs and code. In this repo the
85
+ `no-false-ship` rule names the exact failure family — a `--with-*` flag that's
86
+ advertised but unregistered, a `--version` that lies, a category missing from the
87
+ wizard. Treat each as a correctness gap with a reproduction, not a hunch.
88
+
89
+ ### Lens C — User-perspective (the UX lens)
90
+
91
+ Judge the interface against **named criteria**, not vibes — Nielsen's 10
92
+ heuristics (visibility of system status, match to the real world, error
93
+ prevention, recognition over recall, etc.) so each finding traces to a principle
94
+ and is reproducible. For the heavy UX pass, hand this lens to the
95
+ **`multi-persona-review`** skill (independent persona evaluators) rather than
96
+ duplicating its machinery here. Remember the limits: heuristic inspection finds
97
+ roughly half of what real user testing finds and produces false positives — it's
98
+ a cheap first filter, not ground truth.
99
+
100
+ ### Score every gap before you spend benchmark effort
101
+
102
+ Never present an unranked gap list — the benchmark research in Mode 2 is the
103
+ expensive part, so it must run only on gaps that matter. Tag each gap with:
104
+
105
+ - **Severity 0–4** (Nielsen): roughly frequency × impact × persistence. 0 = not
106
+ really a problem, 4 = catastrophe, must fix before release.
107
+ - **Opportunity (optional, ODI)**: `Importance + max(Importance − Satisfaction, 0)`
108
+ (importance weighted twice; Ulwick). High-importance/low-satisfaction =
109
+ under-served, prime target. Low-importance/high-satisfaction = **over-served** —
110
+ flag it for *removal/simplification*, not addition. Surfacing over-served areas
111
+ is the structural antidote to feature bloat; a good scan proposes cuts too.
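The opportunity formula in the list above is easy to check by hand. A minimal TypeScript sketch, assuming both ratings share one 0-10 scale; the function names are hypothetical, the inputs are made up purely to show the arithmetic, and the `isOverServed` test (satisfaction above importance) is a simplifying assumption rather than a rule stated in the text:

```typescript
// Opportunity score as given in the text:
//   Importance + max(Importance - Satisfaction, 0)
// Assumption: importance and satisfaction are rated on the same 0-10 scale.
function opportunityScore(importance: number, satisfaction: number): number {
  return importance + Math.max(importance - satisfaction, 0);
}

// Simplifying assumption: "over-served" means satisfaction exceeds importance.
function isOverServed(importance: number, satisfaction: number): boolean {
  return satisfaction > importance;
}

// Illustrative inputs only; these are not measured data.
const underServedScore = opportunityScore(9, 4); // 9 + max(5, 0)
const overServedScore = opportunityScore(2, 9); // 2 + max(-7, 0)
```

Because the `max(..., 0)` term never goes negative, an over-served item scores no lower than its importance, which is why the removal flag has to come from a separate check and not from the score alone.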
112
+
113
+ DETECT is fully usable on **severity 0–4 alone**. ODI needs real importance and
114
+ satisfaction data; for a solo/tooling repo without it, *skip* the Opp. column
115
+ rather than inventing importance/satisfaction numbers — fabricated inputs launder a
116
+ guess as data. Reach for ODI only when you genuinely have user-sourced signal.
117
+
118
+ Keep the numbers as a prioritization aid, not proof — self-reported importance and
119
+ estimated severity remain judgment calls, however precise the numbers look.
120
+
121
+ **DETECT output** — one table:
122
+
123
+ | # | Lens | Gap (current → ideal delta) | Severity 0–4 | Opp. | Notes / repro |
124
+ |---|------|------------------------------|--------------|------|---------------|
125
+
126
+ Scale the rigor to severity: a 4 earns the full reverse-from-ideal write-up; a 1
127
+ gets a one-line pre-flight note. Don't run the heavy PR-FAQ ritual on every tiny
128
+ gap — that's analysis paralysis.
129
+
130
+ ---
131
+
132
+ ## MODE 2 — BENCHMARK (runs only on high-ranked gaps)
133
+
134
+ For each gap worth closing, work like a **competitive teardown**: take apart how a
135
+ reference service *actually* solves that exact problem and document the **verified
136
+ mechanism** — the real flow, states, and copy you observed — not the assumed
137
+ implementation. This mirrors `no-false-ship`: claim only what you inspected. If you
138
+ couldn't verify how they do it, **say so** ("could not inspect — inferred") rather
139
+ than fabricating a plausible-sounding mechanism. Fictional evidence is the named
140
+ failure mode of both Working-Backwards and this skill.
141
+
142
+ Sources, in order of trust: the running reference product / its repo (first-hand),
143
+ then docs, then write-ups. For "다른 유사 깃허브 프로젝트 보고 수정", read their actual
144
+ code path, not their README claims.
145
+
146
+ Then **PROPOSE** the closing approach in **jobs-to-be-done** terms — what job does
147
+ the user need done — and consciously resist the **feature-parity trap**. Copying a
148
+ competitor's feature list is a catch-up trap that breeds bloat (Zune out-featured
149
+ the iPod and lost; customers wanted the job done, not the features). For each gap,
150
+ decide explicitly: does closing it defend table-stakes, or does a *differentiated*
151
+ approach make the competitor's solution irrelevant? Propose accordingly.
152
+
153
+ Record each proposed fix **ADR-style** — rationale + the rejected benchmark
154
+ alternative — so the whole chain is auditable. (This repo already has an
155
+ `architecture-decision-record` convention and `docs/decisions/`.)
156
+
157
+ **BENCHMARK output** — per high-ranked gap:
158
+
159
+ ```
160
+ Gap #N (sev X): <one line>
161
+ Benchmark: <service> — VERIFIED how they solve it: <real flow/state/copy>
162
+ [or: COULD NOT INSPECT — inferred, treat as hypothesis]
163
+ Job: <the customer job this gap blocks>
164
+ Proposed: <closing approach in JTBD terms — differentiate, don't mirror>
165
+ Rejected: <the benchmark's exact approach, and why not, if diverging>
166
+ ```
167
+
168
+ ---
169
+
170
+ ## The chain, in order
171
+
172
+ 1. **Define states.** Current (observable) + ideal (Working-Backwards press
173
+ release, anchored to `docs/NORTH_STAR.md`). A gap is the delta between them.
174
+ 2. **DETECT** — three independent passes (north-star / correctness / UX via
175
+ `multi-persona-review`), each against named criteria.
176
+ 3. **Consolidate & score** — merge into one table; severity 0–4 + optional ODI
177
+ opportunity; tag over-served items for removal.
178
+ 4. **BENCHMARK** — only the high-ranked gaps; verified teardown of how a reference
179
+ service solves each; mark anything unverified.
180
+ 5. **PROPOSE** — closing approach in JTBD terms, differentiate over parity-match,
181
+ recorded ADR-style with the rejected alternative.
182
+
183
+ ---
184
+
185
+ ## Worked example (Input → Output)
186
+
187
+ **Input:** "이 설치 서비스 갭분석 해줘 — 북극성 대비 부족한 점이랑 버그랑 UX, 그리고 다른
188
+ 벤치마크는 어떻게 했는지."
189
+
190
+ **DETECT (consolidated, abridged):**
191
+
192
+ | # | Lens | Gap (current → ideal) | Sev | Opp | Notes |
193
+ |---|------|------------------------|-----|-----|-------|
194
+ | 1 | North-star | Wizard lists assets but never explains *why* each is vetted; north star is "이해하고 선택", so an unexplained list under-serves the core job | 3 | 14 | no provenance/★ shown at select time |
195
+ | 2 | Correctness | `--with-foo` advertised in README but crashes (flag unregistered) | 4 | — | repro: `install --with-foo` → CAC throw |
196
+ | 3 | UX | First run gives no "what happens next" status (Nielsen: visibility of system status) | 3 | 11 | via multi-persona-review |
197
+ | 4 | North-star (over-served) | Three near-duplicate verbose `--help` walls; low importance, high satisfaction | 1 | 2 | candidate for **removal** |
198
+
199
+ **BENCHMARK (gap #1, high-ranked):**
200
+
201
+ ```
202
+ Gap #1 (sev 3): wizard shows assets with no "why vetted" at decision time
203
+ Benchmark: VS Code Marketplace — VERIFIED: each extension card shows
204
+ install count + verified-publisher badge + star rating inline
205
+ in the pick list, so the trust signal sits at the moment of choice.
206
+ Job: "I need to trust this asset enough to install it, right here."
207
+ Proposed: Inline a one-line provenance (source repo + ★ + 'vetted: <date>')
208
+ on each wizard row — surface the trust signal at decision time.
209
+ Differentiator: we curate, so add a one-line *curator reason*,
210
+ which a raw marketplace can't.
211
+ Rejected: Marketplace's full detail-page-per-extension — too heavy for a
212
+ terminal wizard; defers the decision instead of supporting it.
213
+ ```
214
+
215
+ The gap #1 proposal lands as an ADR — e.g. `docs/decisions/ADR-0NN-wizard-provenance.md`
216
+ recording the inline-provenance decision and the rejected full-detail-page
217
+ alternative — so step 5's "record ADR-style" is concrete, not just advice.
218
+
219
+ Gap #2 (correctness, sev 4) skips benchmarking — it's a bug, fix directly and add
220
+ the drift guard `no-false-ship` requires. Gap #4 proposes deletion, not a
221
+ benchmark. That selective routing is the point: spend research only where it pays.
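The selective routing described above can be sketched as a small decision function. This is a hedged TypeScript illustration: the `Gap` shape, the `routeGap` name, and the severity cutoff of 3 for "high-ranked" are assumptions made for the example (the text does not fix a cutoff), and sending every correctness gap straight to a fix generalizes from the single gap #2 case:

```typescript
type Lens = "north-star" | "correctness" | "ux";
type Route = "benchmark" | "fix-directly" | "remove" | "note-only";

// Hypothetical record for one row of the DETECT table.
interface Gap {
  id: number;
  lens: Lens;
  severity: 0 | 1 | 2 | 3 | 4;
  overServed?: boolean;
}

// Assumption: "high-ranked" means severity >= 3.
function routeGap(gap: Gap, benchmarkCutoff: number = 3): Route {
  if (gap.overServed) return "remove"; // propose a cut, not research
  if (gap.lens === "correctness") return "fix-directly"; // bugs skip benchmarking
  return gap.severity >= benchmarkCutoff ? "benchmark" : "note-only";
}

// The four gaps from the worked example's DETECT table.
const gaps: Gap[] = [
  { id: 1, lens: "north-star", severity: 3 },
  { id: 2, lens: "correctness", severity: 4 },
  { id: 3, lens: "ux", severity: 3 },
  { id: 4, lens: "north-star", severity: 1, overServed: true },
];
const routes = gaps.map((g) => `${g.id}:${routeGap(g)}`);
```

Under these assumptions gaps #1, #2, and #4 route the way the worked example describes; gap #3 lands in the benchmark bucket only because of the assumed cutoff.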
222
+
223
+ ---
224
+
225
+ ## Cross-references (don't duplicate)
226
+
227
+ - **`multi-persona-review`** — owns the UX lens (Lens C). Invoke it; don't
228
+ re-implement persona evaluation here.
229
+ - **`northstar-roadmap`** — same north star, opposite direction: it *directs* the
230
+ roadmap forward; this *detects* gaps against it.
231
+ - **`ultracode-service-audit`** — the full N-dimension sweep. This skill is the
232
+ narrower, faster gap → benchmark loop when you don't need the whole audit.
233
+ - **`architecture-decision-record`** — record each proposed fix as an ADR.
234
+
235
+ ## Notes on rigor (where deeper detail would live)
236
+
237
+ If a future version needs the full scoring rubrics (the complete Nielsen 10-item
238
+ checklist text, the ODI questionnaire wording) or per-domain benchmark source
239
+ lists, the option is to summarize here and split the long-form into a sibling
240
+ `reference.md` — no such file exists yet, and this SKILL.md is self-sufficient
241
+ without it. Keep SKILL.md the practical map, not the encyclopedia.