@uzysjung/agent-harness 26.86.0 → 26.87.0

@@ -0,0 +1,209 @@
1
+ ---
2
+ name: multi-persona-review
3
+ description: >-
4
+ A panel-review skill that critiques ONE artifact (launch post, README, doc, markdown, plan,
5
+ design) via 3-5 disjoint user-perspective personas running in parallel, then synthesizes deduped,
6
+ severity-ranked improvement points (P0/P1/P2). Use when the user says "작성글을 사용자 관점의
7
+ 페르소나를 여러명 만들어서 (손넷 모델정도로) 피드백 받아바", "다면 리뷰 해볼까", "페르소나로 리뷰",
8
+ "여러 관점으로 피드백", or in English "multi-persona review", "review this from different user
9
+ perspectives", "get persona feedback on this post/README/doc", "panel review this artifact".
10
+ Lighter than a full service audit — point it at ONE artifact, not a whole codebase.
11
+ ---
12
+
13
+ # Multi-Persona Review (다면페르소나 워크플로우 리뷰)
14
+
15
+ Run a small panel of realistic target-user personas over one artifact, independently and in
16
+ parallel, then synthesize their findings into a deduped, prioritized fix list. This is how the
17
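+ user actually works: "작성글을 사용자 관점의 페르소나를 여러명 만들어서 손넷 모델정도로 피드백 받아바" ("make several user-perspective personas for this draft and get feedback at roughly Sonnet-model level")
18
+ and "이부분도 다면 리뷰 해볼까?" ("shall we run a multi-persona review on this part too?"): 4-5 Sonnet-tier personas across 1-2 passes over a launch post,
19
+ yielding prioritized P0-P2 fixes.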
20
+
21
+ ## When to use
22
+
23
+ - A draft is "done" but you want to catch what the author is too fatigued to see: launch post, README,
24
+ PRD/plan, doc, marketing copy, a design.
25
+ - The user names personas or "다면 리뷰" / "여러 관점" / "multi-persona" / "panel review".
26
+ - You want **reproducible, severity-ranked** feedback, not one reviewer's gut reaction.
27
+
28
+ Do **not** use this for whole-codebase quality work — that's `ultracode-service-audit`. This skill
29
+ is deliberately lighter: one artifact, one panel, one synthesis. For surfacing missing user
30
+ journeys end-to-end, this feeds the UX lens of `gap-analysis-e2e`.
31
+
32
+ ## Why a panel beats one reviewer (the evidence)
33
+
34
+ The whole method rests on one empirical fact: **independent reviewers find largely
35
+ non-overlapping problems.**
36
+
37
+ - **Heuristic Evaluation (Nielsen & Molich) + the 3-5 evaluator rule** — a single evaluator
38
+ catches only ~35% of usability issues; aggregating independent evaluators raises coverage to
39
+ ~75% at five, with sharp diminishing returns beyond. The value comes from *low overlap between
40
+ perspectives*, not any one reviewer being thorough. Some of the hardest issues are found by an
41
+ evaluator who otherwise finds few. Each judges against the *same explicit checklist* so reviews
42
+ stay comparable and dedupable.
43
+ https://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/theory-heuristic-evaluations/
44
+ - **Panel of LLM evaluators (PoLL)** — a panel of several smaller, *disjoint* judges beats one
45
+ large judge, shows less self-preference bias, and costs ~7x less. This is the cost-tier reason
46
+ the user runs the persona panel at Sonnet tier and reserves the main model for orchestration and
47
+ synthesis. https://arxiv.org/abs/2404.18796
48
+ - **"Nine Judges, Two Effective Votes"** — panels help *only to the extent members fail
49
+ independently*. A 9-judge panel carried only ~2 independent votes' worth of information because
50
+ the models made the same mistakes on the same items. The bottleneck is **correlated reviewers,
51
+ not panel size or aggregation math** — so persona design must maximize genuine viewpoint
52
+ diversity, not nominal count. https://arxiv.org/abs/2605.29800
53
+ - **LLM-as-persona-reviewer vs human experts (GPT-4o study)** — persona review finds many real
54
+ issues but also emits false positives humans wouldn't flag, and misses issues needing embodied
55
+ experience. Recommended posture: a **hybrid** where personas generate candidate findings that a
56
+ human validates — never a replacement for human judgment. https://arxiv.org/pdf/2506.16345
57
+ - **RICE prioritization (Intercom)** — (Reach × Impact × Confidence) / Effort turns rough guesses
58
+ into one comparable score, down-weighting low-confidence/high-effort items and countering the
59
+ reviewer's bias toward what they'd personally use. A lightweight analog gives a *defensible,
60
+ reproducible* map from findings to P0/P1/P2.
61
+ https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/
62
+
63
+ ## Core workflow
64
+
65
+ ### 1. Frame the artifact (orchestrator, main model)
66
+
67
+ Capture three things the personas will all share:
68
+ - **Goal** — what is this artifact trying to achieve? (e.g. "get a developer to `npx` install in
69
+ under 2 minutes and star the repo")
70
+ - **Audience** — who is the real target reader?
71
+ - **Rubric** — the shared checklist every persona scores against, so findings are comparable and
72
+ dedupable. Default rubric (adapt to the artifact): *clarity of value prop · first-action
73
+ friction · credibility/trust signals · scannability · accuracy/honesty · accessibility ·
74
+ call-to-action*. Without a shared rubric, red-team reviews decay into proofreading and generic
75
+ opinions, and findings stop being comparable across personas.
76
+
77
+ ### 2. Design 3-5 genuinely disjoint personas
78
+
79
+ Cap the panel at five — coverage flattens beyond that, and extra personas mostly inflate tokens
80
+ and false confidence (the "Nine Judges" trap). Engineer **diversity, not count**: pick personas
81
+ with disjoint goals, contexts, and *failure-fears* so their blind spots don't correlate. A strong
82
+ default spread:
83
+
84
+ | Persona | Lens / what they fear |
85
+ |---|---|
86
+ | Skeptical newcomer | Doesn't know the domain; fears wasting time on hype. Tests "do I get it in 10s?" |
87
+ | Time-pressured expert | Knows the domain; fears fluff between them and the command. Tests scannability + first action. |
88
+ | Accessibility-dependent user | Screen reader / low vision / non-native reader. Tests structure, alt text, plain language. |
89
+ | Hostile/adversarial reader | Looks for overclaims, vague benefits, anything to dismiss. Tests honesty + credibility. |
90
+ | Adjacent-tool migrant *(optional 5th)* | Already uses a competitor. Tests differentiation + "why switch?". |
91
+
92
+ Swap personas to fit the artifact (e.g. for a PRD: implementing engineer, on-call SRE, PM,
93
+ security reviewer). The test is always: would these two personas make the *same* mistake? If yes,
94
+ they're not independent — replace one.
95
+
96
+ ### 3. Review in parallel, independently (Sonnet-tier panel)
97
+
98
+ Spawn one sub-agent per persona via the **Task tool** (or the harness's sub-agent mechanism). Each
99
+ one gets the artifact + goal + audience + the *same* rubric, and **must not see the other personas'
100
+ output** — independence is the precondition that makes aggregation add information. Anchoring on a
101
+ peer collapses the panel toward one effective vote.
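
The independence requirement above can be sketched as follows. This is a minimal illustration, not the harness's actual sub-agent API: `Brief`, `review_as`, and `run_panel` are hypothetical names, and `review_as` is a stub standing in for whatever call spawns a real sub-agent. The point it shows is structural: every reviewer receives the same frozen brief and nothing from its peers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    """Shared inputs every persona receives (hypothetical container)."""
    artifact: str
    goal: str
    audience: str
    rubric: tuple


def review_as(persona: str, brief: Brief) -> dict:
    """Stub for one sub-agent call; replace with the real sub-agent mechanism."""
    # Only the persona name and the shared brief are visible here: no peer output.
    note = f"[{persona}] placeholder finding against '{brief.rubric[0]}'"
    return {"persona": persona, "findings": [note]}


def run_panel(personas: list, brief: Brief) -> list:
    """Run all persona reviews in parallel and return results in persona order."""
    if not 3 <= len(personas) <= 5:
        raise ValueError("panel should have 3-5 personas")
    with ThreadPoolExecutor(max_workers=len(personas)) as pool:
        # pool.map preserves input order; each call gets the same immutable brief.
        return list(pool.map(lambda p: review_as(p, brief), personas))
```

Because results are only gathered after every reviewer has returned, nothing a peer wrote can leak into another persona's review.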
102
+
103
+ Prefer pinning the persona sub-agents to a cheaper tier (Sonnet) — see the cost-tier note. But this
104
+ degrades gracefully: if the harness can't pin sub-agents to a specific model, just run the panel on
105
+ the default sub-agent model and note in the step-6 coverage caveat that the panel ran at the
106
+ orchestrator tier. The tier is an economy, not a hard prerequisite.
107
+
108
+ Each persona returns findings as **strengths / weaknesses / specific recommendations**. Require
109
+ every finding to be specific and actionable: **quote the offending passage and propose a concrete
110
+ fix.** Ban vague "needs work" notes — that's the classic red-team failure mode (briefing +
111
+ structured findings + independence are the load-bearing parts, not the critical attitude).
112
+ https://loopio.com/blog/red-team-review/
113
+
114
+ ### 4. Synthesize: dedupe, but preserve minority findings (orchestrator, main model)
115
+
116
+ Collapse overlapping findings into one entry, noting *how many personas raised it* (frequency is a
117
+ prioritization signal). **But never drop a single-persona finding** — heuristic-evaluation data
118
+ says the hardest, most valuable issues are often raised by only one reviewer. Majority-vote /
119
+ consensus filtering would silently discard exactly those. Keep them, tagged as single-source.
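
A minimal sketch of this dedupe step, assuming the orchestrator has already assigned a shared `issue_key` to findings it judges to be the same problem (that judgment is the hard part and is not automated here). The function name and record shape are illustrative only.

```python
from collections import defaultdict


def synthesize(raw_findings):
    """Merge findings that share an issue key; never drop single-source ones.

    raw_findings: iterable of (persona, issue_key, note) tuples.
    """
    grouped = defaultdict(list)
    for persona, issue_key, note in raw_findings:
        grouped[issue_key].append((persona, note))

    merged = []
    for issue_key, hits in grouped.items():
        personas = sorted({persona for persona, _ in hits})
        merged.append({
            "issue": issue_key,
            "personas": personas,
            "frequency": len(personas),            # prioritization signal
            "single_source": len(personas) == 1,   # tagged, not filtered out
            "notes": [note for _, note in hits],
        })
    # Most frequently raised first; single-source findings stay in the list.
    merged.sort(key=lambda finding: -finding["frequency"])
    return merged
```

Note that there is no threshold anywhere in the function: frequency only orders the output, it never removes an entry.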
120
+
121
+ ### 5. Prioritize with a transparent rule → P0/P1/P2
122
+
123
+ Map each finding to a bucket with a **reproducible** rule, not by gut feel or by which persona
124
+ phrased it loudest. Use a RICE-style or **severity × frequency** score:
125
+
126
+ - **P0** — blocks the artifact's goal for many readers (e.g. value prop unreadable in first
127
+ screen; a false claim). High impact × high confidence, any effort.
128
+ - **P1** — meaningfully hurts conversion/trust but has a workaround.
129
+ - **P2** — polish, edge-reader, or low-confidence/high-effort items.
130
+
131
+ Show the score inputs so the ranking is auditable.
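
One way to make the bucket rule reproducible is sketched below. The exact thresholds are assumptions, not part of this skill: severity sets the base bucket, three or more personas promote a finding one level, and a low-confidence finding drops to P2. The returned record carries its inputs so the ranking stays auditable.

```python
BASE_LEVEL = {"high": 0, "med": 1, "low": 2}  # assumed severity scale


def bucket(severity, frequency, confident=True):
    """Map a finding to P0/P1/P2 and return the inputs alongside the result."""
    level = BASE_LEVEL[severity]
    if frequency >= 3:
        level = max(0, level - 1)   # widely raised: promote one level
    if not confident:
        level = 2                   # low confidence: treat as polish until validated
    return {
        "bucket": f"P{level}",
        "severity": severity,
        "frequency": frequency,
        "confident": confident,
    }
```

With these assumed thresholds, `high × 3` and `high × 1` both land in P0 and `med × 1` lands in P1.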
132
+
133
+ ### 6. Triage as candidates, state coverage honestly
134
+
135
+ Present the list as **candidate findings needing a validation pass**, not gospel. Flag likely
136
+ false positives and note where real-user confirmation is warranted before committing fixes — LLM
137
+ personas both miss embodied issues and invent non-issues. End with an honest coverage caveat: a
138
+ panel never finds every issue and offers no systematic fix generation (Nielsen's own caveat).
139
+ Claiming exhaustiveness here would be a no-false-ship violation.
140
+
141
+ **Second pass (the "1-2 passes"):** run the same panel again *after fixes land* to confirm the P0s
142
+ are actually closed and that the edits didn't introduce new issues. One pass to find, one to verify
143
+ — a third rarely pays off.
144
+
145
+ ## Worked example (Input → Output)
146
+
147
+ **Input:** Trigger: "이 런치 포스트 다면 리뷰 해볼까? 손넷으로 페르소나 4명." ("Shall we run a multi-persona review on this launch post? Four personas, on Sonnet.") Artifact: a launch
148
+ post for an npm installer CLI. Goal: "reader runs `npx ... init` and stars the repo." Audience:
149
+ indie devs scanning a feed.
150
+
151
+ **Panel (parallel, Sonnet tier):** skeptical newcomer · time-pressured expert ·
152
+ accessibility-dependent reader · hostile reader.
153
+
154
+ **Raw findings (excerpt):**
155
+ - Newcomer: "Paragraph 1 says 'context-engineered harness' — I don't know what that buys me.
156
+ Quote: *'A context-engineered harness for agentic CLIs.'* Fix: lead with the outcome — *'Install
157
+ vetted plugins, skills, and rules across 4 AI CLIs in one command.'*"
158
+ - Expert: "The install command is below three paragraphs of philosophy. Fix: move `npx` line to
159
+ the first screen." *(also raised by newcomer → frequency 2; becomes 3 once the accessibility finding below is merged in)*
160
+ - Accessibility: "Demo is a GIF with no text fallback; the actual command only appears in the GIF.
161
+ Fix: put the command in a code block as text."
162
+ - Hostile: "'Works everywhere' — claims 4 CLIs but only shows Claude. Fix: either show all four or
163
+ soften to 'Claude today, others in progress.'" *(single-source, kept)*
164
+
165
+ **Synthesized + prioritized output:**
166
+
167
+ | ID | Finding (deduped) | Personas | Sev × Freq | Bucket |
168
+ |---|---|---|---|---|
169
+ | F1 | Install command buried below the fold / inside GIF only | expert, newcomer, a11y | high × 3 | **P0** |
170
+ | F2 | Value prop is jargon, not outcome, in first screen | newcomer | high × 1 | **P0** |
171
+ | F3 | "Works everywhere" overclaims vs. evidence shown | hostile | med × 1 | **P1** |
172
+ | F4 | Demo GIF has no text alternative | a11y | med × 1 | **P1** |
173
+
174
+ **Caveat returned to user:** candidate findings from a 4-persona Sonnet panel; F3 (overclaim) is
175
+ worth confirming against what the post can actually demo before rewording. Not exhaustive — a real
176
+ indie-dev read may surface more.
177
+
178
+ This mirrors the user's real run (memory: `persona-feedback-improvements`, P0-before-publish gate).
179
+
180
+ ## Cost-tier note
181
+
182
+ Run the **persona panel at a cheaper tier (Sonnet)** — PoLL shows a disjoint panel of smaller
183
+ judges beats one big judge at a fraction of the cost. Reserve the **main/orchestrator model** for
184
+ framing the rubric and synthesizing (steps 1, 4-6), where reasoning quality pays off most.
185
+
186
+ ## Pitfalls to avoid
187
+
188
+ - **False diversity** — personas that share the model's default assumptions give far fewer than N
189
+ views. Design for disjoint fears; if two would make the same mistake, replace one.
190
+ - **Scaling count to fix quality** — past ~5 personas you mostly buy tokens and noise. Fix
191
+ independence, not size.
192
+ - **Consensus filtering** — dropping single-persona findings discards the rare, hard issues that
193
+ are the whole point.
194
+ - **Anchoring** — letting personas see each other's output before judging collapses the panel.
195
+ - **Opaque P0/P1/P2** — ranking by vibe or loudest wording is unauditable. Show the score.
196
+ - **Over-claiming coverage** — report it as candidate findings, never "found everything."
197
+
198
+ ## Cross-references
199
+
200
+ - `ultracode-service-audit` — full multi-dimensional audit of a whole service/codebase; this skill
201
+ is the lighter, single-artifact UX lens.
202
+ - `gap-analysis-e2e` — this skill feeds its UX/user-journey lens.
203
+ - `critique` — design-specific persona critique with anti-pattern detection; reach for it when the
204
+ artifact is a UI rather than prose/markdown.
205
+
206
+ > This SKILL.md is complete and self-contained — everything needed to run a panel is above. If the
207
+ > method ever needs deeper appendices (full default rubrics per artifact type, persona prompt
208
+ > templates, a RICE scoring worksheet), a `reference/` file alongside this SKILL.md is the place to
209
+ > add them. That's a future-extension option, not a missing dependency.
@@ -0,0 +1,176 @@
1
+ ---
2
+ name: northstar-roadmap
3
+ description: >-
4
+ Read the project's NORTH_STAR / vision doc, measure current state against the goal, then
5
+ propose a forward direction plus prioritized feature proposals — persisted as a durable
6
+ roadmap in docs/plans + memory so the plan survives /compact and new sessions. Use when
7
+ the user asks where the project should go next or wants a backlog grounded in the vision.
8
+ Fires on the user's real phrasings: "앞으로 어떤 방향으로 개선·발전시킬지 고민해봐",
9
+ "NORTH.md / NORTH_STAR 보고 나아갈 방향 + 기능 제안", "나아갈 방향 + 기능제안 (수용 → 계획 수립하고 메모리에 기록)",
10
+ "북극성 정렬 로드맵", as well as the English equivalents: "what direction should we take next",
11
+ "propose a roadmap / feature backlog from the north star", "plan the next milestones and save it
12
+ to memory". Not for detecting bugs or auditing current quality (see gap-analysis-e2e /
13
+ ultracode-service-audit) — this skill DIRECTS forward planning.
14
+ ---
15
+
16
+ # North-Star Roadmap (북극성 정렬 로드맵 + 기능 제안)
17
+
18
+ Turn a vision document into a forward direction and a ranked feature backlog, then write it
19
+ somewhere durable. The point is alignment, not idea generation: every proposal must trace
20
+ upward to the north-star, and the result must outlive the conversation that produced it.
21
+
22
+ ## When to use
23
+
24
+ Reach for this skill when the user steps back from day-to-day work and asks where the project
25
+ should head — typically with one of these (their actual phrasings):
26
+
27
+ - "앞으로 어떤 방향으로 개선·발전시킬지 고민해봐"
28
+ - "NORTH.md / NORTH_STAR 보고 나아갈 방향 + 기능 제안"
29
+ - "(제안) 수용 → 계획 수립하고 메모리에 기록"
30
+ - English: "what direction next", "propose a roadmap from the north star", "save the plan to memory"
31
+
32
+ Do **not** use it to find what's broken right now. Detecting defects, gaps, or quality regressions
33
+ is the job of the sibling skills below; this skill consumes their findings and points forward.
34
+
35
+ ## Why these steps (the frameworks underneath)
36
+
37
+ The workflow chains four established product-strategy methods so the output is defensible rather
38
+ than vibes. Reason with each — don't just cite it:
39
+
40
+ - **North Star Framework** (Amplitude) — a single North Star Metric is the destination; 3–5
41
+ directly-influenceable *Inputs* are the levers. You assess "current vs goal" against the inputs
42
+ (leading indicators teams can move), not lagging vanity numbers.
43
+ https://amplitude.com/books/north-star/about-north-star-framework
44
+ - **Working Backwards / PR-FAQ** (Amazon) — for a major proposal, sketch the future end-state first
45
+ (a one-line "press release" of the value the user gets), then derive the features. This forces
46
+ clarity and stops "we can build X because we know how" reasoning.
47
+ https://workingbackwards.com/concepts/working-backwards-pr-faq-process/
48
+ - **OKR lineage, not cascade** (Gothelf) — every roadmap item must have a *parent* it supports in
49
+ the north-star. Items invented bottom-up that don't ladder up get cut. This is the core alignment test.
50
+ https://jeffgothelf.com/blog/aligning-not-cascading-okrs-with-an-okr-lineage/
51
+ - **RICE / ICE scoring** (Intercom) — rank proposals by `(Reach × Impact × Confidence) / Effort`
52
+ (RICE), or `Impact × Confidence × Ease` (ICE) when data is thin. Confidence is where you honestly
53
+ discount exciting-but-unproven ideas. Scores are *inputs to a decision, not the verdict* — log
54
+ every strategic override.
55
+ https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/ ·
56
+ https://agileseekers.com/blog/feature-prioritization-using-rice-and-ice-models-in-product-roadmaps
57
+ - **Theme-based Now / Next / Later** — organize the roadmap by outcome themes and horizons, not
58
+ dated feature promises, so it ages gracefully and the why/what stays above the how/when.
59
+
60
+ ## Core workflow
61
+
62
+ ### 1. READ the north-star and restate it as Metric + Inputs
63
+
64
+ Read the project's vision doc (here: `docs/NORTH_STAR.md` — it already defines the North Star
65
+ Statement, the NSM, and measured Inputs). Restate the goal as **one North Star Metric + 3–5
66
+ influenceable Inputs**. If the doc already has them, lift them; if it only has a prose vision,
67
+ derive a candidate set and show it for confirmation.
68
+
69
+ Sanity-check the metric before trusting it:
70
+ - Is it a **leading** indicator of value, or a lagging one (raw revenue, registered users, page
71
+ views)? Lagging metrics are "what's done is done" — you can't steer by them.
72
+ - Is it **gameable**? "If you can move it directly without delivering value, it's not a good
73
+ north-star." Flag it instead of silently planning against a corrupt target.
74
+
75
+ > In this repo the literal north-star is **GitHub stars** (per memory and the service-audit
76
+ > roadmap), with the NORTH_STAR NSM (HITO ≤ 3/feature, low re-clarification) as the *value* the
77
+ > stars are supposed to reward. Plan toward stars **via** the value inputs, not by gaming the count.
78
+
79
+ ### 2. ASSESS current state against each Input — expose the gap
80
+
81
+ For each Input, state where the project is today vs target, using real evidence (existing plans,
82
+ audit output, metrics, code state). The deliverable is the **gap**: the distance between now and
83
+ the north-star, per lever. Be honest about unknowns — an unmeasured input is a gap too.
84
+
85
+ ### 3. PROPOSE direction + features by working backwards
86
+
87
+ First name the **forward direction** in a sentence or two — the theme(s) that close the biggest
88
+ gaps. Then, for each significant proposal:
89
+ - Write a one-line mini-PR (the future state: who gets what value once it ships).
90
+ - List the concrete feature(s) that realize it.
91
+ - State its **parent** — which Input / north-star pillar it supports. **No parent → cut it.**
92
+ This is the alignment gate that prevents bottom-up feature churn.
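
The alignment gate can be expressed as a simple filter. This is an illustrative sketch: the dictionary shape and the function name are assumptions, and the cut list is returned rather than discarded so the report can show what was dropped and why.

```python
def alignment_gate(proposals, inputs):
    """Split proposals into (kept, cut) by whether their parent is a known Input.

    proposals: list of dicts with an optional "parent" key.
    inputs: set of Input identifiers taken from the north-star doc.
    """
    kept, cut = [], []
    for proposal in proposals:
        if proposal.get("parent") in inputs:
            kept.append(proposal)
        else:
            cut.append(proposal)   # no parent, or a parent that is not an Input
    return kept, cut
```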
93
+
94
+ ### 4. PRIORITIZE with RICE (or ICE) and record overrides
95
+
96
+ Score each proposal. Use RICE when you have reach/effort signal; ICE for a thin-data first pass.
97
+ Write the numbers down so the ranking is auditable. Then apply judgment: dependencies, strategic
98
+ table-stakes, and north-star fit may override a score — **log WHY** for every override (the same
99
+ honesty the repo's `no-false-ship` and ADR/Decision-Log rules demand). Treat the score as a
100
+ decision aid, never an autopilot.
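
The scoring and override log might look like the sketch below. The formulas are the RICE and ICE definitions quoted earlier in this skill, and the sample figures are the ones used in its worked example. The `log_override` helper and the record shape are hypothetical.

```python
def rice(reach, impact, confidence, effort):
    """(Reach x Impact x Confidence) / Effort."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort


def ice(impact, confidence, ease):
    """Impact x Confidence x Ease, for a thin-data first pass."""
    return impact * confidence * ease


# (reach, impact, confidence, effort) per proposal, from the worked example.
proposals = {"P1": (5, 2, 0.8, 2), "P2": (2, 3, 0.6, 5), "P3": (4, 3, 0.7, 3)}
scores = {name: round(rice(*args), 2) for name, args in proposals.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

overrides = []


def log_override(name, reason):
    """Record why judgment departed from the score, next to the score itself."""
    overrides.append({"proposal": name, "rice": scores[name], "reason": reason})
```

Keeping the raw inputs in `proposals` rather than only the final scores is what makes the ranking auditable later.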
101
+
102
+ ### 5. PERSIST as a Now / Next / Later roadmap in docs + memory
103
+
104
+ Write the result to a durable artifact so it survives `/compact` and new sessions:
105
+ - **Update the existing roadmap SSOT** under `docs/plans/` (here: `service-audit-roadmap.md`) in
106
+ place, organized by outcome themes in **Now / Next / Later** horizons — not dated commitments.
107
+ Prefer one living doc over spawning a dated parallel (e.g. `northstar-roadmap-2026-06.md`): two
108
+ roadmaps drift. Only create a separate dated horizon doc if the user explicitly asks for one.
109
+ - A short anchor line in `memory/MEMORY.md` pointing at that doc, so it's re-referenced at session
110
+ start as a Persistent Anchor (per CLAUDE.md: "SPEC/PRD는 매 세션 시작 시 재참조", i.e. re-reference the SPEC/PRD at the start of every session).
111
+ - If a real architectural decision was made (e.g. dropping a direction), record an ADR under
112
+ `docs/decisions/`, numbered as the next free integer (check the highest existing `ADR-NNN`
113
+ first), per `change-management.md`.
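
Picking the next free ADR number could be done along these lines. This sketch assumes ADR files start with `ADR-` followed by digits and that IDs are zero-padded to three digits; both are guesses from the `ADR-NNN` pattern above, so check the repository's actual naming first.

```python
import re

ADR_PATTERN = re.compile(r"^ADR-(\d+)")  # assumed filename prefix


def next_adr_id(filenames, width=3):
    """Return the next free ADR identifier given the existing file names."""
    numbers = [
        int(match.group(1))
        for name in filenames
        if (match := ADR_PATTERN.match(name))
    ]
    # max(..., default=0) makes the first ADR in an empty directory ADR-001.
    return f"ADR-{max(numbers, default=0) + 1:0{width}d}"
```

In practice the file names would come from listing `docs/decisions/`, for example with `os.listdir`.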
114
+
115
+ Then report back: the direction, the ranked table, and the file paths written. Mark it a **living
116
+ document** — both the metric and the priorities are expected to be revised as strategy evolves.
117
+
118
+ ## Worked example
119
+
120
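+ **Input (user):** `/northstar-roadmap NORTH_STAR 보고 나아갈 방향 + 기능 제안 → 계획 세워서 메모리에 기록` ("read NORTH_STAR, propose a forward direction and features, then make a plan and record it in memory")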
121
+
122
+ **Output (abridged):**
123
+
124
+ > **1. Metric + Inputs** (from `docs/NORTH_STAR.md`)
125
+ > North Star Metric: GitHub stars as the adoption signal that the harness delivers real value.
126
+ > Inputs: (a) HITO ≤ 3 / feature, (b) Promise = Implementation (advertised assets 100% work),
127
+ > (c) first-run install success, (d) vetted+security-audited asset count, (e) 4-CLI parity.
128
+ > *Leading-indicator check:* stars alone are gameable, so we plan against inputs (a)–(e), the value
129
+ > the stars reward — not the raw count.
130
+ >
131
+ > **2. Gap** — (a) HITO unmeasured in fresh installs · (b) closed after no-false-ship work · (c)
132
+ > install-matrix green but no post-publish telemetry · (e) plugins don't yet reach non-Claude CLIs.
133
+ >
134
+ > **3. Direction** — "Make the *first 60 seconds* prove the value before asking for a star."
135
+ > Proposals: P1 first-run telemetry (parent: Input c); P2 plugin→codex/opencode equivalence
136
+ > (parent: Input e); P3 public, continuously-tested compat+security matrix artifact (parent: Input d).
137
+ > Each carries a one-line mini-PR.
138
+ >
139
+ > **4. RICE**: P1 (R 5 × I 2 × C 0.8 / E 2 = 4.0) · P3 (R 4 × I 3 × C 0.7 / E 3 = 2.8) · P2 (R 2 × I 3 × C 0.6 / E 5 ≈ 0.7).
140
+ > *Override:* P3 promoted above its score — it's the repo's defensibility wedge (table-stakes for
141
+ > a 0-star cold start). Logged here and in the ADR.
142
+ >
143
+ > **5. Persisted** — updated the SSOT `docs/plans/service-audit-roadmap.md` in place (Now: P1 ·
144
+ > Next: P3 · Later: P2), added the anchor to `memory/MEMORY.md`, opened the next free ADR
145
+ > `docs/decisions/ADR-0NN-p3-override.md` for the strategic override.
146
+
147
+ ## Pitfalls to design against
148
+
149
+ - **Vanity / lagging / gameable north-star** — the two failure modes to refuse: picking a metric
150
+ you can't influence (revenue, raw users) or one you can move without delivering value.
151
+ - **False precision in RICE/ICE** — subjective Reach/Impact/Effort treated as exact truth. Confidence
152
+ exists to discount shaky estimates; skipping it yields authoritative-looking wrong rankings.
153
+ - **Score on autopilot** — shipping the top-RICE item while ignoring dependencies or strategic fit.
154
+ - **Dated feature-list roadmap** — timeline promises rot; outcome themes in Now/Next/Later age better.
155
+ - **Bottom-up idea dump** — proposals that don't ladder up to an Input. The alignment gate (step 3)
156
+ is the cure.
157
+ - **Plan that doesn't persist** — a great assessment that lives only in the chat and is lost at
158
+ `/compact`. The artifact in step 5 is the whole point.
159
+
160
+ ## Cross-references (siblings — do not duplicate)
161
+
162
+ - **gap-analysis-e2e** — *detects* north-star gaps end-to-end. This skill consumes those gaps as the
163
+ evidence in step 2.
164
+ - **ultracode-service-audit** — produces a multi-dimension audit and roadmap of *current* problems.
165
+ This skill takes that roadmap as input and points it forward.
166
+ - **strategic-compact** / project ADR + plan-SSOT conventions — the persistence mechanism (step 5)
167
+ reuses them rather than reinventing.
168
+
169
+ > Audit and gap skills answer "what's wrong now?". This skill answers "where do we go, and in what
170
+ > order?" — and makes the answer durable.
171
+
172
+ ## Reference (progressive disclosure)
173
+
174
+ This SKILL.md is the operating summary. If deeper method detail is ever needed — full RICE worked
175
+ calculations, a PR-FAQ template, or a roadmap-doc skeleton — add a `reference.md` beside this file
176
+ and link it here. Keep SKILL.md lean; the user dislikes verbose notepad docs.
@@ -0,0 +1,222 @@
1
+ ---
2
+ name: ultracode-service-audit
3
+ description: >-
4
+ Run a multi-agent, adversarially-verified full-service audit across 7 dimensions
5
+ (code / UX / scalability / planning+north-star / security / promotion / extensible),
6
+ separating findings into confirmed / unverified / rejected and producing a
7
+ priority-ranked, M-numbered milestone roadmap (as many milestones as the findings warrant).
8
+ Use when the user says
9
+ "ultracode 전체 서비스 점검", "전체 서비스를 점검하자", "코드·UX·확장성·기획·북극성지표·보안·홍보 문제점을 파악하고 우선순위에 따라 개선",
10
+ "다차원 서비스 감사", or in English "audit the whole service / full multi-dimensional service audit /
11
+ find code, UX, scalability, planning, security, and marketing problems and prioritize fixes".
12
+ The heavyweight superset audit — orchestrate it as a Workflow with fan-out finders and an adversarial verify pass.
13
+ ---
14
+
15
+ # Ultracode Service Audit
16
+
17
+ The heavyweight, multi-agent audit of an *entire* service across many dimensions at once.
18
+ The fan-out is orchestrated as a **Workflow / multi-agent run, and it can be large** — the real
19
+ run drove many agents in parallel, well beyond the seven-agent minimum. "ultracode" implies that heavyweight
20
+ parallelism: a finder (often several) per dimension plus a separate squad of verifiers. Where a
21
+ single skill inspects one axis (UX, or code, or strategy), this one fans out finder agents per
22
+ dimension, then runs a **separate adversarial verification pass** so that only findings that
23
+ survive cross-examination are reported as real. The output is one priority-ranked roadmap where
24
+ every item is dimension-tagged, evidence-graded, and traceable to the product's North Star.
25
+
26
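+ This is the skill behind the user's real request (turn 94, translated from Korean):
27
+
28
+ > "ultracode: based on what has been developed so far, let's audit the whole service. Identify the code, UX,
29
+ > scalability, planning and north-star-metric, security, and promotion problems, and the improvements for each ...
30
+ > then improve them in priority order"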
31
+
32
+ That run produced **확정 (confirmed) 29 / 미검증 (unverified) 0 / 기각 (rejected) 8**. The three-way split
33
+ is not decoration; it is the whole point. The **미검증 0 was *that run's* outcome, not a
34
+ guarantee the bucket goes unused**: the unverified bucket is load-bearing and must appear in the
35
+ report as soon as any finding ends up with no verifier votes. A finding nobody could verify never gets
36
+ reported as fact.
37
+
38
+ ## When to use
39
+
40
+ - The user wants a **whole-service health check**, not one narrow review — "전체 서비스를 점검하자",
41
+ "다차원 감사", "audit everything before launch / before we promote".
42
+ - You have the **Workflow / ultracode multi-agent capability** available (this skill assumes you
43
+ can fan out independent agents and re-aggregate). Without it, fall back to running the
44
+ dimensions sequentially yourself — but step 3 forbids a finder from grading itself, so
45
+ sequential mode **cannot produce a true 확정 (confirmed)**. In that mode, never emit 확정 verdicts: label
46
+ every finding 미검증 (unverified) or evidence-backed-only (a failing test / exposed secret / reproduced
47
+ crash counts; an opinion does not), and state plainly in the report that no independent
48
+ verification ran. That keeps the no-false-ship invariant honest when fan-out is absent.
49
+ - You need an output that is **prioritized and trustworthy** — every claim graded, weak claims
50
+ sunk in the ranking, nothing over-claimed.
51
+
52
+ If the user only wants one axis, use the focused sibling instead (see Cross-references). This
53
+ skill is the superset; don't reach for it when a scalpel will do.
54
+
55
+ ## The seven dimensions (extensible)
56
+
57
+ Each dimension gets a **named, enumerated rubric** before any auditing starts, so findings are
58
+ checked against explicit criteria rather than vibes. This is the discipline behind heuristic
59
+ evaluation (Nielsen / NN/g): a violation of a named rule is a *candidate* defect, justified
60
+ against context — not an automatic one.
61
+
62
+ | # | Dimension | Rubric to hand the finder agent |
63
+ |---|-----------|----------------------------------|
64
+ | 1 | **Code** | correctness/logic, security (injection, authz, exposed secrets), readability, tests-that-fail-when-logic-breaks, design/architecture fit (SonarSource multi-axis review) |
65
+ | 2 | **UX** | Nielsen's 10 usability heuristics; rate severity by impact, not by rule-match count |
66
+ | 3 | **Scalability** | data-model limits, hot paths, statefulness, single points of failure, cost-per-unit growth |
67
+ | 4 | **Planning + North Star** | the one North Star metric + its Inputs (Amplitude); does each finding move the metric or an Input? |
68
+ | 5 | **Security** | secrets exposure, authz boundaries, dependency CVEs, input trust, data egress |
69
+ | 6 | **Promotion / Marketing** | Working-Backwards: take the product's *implied PR/FAQ* (its promised value) and audit whether the built service + its messaging actually deliver — surface over-claim / false-ship gaps |
70
+ | 7 | **+ Extensible** | add a dimension by giving it (a) a named rubric and (b) its own independent verifier. Nothing else changes |
71
+
72
+ The framework set is load-bearing, not ornamental:
73
+ - **Heuristic Evaluation / Nielsen's 10** — named criteria per dimension, independent evaluators,
74
+ severity by impact. <https://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/>
75
+ - **Multi-axis code review** — distinct correctness/security/tests/design axes beat one
76
+ "looks good". <https://www.sonarsource.com/resources/library/code-review/>
77
+ - **North Star Framework** — anchors dimension 4; filters strategically immaterial noise.
78
+ <https://amplitude.com/books/north-star/about-north-star-framework>
79
+ - **Working Backwards (PR/FAQ)** — the inverse-test for promotion + planning: promised value vs
80
+ delivered. <https://workingbackwards.com/concepts/working-backwards-pr-faq-process/>
81
+
82
+ ## Core workflow
83
+
84
+ Orchestrate this as a **Workflow**: fan-out → adversarial verify → synthesize.
85
+
86
+ ### 1. Scope and set North Star (pre-flight)
87
+ Read the service's SPEC/PRD/NORTH_STAR and recent state. Name the North Star metric and its
88
+ Inputs explicitly — they are the strategic anchor every finding will be tested against. If you
89
+ can't state the North Star, stop and ask; auditing dimensions in isolation with no anchor just
90
+ generates busywork.
91
+
92
+ ### 2. Fan-out: independent finder(s) per dimension
93
+ Spawn at least one finder agent for each dimension with **its own rubric** (table above) — and
94
+ spawn several per dimension where the surface is large. This is the heavyweight step: a real run
95
+ fans out to **many agents in parallel**, not a fixed seven. Run them **independently**: NN/g
96
+ reports that independent evaluators catch issues a single pass misses, yet even 3 evaluators
97
+ find only about 60% of issues. One agent per dimension is therefore not enough on its own, which
98
+ is why step 3 exists. Each finder returns candidate findings with: dimension tag, the rubric item
99
+ violated, the evidence it actually observed, and a proposed severity.
100
+
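The fan-out can be sketched in Python as below. This is a minimal illustration, not this skill's implementation: `Finding`, `run_finder`, `fan_out`, and the rubric strings are hypothetical names, and the finder body is a stub where a real run would call an agent. It only shows finders running independently in parallel and returning the four fields listed above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    dimension: str          # dimension tag
    rubric_item: str        # rubric item violated
    evidence: str           # what the finder actually observed
    proposed_severity: str  # the finder's own proposal, not a verdict


def run_finder(dimension: str, rubric: list[str]) -> list[Finding]:
    """Hypothetical stub for one finder agent; a real finder would call a model."""
    return [
        Finding(dimension, item, f"observed while checking {item}", "medium")
        for item in rubric
    ]


def fan_out(rubrics: dict[str, list[str]], max_workers: int = 8) -> list[Finding]:
    """Run one finder per dimension in parallel; finders share no state."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_finder, dim, rubric) for dim, rubric in rubrics.items()]
        # Collect in submission order so the output is deterministic.
        return [finding for future in futures for finding in future.result()]


candidates = fan_out({
    "Security": ["secrets exposure", "authz boundaries"],
    "Code": ["failing tests"],
})
```

Spawning several finders per dimension would only mean submitting more jobs to the same pool; the candidates still go to step 3 unjudged.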
101
+ ### 3. Adversarial verify pass (the load-bearing step)
102
+ This is **a distinct second pass**, not the finders grading themselves. Re-order the reviewer
103
+ agents and give them *diverse* prompts/roles, then task them with peer-reviewing every round-one
104
+ assertion. This borrows from Multi-Agent Verification (MAV; BoN-MAV is its best-of-n variant):
105
+ run multiple independent verifiers at test time and accept only what survives cross-validation.
106
+ <https://arxiv.org/pdf/2502.20379>
107
+
108
+ Decision rule per finding:
109
+ - **Confirmed (확정)** — survives verification, or carries irrefutable evidence (a failing test,
110
+ an exposed secret, a reproduced crash). Verifier consensus, not one voice.
111
+ - **Unverified (미검증)** — **0 adversarial-verify votes** and no hard evidence. Kept in a
112
+ separate bucket. **Never reported as fact.** This is the no-false-ship invariant
113
+ (`.claude/rules/no-false-ship.md`).
114
+ - **Rejected (기각)** — majority of verifiers refute it (rubric-match without real defect, wrong
115
+ reasoning, already handled). A majority refutation *kills* the finding.
116
+
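The decision rule can be sketched as a small function. This is an illustrative sketch under two assumptions the text does not spell out: "consensus" is read as a strict majority of verifiers, and a split vote with no majority either way falls into the unverified bucket. `Verdict` and `decide` are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "확정"
    UNVERIFIED = "미검증"
    REJECTED = "기각"


def decide(confirm: int, refute: int, total_verifiers: int,
           hard_evidence: bool = False) -> Verdict:
    """Apply the decision rule to one finding's verifier votes."""
    if hard_evidence:
        # A failing test, exposed secret, or reproduced crash confirms on its own.
        return Verdict.CONFIRMED
    if refute * 2 > total_verifiers:
        # A majority refutation kills the finding.
        return Verdict.REJECTED
    if confirm * 2 > total_verifiers:
        # Assumption: consensus means a strict majority of verifiers.
        return Verdict.CONFIRMED
    # Zero votes, or (assumption) a split with no majority: never reported as fact.
    return Verdict.UNVERIFIED
```

For example, `decide(0, 0, 3)` returns `Verdict.UNVERIFIED`, so a finding nobody voted on stays out of the confirmed list.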
117
+ > Engineer verifier diversity deliberately. MAV names the failure mode that breaks the whole
118
+ > ensemble: **correlated-verifier collapse** — if every reviewer shares the same model, prompt,
119
+ > and blind spot, the adversarial pass rubber-stamps wrong findings and hands you false
120
+ > confidence. Vary roles, ordering, and prompts so verifiers don't share blind spots. If you
121
+ > cannot achieve independence, say so in the report and downgrade your confidence accordingly.
122
+
123
+ ### 4. Cap the loops
124
+ Verification cost scales with verifier count and debate rounds. Set a hard ceiling on
125
+ revision/debate iterations (this is a `gates-taxonomy` Revision gate — iteration cap mandatory)
126
+ and **escalate to the user rather than loop forever** on a contested finding. Unbounded debate
127
+ buys diminishing returns at runaway token cost.
128
+
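A capped debate loop might look like the sketch below. The ceiling of three rounds is an assumed value (the text only requires that a cap exists), and `resolve`, `debate_round`, and the `ESCALATE` sentinel are hypothetical names.

```python
from typing import Callable, Optional

MAX_DEBATE_ROUNDS = 3            # assumed ceiling; pick one that fits your budget
ESCALATE = "escalate-to-user"    # hypothetical sentinel for a contested finding


def resolve(finding: str,
            debate_round: Callable[[str, int], Optional[str]],
            max_rounds: int = MAX_DEBATE_ROUNDS) -> str:
    """Run debate rounds until one yields a verdict; escalate when the cap is hit."""
    for round_no in range(1, max_rounds + 1):
        verdict = debate_round(finding, round_no)
        if verdict is not None:
            return verdict
    # Cap reached without agreement: hand the finding to the user instead of looping.
    return ESCALATE
```

Here `debate_round` stands in for one verifier exchange that returns a verdict string once verifiers agree and `None` while the finding is still contested.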
129
+ ### 5. Score surviving findings with RICE
130
+ Rank confirmed (and any carried-forward unverified) findings with **(Reach × Impact ×
131
+ Confidence) / Effort**, which Intercom describes as "total impact per time worked".
132
+ <https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/>
133
+
134
+ The **Confidence multiplier is where the verification tier pays off** — map it directly:
135
+
136
+ | Verdict | RICE Confidence |
137
+ |---------|-----------------|
138
+ | Confirmed + hard evidence | 100% |
139
+ | Confirmed by verifier consensus | 80% |
140
+ | Unverified (carried, not dropped) | 50% or lower |
141
+
142
+ This makes weakly-evidenced findings **sink in the ranking automatically** — you carry them
143
+ honestly instead of either pretending they're certain or silently deleting them. Rank by raw
144
+ severity or gut feel and high-impact-but-unproven items jump the queue; the Confidence term
145
+ exists precisely to stop that.
146
+
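The mapping can be made concrete with a short sketch. The Reach, Impact, and Effort numbers below are invented for illustration, `rice` and the dictionary keys are hypothetical names, and unverified findings are pinned at the 50% ceiling from the table.

```python
# Verdict -> RICE Confidence, following the table above (0.5 is the upper bound
# for unverified findings; a stricter run could go lower).
CONFIDENCE = {
    "confirmed_hard_evidence": 1.0,
    "confirmed_consensus": 0.8,
    "unverified": 0.5,
}


def rice(reach: float, impact: float, verdict: str, effort: float) -> float:
    """Score = Reach x Impact x Confidence / Effort."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * CONFIDENCE[verdict] / effort


# Illustrative numbers only: same reach and effort, the unverified gap has higher impact.
code_bug = rice(reach=100, impact=2, verdict="confirmed_hard_evidence", effort=2)
marketing_gap = rice(reach=100, impact=3, verdict="unverified", effort=2)
ranked = sorted({"code_bug": code_bug, "marketing_gap": marketing_gap}.items(),
                key=lambda item: item[1], reverse=True)
```

With these inputs the confirmed bug scores 100.0 and the unverified gap 75.0, so the bug ranks first even though the gap has the larger Impact value.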
147
+ ### 6. Synthesize the M-numbered roadmap
148
+ Cluster surviving findings (affinity-style), trace each to the North Star or an Input (drop the
149
+ strategically immaterial), and emit a milestone roadmap with **as many milestones as the
150
+ findings warrant** — M1, M2, … however far the work runs (the real run landed at M4). There is
151
+ no fixed milestone count; severity and clustering decide it. Every roadmap item carries:
152
+ **dimension tag · verdict · RICE score · North-Star linkage · evidence pointer.** Items that
153
+ don't move the metric or an Input are flagged as nice-to-have, not milestone-blocking.
154
+
155
+ ## No-false-ship evidence matrix
156
+
157
+ Because this audit *itself* can over-claim, report each dimension's verification the same way the
158
+ repo's `no-false-ship` rule demands for shipped features — per-path evidence, unverified shown as
159
+ unverified, never one path's evidence reused for another:
160
+
161
+ ```
162
+ | Dimension | Finder evidence | Verifier outcome | Verdict |
163
+ |-------------|----------------------------|-------------------------|-----------|
164
+ | Code | failing test repro'd | 3/3 verifiers confirm | 확정 |
165
+ | UX | heuristic #4 violation | 2/3 confirm, context ok | 확정 |
166
+ | Security | suspected authz gap | 0 verifier votes | 미검증 |
167
+ | Promotion | README claim vs built | majority refute | 기각 |
168
+ ```
169
+
170
+ A row with no verifier votes stays "미검증" (unverified) in the final report. Hiding it and declaring "audit
171
+ complete" is exactly the false-ship failure this skill exists to prevent.
172
+
173
+ ## Worked example (Input → Output)
174
+
175
+ **Input** (user, verbatim trigger):
176
+ > "ultracode 전체 서비스를 점검하자 — 코드·UX·확장성·기획·북극성지표·보안·홍보 문제점 파악하고
177
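+ > 우선순위에 따라 개선하자."
+ >
+ > (English: "Let's audit the entire ultracode service: identify the problems in code, UX, scalability, planning, the North Star metric, security, and promotion, then fix them in priority order.")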
178
+
179
+ **Process:**
180
+ 1. Pre-flight: North Star = "weekly successful first-install completions"; Inputs = wizard
181
+ completion rate, CLI flag coverage, install success rate.
182
+ 2. Fan-out: 7 finders in this run (one per area the user named), each with its rubric. ~40 raw candidate findings.
183
+ 3. Adversarial verify (re-ordered, diverse verifiers): 29 survive, 8 refuted, several land in
184
+ 미검증 with 0 votes and stay there.
185
+ 4. Loop cap hit on one contested scalability claim → escalated to user, not debated to death.
186
+ 5. RICE: a confirmed-with-failing-test code bug (Confidence 100%) outranks a plausible-but-
187
+ unverified marketing gap (Confidence 50%) even though the marketing gap *felt* bigger.
188
+ 6. Synthesize.
189
+
190
+ **Output** (abridged):
191
+ ```
192
+ Service Audit — 확정 29 / 미검증 (carried) N / 기각 8
193
+
194
+ M1 (now): [Code·확정·RICE 9.6] install crash on --with-* flag (failing test attached)
195
+ [Security·확정·RICE 8.1] secret in committed config — moves Input "install success"
196
+ M2: [UX·확정·RICE 6.4] wizard step skips a category — moves Input "wizard completion"
197
+ M3: [Scale·확정·RICE 4.2] category list hardcoded in 2 places → derive
198
+ M4: [Promotion·확정·RICE 3.5] README over-claims a feature (Working-Backwards gap)
199
+ Parked: [Security·미검증·conf 50%] suspected authz gap — needs reproduction before action
200
+ Rejected: 8 findings (rubric-match without real defect / already handled)
201
+ (the run stopped at M4: milestone count follows the findings; it is not a fixed five)
202
+ ```
203
+ Every M-item: dimension-tagged, verdict-graded, North-Star-linked, RICE-ranked.
204
+
205
+ ## Cross-references (don't duplicate — hand off)
206
+
207
+ - **UX dimension** can spawn the multi-persona UX review skill (`multi-persona-review`) for
208
+ deeper persona-based heuristic inspection instead of a single UX finder.
209
+ - **Gap findings** (built vs promised, missing E2E coverage) hand off to `gap-analysis-e2e`
210
+ rather than being re-derived here.
211
+ - **The roadmap output** feeds `northstar-roadmap`, which owns milestone sequencing and
212
+ North-Star input modeling in depth.
213
+ - **Repo discipline this skill enforces:** `.claude/rules/no-false-ship.md` (evidence matrix,
214
+ confirmed/unverified/rejected) and `.claude/rules/gates-taxonomy.md` (Revision-loop cap,
215
+ Escalation on contested findings).
216
+
217
+ ## Progressive disclosure
218
+
219
+ This SKILL.md is the operating manual. If per-dimension rubrics need to grow (e.g. a full
220
+ Nielsen severity scale, or a language-specific code-review checklist), put them in a
221
+ `reference/` file beside this one and link it here — keep this file lean. The extensibility
222
+ contract stays: a new dimension = a named rubric + its own independent verifier, nothing else.
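
The extensibility contract can be pictured as a tiny registry. This is a hypothetical sketch, not something this skill ships: `Dimension`, `REGISTRY`, and `register` are invented names, and verifier independence is approximated by refusing to reuse the same verifier object for two dimensions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Dimension:
    name: str
    rubric: tuple[str, ...]          # (a) the named rubric
    verifier: Callable[[str], bool]  # (b) this dimension's own verifier


REGISTRY: dict[str, Dimension] = {}


def register(dimension: Dimension) -> None:
    """Add a dimension only if it satisfies both halves of the contract."""
    if not dimension.rubric:
        raise ValueError("a dimension needs a named rubric")
    if any(existing.verifier is dimension.verifier for existing in REGISTRY.values()):
        raise ValueError("each dimension needs its own independent verifier")
    REGISTRY[dimension.name] = dimension


def verify_accessibility(claim: str) -> bool:
    """Hypothetical verifier stub; a real one would re-check the claim independently."""
    return "contrast" in claim


register(Dimension("Accessibility", ("color contrast", "keyboard navigation"), verify_accessibility))
```

Nothing else in the workflow has to change when a dimension is registered this way, which is the point of the contract.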