@uzysjung/agent-harness 26.86.0 → 26.88.0

@@ -0,0 +1,211 @@
1
+ ---
2
+ name: multi-persona-review
3
+ description: >-
4
+ A panel-review skill that critiques ONE artifact (launch post, README, doc, markdown, plan,
5
+ design) via 3-5 disjoint user-perspective personas running in parallel, then synthesizes deduped,
6
+ severity-ranked improvement points (P0/P1/P2). Use when the user says "작성글을 사용자 관점의
7
+ 페르소나를 여러명 만들어서 (손넷 모델정도로) 피드백 받아바", "다면 리뷰 해볼까", "페르소나로 리뷰",
8
+ "여러 관점으로 피드백", or in English "multi-persona review", "review this from different user
9
+ perspectives", "get persona feedback on this post/README/doc", "panel review this artifact".
10
+ Lighter than a full service audit — point it at ONE artifact, not a whole codebase. NOT for a
11
+ whole-codebase multi-dimension audit (use ultracode-service-audit) or a single-axis
12
+ gap-vs-benchmark loop (use gap-analysis-e2e).
13
+ ---
14
+
15
+ # Multi-Persona Review (다면페르소나 워크플로우 리뷰)
16
+
17
+ Run a small panel of realistic target-user personas over one artifact, independently and in
18
+ parallel, then synthesize their findings into a deduped, prioritized fix list. This is how the
19
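+ user actually works: "create several user-perspective personas for the post and get feedback at
20
+ about Sonnet-model level" and "shall we do a multi-angle review of this part too?", which means
21
+ 4-5 Sonnet-tier personas across 1-2 passes over a launch post, yielding P0-P2 prioritized fixes.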
22
+
23
+ ## When to use
24
+
25
+ - A draft is "done" but you want to catch what a fatigued author no longer sees: launch post,
26
+ README, PRD/plan, doc, marketing copy, a design.
27
+ - The user names personas or "다면 리뷰" / "여러 관점" / "multi-persona" / "panel review".
28
+ - You want **reproducible, severity-ranked** feedback, not one reviewer's gut reaction.
29
+
30
+ Do **not** use this for whole-codebase quality work — that's `ultracode-service-audit`. This skill
31
+ is deliberately lighter: one artifact, one panel, one synthesis. To surface missing user journeys
32
+ end-to-end, use `gap-analysis-e2e`; this skill's findings feed its UX lens.
33
+
34
+ ## Why a panel beats one reviewer (the evidence)
35
+
36
+ The whole method rests on one empirical fact: **independent reviewers find largely
37
+ non-overlapping problems.**
38
+
39
+ - **Heuristic Evaluation (Nielsen & Molich) + the 3-5 evaluator rule** — a single evaluator
40
+ catches only ~35% of usability issues; aggregating independent evaluators raises coverage to
41
+ ~85% at five, with sharp diminishing returns beyond. The value comes from *low overlap between
42
+ perspectives*, not any one reviewer being thorough. Some of the hardest issues are found by an
43
+ evaluator who otherwise finds few. Each judges against the *same explicit checklist* so reviews
44
+ stay comparable and dedupable.
45
+ https://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/theory-heuristic-evaluations/
46
+ - **Panel of LLM evaluators (PoLL)** — a panel of several smaller, *disjoint* judges beats one
47
+ large judge, shows less self-preference bias, and costs ~7x less. This is the cost-tier reason
48
+ the user runs the persona panel at Sonnet tier and reserves the main model for orchestration and
49
+ synthesis. https://arxiv.org/abs/2404.18796
50
+ - **"Nine Judges, Two Effective Votes"** — panels help *only to the extent members fail
51
+ independently*. A 9-judge panel carried only ~2 independent votes' worth of information because
52
+ the models made the same mistakes on the same items. The bottleneck is **correlated reviewers,
53
+ not panel size or aggregation math** — so persona design must maximize genuine viewpoint
54
+ diversity, not nominal count. https://arxiv.org/abs/2605.29800
55
+ - **LLM-as-persona-reviewer vs human experts (GPT-4o study)** — persona review finds many real
56
+ issues but also emits false positives humans wouldn't flag, and misses issues needing embodied
57
+ experience. Recommended posture: a **hybrid** where personas generate candidate findings that a
58
+ human validates — never a replacement for human judgment. https://arxiv.org/pdf/2506.16345
59
+ - **RICE prioritization (Intercom)** — (Reach × Impact × Confidence) / Effort turns rough guesses
60
+ into one comparable score, down-weighting low-confidence/high-effort items and countering the
61
+ reviewer's bias toward what they'd personally use. A lightweight analog gives a *defensible,
62
+ reproducible* map from findings to P0/P1/P2.
63
+ https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/
64
+
65
+ ## Core workflow
66
+
67
+ ### 1. Frame the artifact (orchestrator, main model)
68
+
69
+ Capture three things the personas will all share:
70
+ - **Goal** — what is this artifact trying to achieve? (e.g. "get a developer to `npx` install in
71
+ under 2 minutes and star the repo")
72
+ - **Audience** — who is the real target reader?
73
+ - **Rubric** — the shared checklist every persona scores against, so findings are comparable and
74
+ dedupable. Default rubric (adapt to the artifact): *clarity of value prop · first-action
75
+ friction · credibility/trust signals · scannability · accuracy/honesty · accessibility ·
76
+ call-to-action*. Without a shared rubric, red-team reviews decay into proofreading and generic
77
+ opinions, and findings stop being comparable across personas.
78
+
79
+ ### 2. Design 3-5 genuinely disjoint personas
80
+
81
+ Cap the panel at five — coverage flattens beyond that, and extra personas mostly inflate tokens
82
+ and false confidence (the "Nine Judges" trap). Engineer **diversity, not count**: pick personas
83
+ with disjoint goals, contexts, and *failure-fears* so their blind spots don't correlate. A strong
84
+ default spread:
85
+
86
+ | Persona | Lens / what they fear |
87
+ |---|---|
88
+ | Skeptical newcomer | Doesn't know the domain; fears wasting time on hype. Tests "do I get it in 10s?" |
89
+ | Time-pressured expert | Knows the domain; fears fluff between them and the command. Tests scannability + first action. |
90
+ | Accessibility-dependent user | Screen reader / low vision / non-native reader. Tests structure, alt text, plain language. |
91
+ | Hostile/adversarial reader | Looks for overclaims, vague benefits, anything to dismiss. Tests honesty + credibility. |
92
+ | Adjacent-tool migrant *(optional 5th)* | Already uses a competitor. Tests differentiation + "why switch?". |
93
+
94
+ Swap personas to fit the artifact (e.g. for a PRD: implementing engineer, on-call SRE, PM,
95
+ security reviewer). The test is always: would these two personas make the *same* mistake? If yes,
96
+ they're not independent — replace one.
97
+
98
+ ### 3. Review in parallel, independently (Sonnet-tier panel)
99
+
100
+ Spawn one sub-agent per persona via the **Task tool** (or the harness's sub-agent mechanism). Each
101
+ one gets the artifact + goal + audience + the *same* rubric, and **must not see the other personas'
102
+ output** — independence is the precondition that makes aggregation add information. Anchoring on a
103
+ peer collapses the panel toward one effective vote.
104
+
105
+ Prefer pinning the persona sub-agents to a cheaper tier (Sonnet) — see the cost-tier note. But this
106
+ degrades gracefully: if the harness can't pin sub-agents to a specific model, just run the panel on
107
+ the default sub-agent model and note in the step-6 coverage caveat that the panel ran at the
108
+ orchestrator tier. The tier is an economy, not a hard prerequisite.
109
+
110
+ Each persona returns findings as **strengths / weaknesses / specific recommendations**. Require
111
+ every finding to be specific and actionable: **quote the offending passage and propose a concrete
112
+ fix.** Ban vague "needs work" notes — that's the classic red-team failure mode (briefing +
113
+ structured findings + independence are the load-bearing parts, not the critical attitude).
114
+ https://loopio.com/blog/red-team-review/
115
+
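The independence requirement in step 3 can be sketched as follows. This is a minimal illustration, not the harness's actual sub-agent API: `Brief`, `review_as`, and `run_panel` are hypothetical names, and `review_as` is a stub standing in for whatever sub-agent mechanism is available. The point it shows is structural: each worker receives only its persona and the shared brief, so no persona can anchor on a peer's output.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    """Shared framing from step 1: every persona gets the same one."""
    artifact: str
    goal: str
    audience: str
    rubric: tuple[str, ...]


def review_as(persona: str, brief: Brief) -> dict:
    """Hypothetical stand-in for one sub-agent call.

    A real harness would dispatch a sub-agent here. The stub only makes
    the inputs explicit: the persona and the shared brief, nothing else.
    """
    return {
        "persona": persona,
        "rubric": brief.rubric,
        "findings": [],  # each finding: quoted passage + concrete fix
    }


def run_panel(personas: list[str], brief: Brief) -> list[dict]:
    """Run all personas in parallel; results come back in persona order."""
    if not 3 <= len(personas) <= 5:
        raise ValueError("panel size should be 3-5 personas")
    # Each worker sees only (persona, brief); no peer output is passed in.
    with ThreadPoolExecutor(max_workers=len(personas)) as pool:
        return list(pool.map(lambda p: review_as(p, brief), personas))
```

Keeping the brief immutable (`frozen=True`) is a small design choice that makes it harder for one reviewer to leak state into another through the shared object.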
116
+ ### 4. Synthesize: dedupe, but preserve minority findings (orchestrator, main model)
117
+
118
+ Collapse overlapping findings into one entry, noting *how many personas raised it* (frequency is a
119
+ prioritization signal). **But never drop a single-persona finding** — heuristic-evaluation data
120
+ says the hardest, most valuable issues are often raised by only one reviewer. Majority-vote /
121
+ consensus filtering would silently discard exactly those. Keep them, tagged as single-source.
122
+
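A rough sketch of the dedupe rule in step 4, under one loud assumption: the orchestrator has already assigned each raw finding a shared `issue` key (how overlapping findings are recognized is not specified here, so that part is left out). The function name `synthesize` and the field names are illustrative only.

```python
def synthesize(raw_findings: list[dict]) -> list[dict]:
    """Merge findings that share an issue key; never drop single-source ones.

    Each raw finding is assumed to look like:
    {"issue": "install-buried", "persona": "expert", "note": "..."}
    """
    merged: dict[str, dict] = {}
    for finding in raw_findings:
        entry = merged.setdefault(
            finding["issue"],
            {"issue": finding["issue"], "personas": [], "notes": []},
        )
        # Count each persona once per issue, even if it raised it twice.
        if finding["persona"] not in entry["personas"]:
            entry["personas"].append(finding["persona"])
        entry["notes"].append(finding["note"])

    for entry in merged.values():
        entry["frequency"] = len(entry["personas"])
        # Tag rather than filter: single-source findings stay in the list.
        entry["single_source"] = entry["frequency"] == 1
    return list(merged.values())
```

Note that there is no threshold anywhere in the function: frequency is recorded as a signal for step 5, and nothing is removed for lack of consensus.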
123
+ ### 5. Prioritize with a transparent rule → P0/P1/P2
124
+
125
+ Map each finding to a bucket with a **reproducible** rule, not by gut feel or by which persona
126
+ phrased it loudest. Use a RICE-style or **severity × frequency** score:
127
+
128
+ - **P0** — blocks the artifact's goal for many readers (e.g. value prop unreadable in first
129
+ screen; a false claim). High impact × high confidence, any effort.
130
+ - **P1** — meaningfully hurts conversion/trust but has a workaround.
131
+ - **P2** — polish, edge-reader, or low-confidence/high-effort items.
132
+
133
+ Show the score inputs so the ranking is auditable.
134
+
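One possible way to make the severity × frequency rule reproducible is sketched below. The numeric weights and the severity-to-bucket mapping are assumptions chosen for illustration, not values this skill prescribes; they are picked so that a high-severity finding raised by a single persona still lands in P0, with frequency ordering items inside each bucket.

```python
# Assumed weights and mapping; adjust to the artifact and rubric.
SEVERITY_WEIGHT = {"high": 3, "med": 2, "low": 1}
BUCKET_BY_SEVERITY = {"high": "P0", "med": "P1", "low": "P2"}


def prioritize(findings: list[dict]) -> list[dict]:
    """Attach an auditable score and bucket to each finding, then rank.

    Each finding is assumed to carry "id", "severity", and "frequency"
    (the number of distinct personas that raised it).
    """
    ranked = []
    for finding in findings:
        weight = SEVERITY_WEIGHT[finding["severity"]]
        ranked.append({
            **finding,
            "score": weight * finding["frequency"],  # severity x frequency
            "bucket": BUCKET_BY_SEVERITY[finding["severity"]],
        })
    # Bucket first (P0 < P1 < P2 as strings), then higher score first.
    ranked.sort(key=lambda f: (f["bucket"], -f["score"]))
    return ranked
```

Because the score inputs stay on each record, anyone reading the output can recompute the ranking by hand.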
135
+ ### 6. Triage as candidates, state coverage honestly
136
+
137
+ Present the list as **candidate findings needing a validation pass**, not gospel. Flag likely
138
+ false positives and note where real-user confirmation is warranted before committing fixes — LLM
139
+ personas both miss embodied issues and invent non-issues. End with an honest coverage caveat: a
140
+ panel never finds every issue and offers no systematic fix generation (Nielsen's own caveat).
141
+ Claiming exhaustiveness here would be a no-false-ship violation.
142
+
143
+ **Second pass (the "1-2 passes"):** run the same panel again *after fixes land* to confirm the P0s
144
+ are actually closed and that the edits didn't introduce new issues. One pass to find, one to verify
145
+ — a third rarely pays off.
146
+
147
+ ## Worked example (Input → Output)
148
+
149
+ **Input:** Trigger: "Shall we do a multi-angle review of this launch post? Four personas on Sonnet." Artifact: a launch
150
+ post for an npm installer CLI. Goal: "reader runs `npx ... init` and stars the repo." Audience:
151
+ indie devs scanning a feed.
152
+
153
+ **Panel (parallel, Sonnet tier):** skeptical newcomer · time-pressured expert ·
154
+ accessibility-dependent reader · hostile reader.
155
+
156
+ **Raw findings (excerpt):**
157
+ - Newcomer: "Paragraph 1 says 'context-engineered harness' — I don't know what that buys me.
158
+ Quote: *'A context-engineered harness for agentic CLIs.'* Fix: lead with the outcome — *'Install
159
+ vetted plugins, skills, and rules across 4 AI CLIs in one command.'*"
160
+ - Expert: "The install command is below three paragraphs of philosophy. Fix: move `npx` line to
161
+ the first screen." *(also raised by newcomer → frequency 2)*
162
+ - Accessibility: "Demo is a GIF with no text fallback; the actual command only appears in the GIF.
163
+ Fix: put the command in a code block as text."
164
+ - Hostile: "'Works everywhere' — claims 4 CLIs but only shows Claude. Fix: either show all four or
165
+ soften to 'Claude today, others in progress.'" *(single-source, kept)*
166
+
167
+ **Synthesized + prioritized output:**
168
+
169
+ | ID | Finding (deduped) | Personas | Sev × Freq | Bucket |
170
+ |---|---|---|---|---|
171
+ | F1 | Install command buried below the fold / inside GIF only | expert, newcomer, a11y | high × 3 | **P0** |
172
+ | F2 | Value prop is jargon, not outcome, in first screen | newcomer | high × 1 | **P0** |
173
+ | F3 | "Works everywhere" overclaims vs. evidence shown | hostile | med × 1 | **P1** |
174
+ | F4 | Demo GIF has no text alternative | a11y | med × 1 | **P1** |
175
+
176
+ **Caveat returned to user:** candidate findings from a 4-persona Sonnet panel; F3 (overclaim) is
177
+ worth confirming against what the post can actually demo before rewording. Not exhaustive — a real
178
+ indie-dev read may surface more.
179
+
180
+ This mirrors the user's real run (memory: `persona-feedback-improvements`, P0-before-publish gate).
181
+
182
+ ## Cost-tier note
183
+
184
+ Run the **persona panel at a cheaper tier (Sonnet)** — PoLL shows a disjoint panel of smaller
185
+ judges beats one big judge at a fraction of the cost. Reserve the **main/orchestrator model** for
186
+ framing the rubric and synthesizing (steps 1, 4-6), where reasoning quality pays off most.
187
+
188
+ ## Pitfalls to avoid
189
+
190
+ - **False diversity** — personas that share the model's default assumptions give far fewer than N
191
+ views. Design for disjoint fears; if two would make the same mistake, replace one.
192
+ - **Scaling count to fix quality** — past ~5 personas you mostly buy tokens and noise. Fix
193
+ independence, not size.
194
+ - **Consensus filtering** — dropping single-persona findings discards the rare, hard issues that
195
+ are the whole point.
196
+ - **Anchoring** — letting personas see each other's output before judging collapses the panel.
197
+ - **Opaque P0/P1/P2** — ranking by vibe or loudest wording is unauditable. Show the score.
198
+ - **Over-claiming coverage** — report it as candidate findings, never "found everything."
199
+
200
+ ## Cross-references
201
+
202
+ - `ultracode-service-audit` — full multi-dimensional audit of a whole service/codebase; this skill
203
+ is the lighter, single-artifact UX lens.
204
+ - `gap-analysis-e2e` — this skill feeds its UX/user-journey lens.
205
+ - `critique` — design-specific persona critique with anti-pattern detection; reach for it when the
206
+ artifact is a UI rather than prose/markdown.
207
+
208
+ > This SKILL.md is complete and self-contained — everything needed to run a panel is above. If the
209
+ > method ever needs deeper appendices (full default rubrics per artifact type, persona prompt
210
+ > templates, a RICE scoring worksheet), a `reference/` file alongside this SKILL.md is the place to
211
+ > add them. That's a future-extension option, not a missing dependency.
@@ -0,0 +1,176 @@
1
+ ---
2
+ name: northstar-roadmap
3
+ description: >-
4
+ Read the project's NORTH_STAR / vision doc, measure current state against the goal, then
5
+ propose a forward direction plus prioritized feature proposals — persisted as a durable
6
+ roadmap in docs/plans + memory so the plan survives /compact and new sessions. Use when
7
+ the user asks where the project should go next or wants a backlog grounded in the vision.
8
+ Fires on the user's real phrasings: "앞으로 어떤 방향으로 개선·발전시킬지 고민해봐",
9
+ "NORTH.md / NORTH_STAR 보고 나아갈 방향 + 기능 제안", "나아갈 방향 + 기능제안 (수용 → 계획 수립하고 메모리에 기록)",
10
+ "북극성 정렬 로드맵", as well as the English equivalents: "what direction should we take next",
11
+ "propose a roadmap / feature backlog from the north star", "plan the next milestones and save it
12
+ to memory". Not for detecting bugs or auditing current quality (see gap-analysis-e2e /
13
+ ultracode-service-audit) — this skill DIRECTS forward planning.
14
+ ---
15
+
16
+ # North-Star Roadmap (북극성 정렬 로드맵 + 기능 제안)
17
+
18
+ Turn a vision document into a forward direction and a ranked feature backlog, then write it
19
+ somewhere durable. The point is alignment, not idea generation: every proposal must trace
20
+ upward to the north-star, and the result must outlive the conversation that produced it.
21
+
22
+ ## When to use
23
+
24
+ Reach for this skill when the user steps back from day-to-day work and asks where the project
25
+ should head — typically with one of these (their actual phrasings):
26
+
27
+ - "앞으로 어떤 방향으로 개선·발전시킬지 고민해봐"
28
+ - "NORTH.md / NORTH_STAR 보고 나아갈 방향 + 기능 제안"
29
+ - "(제안) 수용 → 계획 수립하고 메모리에 기록"
30
+ - English: "what direction next", "propose a roadmap from the north star", "save the plan to memory"
31
+
32
+ Do **not** use it to find what's broken right now. Detecting defects, gaps, or quality regressions
33
+ is the job of the sibling skills below; this skill consumes their findings and points forward.
34
+
35
+ ## Why these steps (the frameworks underneath)
36
+
37
+ The workflow chains four established product-strategy methods so the output is defensible rather
38
+ than vibes. Reason with each — don't just cite it:
39
+
40
+ - **North Star Framework** (Amplitude) — a single North Star Metric is the destination; 3–5
41
+ directly-influenceable *Inputs* are the levers. You assess "current vs goal" against the inputs
42
+ (leading indicators teams can move), not lagging vanity numbers.
43
+ https://amplitude.com/books/north-star/about-north-star-framework
44
+ - **Working Backwards / PR-FAQ** (Amazon) — for a major proposal, sketch the future end-state first
45
+ (a one-line "press release" of the value the user gets), then derive the features. This forces
46
+ clarity and stops "we can build X because we know how" reasoning.
47
+ https://workingbackwards.com/concepts/working-backwards-pr-faq-process/
48
+ - **OKR lineage, not cascade** (Gothelf) — every roadmap item must have a *parent* it supports in
49
+ the north-star. Items invented bottom-up that don't ladder up get cut. This is the core alignment test.
50
+ https://jeffgothelf.com/blog/aligning-not-cascading-okrs-with-an-okr-lineage/
51
+ - **RICE / ICE scoring** (Intercom) — rank proposals by `(Reach × Impact × Confidence) / Effort`
52
+ (RICE), or `Impact × Confidence × Ease` (ICE) when data is thin. Confidence is where you honestly
53
+ discount exciting-but-unproven ideas. Scores are *inputs to a decision, not the verdict* — log
54
+ every strategic override.
55
+ https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/ ·
56
+ https://agileseekers.com/blog/feature-prioritization-using-rice-and-ice-models-in-product-roadmaps
57
+ - **Theme-based Now / Next / Later** — organize the roadmap by outcome themes and horizons, not
58
+ dated feature promises, so it ages gracefully and the why/what stays above the how/when.
59
+
60
+ ## Core workflow
61
+
62
+ ### 1. READ the north-star and restate it as Metric + Inputs
63
+
64
+ Read the project's vision doc (here: `docs/NORTH_STAR.md` — it already defines the North Star
65
+ Statement, the NSM, and measured Inputs). Restate the goal as **one North Star Metric + 3–5
66
+ influenceable Inputs**. If the doc already has them, lift them; if it only has a prose vision,
67
+ derive a candidate set and show it for confirmation.
68
+
69
+ Sanity-check the metric before trusting it:
70
+ - Is it a **leading** indicator of value, or a lagging one (raw revenue, registered users, page
71
+ views)? Lagging metrics are "what's done is done" — you can't steer by them.
72
+ - Is it **gameable**? "If you can move it directly without delivering value, it's not a good
73
+ north-star." Flag it instead of silently planning against a corrupt target.
74
+
75
+ > In this repo the literal north-star is **GitHub stars** (per memory and the service-audit
76
+ > roadmap), with the NORTH_STAR NSM (HITO ≤ 3/feature, low re-clarification) as the *value* the
77
+ > stars are supposed to reward. Plan toward stars **via** the value inputs, not by gaming the count.
78
+
79
+ ### 2. ASSESS current state against each Input — expose the gap
80
+
81
+ For each Input, state where the project is today vs target, using real evidence (existing plans,
82
+ audit output, metrics, code state). The deliverable is the **gap**: the distance between now and
83
+ the north-star, per lever. Be honest about unknowns — an unmeasured input is a gap too.
84
+
85
+ ### 3. PROPOSE direction + features by working backwards
86
+
87
+ First name the **forward direction** in a sentence or two — the theme(s) that close the biggest
88
+ gaps. Then, for each significant proposal:
89
+ - Write a one-line mini-PR (the future state: who gets what value once it ships).
90
+ - List the concrete feature(s) that realize it.
91
+ - State its **parent** — which Input / north-star pillar it supports. **No parent → cut it.**
92
+ This is the alignment gate that prevents bottom-up feature churn.
93
+
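The alignment gate in step 3 is simple enough to express directly. The sketch below assumes proposals are plain dicts with an optional `parent` field and that Inputs are identified by short keys; both are illustrative conventions, not something the skill defines.

```python
def alignment_gate(proposals: list[dict], inputs: set[str]) -> tuple[list[dict], list[dict]]:
    """Split proposals into (kept, cut) by whether they name a known parent.

    A proposal with no parent, or a parent that is not one of the
    north-star Inputs, is cut rather than silently carried forward.
    """
    kept, cut = [], []
    for proposal in proposals:
        if proposal.get("parent") in inputs:
            kept.append(proposal)
        else:
            cut.append(proposal)
    return kept, cut
```

Returning the cut list as well, instead of discarding it, keeps the decision visible so it can be reported back alongside the roadmap.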
94
+ ### 4. PRIORITIZE with RICE (or ICE) and record overrides
95
+
96
+ Score each proposal. Use RICE when you have reach/effort signal; ICE for a thin-data first pass.
97
+ Write the numbers down so the ranking is auditable. Then apply judgment: dependencies, strategic
98
+ table-stakes, and north-star fit may override a score — **log WHY** for every override (the same
99
+ honesty the repo's `no-false-ship` and ADR/Decision-Log rules demand). Treat the score as a
100
+ decision aid, never an autopilot.
101
+
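As a small illustration of the scoring in step 4, the sketch below computes RICE and ICE from the formulas cited earlier in this file. The `Proposal` class and `rank` helper are hypothetical names, and the scales used for Reach, Impact, and Effort are whatever the team agrees on; only the formulas themselves come from the text.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    name: str
    reach: float
    impact: float
    confidence: float  # 0..1, the honest discount for unproven ideas
    effort: float

    def __post_init__(self) -> None:
        if self.effort <= 0:
            raise ValueError("effort must be positive")

    @property
    def rice(self) -> float:
        # (Reach x Impact x Confidence) / Effort
        return self.reach * self.impact * self.confidence / self.effort


def ice(impact: float, confidence: float, ease: float) -> float:
    """Thin-data alternative: Impact x Confidence x Ease."""
    return impact * confidence * ease


def rank(proposals: list[Proposal]) -> list[Proposal]:
    """Order by RICE, highest first. Overrides are applied and logged afterwards."""
    return sorted(proposals, key=lambda p: p.rice, reverse=True)
```

The ranking is deliberately separate from any override step: the score order is computed mechanically, and a human then records why a given item moved, if it did.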
102
+ ### 5. PERSIST as a Now / Next / Later roadmap in docs + memory
103
+
104
+ Write the result to a durable artifact so it survives `/compact` and new sessions:
105
+ - **Update the existing roadmap SSOT** under `docs/plans/` (here: `service-audit-roadmap.md`) in
106
+ place, organized by outcome themes in **Now / Next / Later** horizons — not dated commitments.
107
+ Prefer one living doc over spawning a dated parallel (e.g. `northstar-roadmap-2026-06.md`): two
108
+ roadmaps drift. Only create a separate dated horizon doc if the user explicitly asks for one.
109
+ - Add a short anchor line in `memory/MEMORY.md` pointing at that doc, so it's re-referenced at session
110
+ start as a Persistent Anchor (per the CLAUDE.md rule that the SPEC/PRD is re-referenced at the start of every session).
111
+ - If a real architectural decision was made (e.g. dropping a direction), record an ADR under
112
+ `docs/decisions/`, numbered as the next free integer (check the highest existing `ADR-NNN`
113
+ first), per `change-management.md`.
114
+
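The "next free integer" check for ADRs could look roughly like this. It assumes ADR files start with `ADR-` followed by digits, as the `ADR-NNN` pattern above suggests; the function name is hypothetical, and zero-padding width is left to the caller because the repo's convention is not stated here.

```python
import re

# Matches a leading "ADR-" followed by one or more digits.
ADR_PATTERN = re.compile(r"^ADR-(\d+)")


def next_adr_number(filenames: list[str]) -> int:
    """Return one more than the highest existing ADR number (1 if none exist)."""
    numbers = [
        int(match.group(1))
        for name in filenames
        if (match := ADR_PATTERN.match(name))
    ]
    return max(numbers, default=0) + 1
```

In practice the filename list would come from listing `docs/decisions/`; passing names in as a plain list keeps the numbering logic testable without touching the filesystem.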
115
+ Then report back: the direction, the ranked table, and the file paths written. Mark it a **living
116
+ document** — both the metric and the priorities are expected to be revised as strategy evolves.
117
+
118
+ ## Worked example
119
+
120
+ **Input (user):** `/northstar-roadmap NORTH_STAR 보고 나아갈 방향 + 기능 제안 → 계획 세워서 메모리에 기록`
121
+
122
+ **Output (abridged):**
123
+
124
+ > **1. Metric + Inputs** (from `docs/NORTH_STAR.md`)
125
+ > North Star Metric: GitHub stars as the adoption signal that the harness delivers real value.
126
+ > Inputs: (a) HITO ≤ 3 / feature, (b) Promise = Implementation (advertised assets 100% work),
127
+ > (c) first-run install success, (d) vetted+security-audited asset count, (e) 4-CLI parity.
128
+ > *Leading-indicator check:* stars alone are gameable, so we plan against inputs (a)–(e), the value
129
+ > the stars reward — not the raw count.
130
+ >
131
+ > **2. Gap** — (a) HITO unmeasured in fresh installs · (b) closed after no-false-ship work · (c)
132
+ > install-matrix green but no post-publish telemetry · (e) plugins don't yet reach non-Claude CLIs.
133
+ >
134
+ > **3. Direction** — "Make the *first 60 seconds* prove the value before asking for a star."
135
+ > Proposals: P1 first-run telemetry (parent: Input c); P2 plugin→codex/opencode equivalence
136
+ > (parent: Input e); P3 public, continuously-tested compat+security matrix artifact (parent: Input d).
137
+ > Each carries a one-line mini-PR.
138
+ >
139
+ > **4. RICE**: P1 (R 5 × I 2 × C 0.8 / E 2 = 4.0) · P3 (R 4 × I 3 × C 0.7 / E 3 = 2.8) · P2 (R 2 × I 3 × C 0.6 / E 5 ≈ 0.7).
140
+ > *Override:* P3 promoted above its score — it's the repo's defensibility wedge (table-stakes for
141
+ > a 0-star cold start). Logged here and in the ADR.
142
+ >
143
+ > **5. Persisted** — updated the SSOT `docs/plans/service-audit-roadmap.md` in place (Now: P1 ·
144
+ > Next: P3 · Later: P2), added the anchor to `memory/MEMORY.md`, opened the next free ADR
145
+ > `docs/decisions/ADR-0NN-p3-override.md` for the strategic override.
146
+
147
+ ## Pitfalls to design against
148
+
149
+ - **Vanity / lagging / gameable north-star** — the two failure modes to refuse: picking a metric
150
+ you can't influence (revenue, raw users) or one you can move without delivering value.
151
+ - **False precision in RICE/ICE** — subjective Reach/Impact/Effort treated as exact truth. Confidence
152
+ exists to discount shaky estimates; skipping it yields authoritative-looking wrong rankings.
153
+ - **Score on autopilot** — shipping the top-RICE item while ignoring dependencies or strategic fit.
154
+ - **Dated feature-list roadmap** — timeline promises rot; outcome themes in Now/Next/Later age better.
155
+ - **Bottom-up idea dump** — proposals that don't ladder up to an Input. The alignment gate (step 3)
156
+ is the cure.
157
+ - **Plan that doesn't persist** — a great assessment that lives only in the chat and is lost at
158
+ `/compact`. The artifact in step 5 is the whole point.
159
+
160
+ ## Cross-references (siblings — do not duplicate)
161
+
162
+ - **gap-analysis-e2e** — *detects* north-star gaps end-to-end. This skill consumes those gaps as the
163
+ evidence in step 2.
164
+ - **ultracode-service-audit** — produces a multi-dimension audit and roadmap of *current* problems.
165
+ This skill takes that roadmap as input and points it forward.
166
+ - **strategic-compact** / project ADR + plan-SSOT conventions — the persistence mechanism (step 5)
167
+ reuses them rather than reinventing.
168
+
169
+ > Audit and gap skills answer "what's wrong now?". This skill answers "where do we go, and in what
170
+ > order?" — and makes the answer durable.
171
+
172
+ ## Reference (progressive disclosure)
173
+
174
+ This SKILL.md is the operating summary. If deeper method detail is ever needed — full RICE worked
175
+ calculations, a PR-FAQ template, or a roadmap-doc skeleton — add a `reference.md` beside this file
176
+ and link it here. Keep SKILL.md lean; the user dislikes verbose notepad docs.
@@ -0,0 +1,224 @@
1
+ ---
2
+ name: ultracode-service-audit
3
+ description: >-
4
+ Run a multi-agent, adversarially-verified full-service audit across 7 dimensions
5
+ (code / UX / scalability / planning+north-star / security / promotion / extensible),
6
+ separating findings into confirmed / unverified / rejected and producing a
7
+ priority-ranked, M-numbered milestone roadmap (as many milestones as the findings warrant).
8
+ Use when the user says
9
+ "ultracode 전체 서비스 점검", "전체 서비스를 점검하자", "코드·UX·확장성·기획·북극성지표·보안·홍보 문제점을 파악하고 우선순위에 따라 개선",
10
+ "다차원 서비스 감사", or in English "audit the whole service / full multi-dimensional service audit /
11
+ find code, UX, scalability, planning, security, and marketing problems and prioritize fixes".
12
+ The heavyweight superset audit — orchestrate it as a Workflow with fan-out finders and an adversarial verify pass.
13
+ NOT for a single-artifact prose/README review (use multi-persona-review) or a single-axis
14
+ gap-vs-benchmark loop (use gap-analysis-e2e) — those are the lighter siblings.
15
+ ---
16
+
17
+ # Ultracode Service Audit
18
+
19
+ The heavyweight, multi-agent audit of an *entire* service across many dimensions at once.
20
+ The fan-out is orchestrated as a **Workflow / multi-agent run, and it can be large**: the real
21
+ run drove many agents in parallel, not just one per dimension. "ultracode" implies that heavyweight
22
+ parallelism: a finder (often several) per dimension plus a separate squad of verifiers. Where a
23
+ single skill inspects one axis (UX, or code, or strategy), this one fans out finder agents per
24
+ dimension, then runs a **separate adversarial verification pass** so that only findings that
25
+ survive cross-examination are reported as real. The output is one priority-ranked roadmap where
26
+ every item is dimension-tagged, evidence-graded, and traceable to the product's North Star.
27
+
28
+ This is the skill behind the user's real request (turn 94):
29
+
30
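+ > "ultracode: based on everything developed so far, let's audit the whole service. Identify the
31
+ > problems in code, UX, scalability, planning and North Star metrics, security, and promotion,
32
+ > and the improvements for each ... then improve in priority order." (translated from Korean)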
33
+
34
+ That run produced **29 confirmed (확정) / 0 unverified (미검증) / 8 rejected (기각)**. The
35
+ confirmed/unverified/rejected split is not decoration; it is the whole point. The **0 unverified
36
+ was *that run's* outcome, not a guarantee the bucket goes unused**: the unverified bucket is
37
+ load-bearing and must appear in the report whenever a finding receives no verifier votes. A
38
+ finding nobody could verify never gets reported as fact.
39
+
40
+ ## When to use
41
+
42
+ - The user wants a **whole-service health check**, not one narrow review — "전체 서비스를 점검하자",
43
+ "다차원 감사", "audit everything before launch / before we promote".
44
+ - You have the **Workflow / ultracode multi-agent capability** available (this skill assumes you
45
+ can fan out independent agents and re-aggregate). Without it, fall back to running the
46
+ dimensions sequentially yourself. Step 3 forbids a finder from grading itself, though, so
47
+ sequential mode **cannot produce a true Confirmed (확정)**. In that mode, never emit Confirmed verdicts: label
48
+ every finding Unverified (미검증) or evidence-backed-only (a failing test / exposed secret / reproduced
49
+ crash counts; an opinion does not), and state plainly in the report that no independent
50
+ verification ran. That keeps the no-false-ship invariant honest when fan-out is absent.
51
+ - You need an output that is **prioritized and trustworthy** — every claim graded, weak claims
52
+ sunk in the ranking, nothing over-claimed.
53
+
54
+ If the user only wants one axis, use the focused sibling instead (see Cross-references). This
55
+ skill is the superset; don't reach for it when a scalpel will do.
56
+
57
+ ## The seven dimensions (extensible)
58
+
59
+ Each dimension gets a **named, enumerated rubric** before any auditing starts, so findings are
60
+ checked against explicit criteria rather than vibes. This is the discipline behind heuristic
61
+ evaluation (Nielsen / NN/g): a violation of a named rule is a *candidate* defect, justified
62
+ against context — not an automatic one.
63
+
64
+ | # | Dimension | Rubric to hand the finder agent |
65
+ |---|-----------|----------------------------------|
66
+ | 1 | **Code** | correctness/logic, security (injection, authz, exposed secrets), readability, tests-that-fail-when-logic-breaks, design/architecture fit (SonarSource multi-axis review) |
67
+ | 2 | **UX** | Nielsen's 10 usability heuristics; rate severity by impact, not by rule-match count |
68
+ | 3 | **Scalability** | data-model limits, hot paths, statefulness, single points of failure, cost-per-unit growth |
69
+ | 4 | **Planning + North Star** | the one North Star metric + its Inputs (Amplitude); does each finding move the metric or an Input? |
70
+ | 5 | **Security** | secrets exposure, authz boundaries, dependency CVEs, input trust, data egress |
71
+ | 6 | **Promotion / Marketing** | Working-Backwards: take the product's *implied PR/FAQ* (its promised value) and audit whether the built service + its messaging actually deliver — surface over-claim / false-ship gaps |
72
+ | 7 | **+ Extensible** | add a dimension by giving it (a) a named rubric and (b) its own independent verifier. Nothing else changes |
73
+
74
+ The framework set is load-bearing, not ornamental:
75
+ - **Heuristic Evaluation / Nielsen's 10** — named criteria per dimension, independent evaluators,
76
+ severity by impact. <https://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/>
77
+ - **Multi-axis code review** — distinct correctness/security/tests/design axes beat one
78
+ "looks good". <https://www.sonarsource.com/resources/library/code-review/>
79
+ - **North Star Framework** — anchors dimension 4; filters strategically immaterial noise.
80
+ <https://amplitude.com/books/north-star/about-north-star-framework>
81
+ - **Working Backwards (PR/FAQ)** — the inverse-test for promotion + planning: promised value vs
82
+ delivered. <https://workingbackwards.com/concepts/working-backwards-pr-faq-process/>
83
+
84
+ ## Core workflow
85
+
86
+ Orchestrate this as a **Workflow**: fan-out → adversarial verify → synthesize.
87
+
88
+ ### 1. Scope and set North Star (pre-flight)
89
+ Read the service's SPEC/PRD/NORTH_STAR and recent state. Name the North Star metric and its
90
+ Inputs explicitly — they are the strategic anchor every finding will be tested against. If you
91
+ can't state the North Star, stop and ask; auditing dimensions in isolation with no anchor just
92
+ generates busywork.
93
+
94
+ ### 2. Fan-out: independent finder(s) per dimension
95
+ Spawn at least one finder agent for each dimension with **its own rubric** (table above) — and
96
+ spawn several per dimension where the surface is large. This is the heavyweight step: a real run
97
+ fans out to **many agents in parallel**, not a fixed seven. Run them
98
+ **independently**: NN/g's finding is that independent passes catch issues a single pass misses
99
+ (even 3 evaluators find only ≈ 60% of issues, so one agent per dimension is not enough on its
100
+ own, which is why step 3 exists). Each finder returns candidate findings with: dimension tag, the rubric item
101
+ violated, the evidence it actually observed, and a proposed severity.
102
+
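A minimal sketch of the fan-out, assuming each finder is a plain callable. `Candidate`, `fan_out`, and `stub_finder` are hypothetical names, and the thread pool only stands in for whatever agent-spawning mechanism the harness actually uses:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    dimension: str    # dimension tag
    rubric_item: str  # the rubric item violated
    evidence: str     # what the finder actually observed
    severity: str     # proposed severity (free-form label in this sketch)


Finder = Callable[[str, list[str]], list[Candidate]]


def fan_out(rubrics: dict[str, list[str]], finder: Finder) -> list[Candidate]:
    """Run one finder per dimension in parallel and flatten the results."""
    with ThreadPoolExecutor() as pool:
        # Each finder sees only its own dimension and rubric, never the others' output.
        batches = pool.map(lambda item: finder(item[0], item[1]), rubrics.items())
        return [c for batch in batches for c in batch]


def stub_finder(dimension: str, rubric: list[str]) -> list[Candidate]:
    # Hypothetical stand-in: a real finder would inspect the service itself.
    return [Candidate(dimension, rubric[0], "observed: <evidence here>", "high")]


candidates = fan_out(
    {"Code": ["correctness/logic"], "UX": ["Nielsen's 10 usability heuristics"]},
    stub_finder,
)
```

Spawning several finders per dimension would only change the keys passed in (for example one entry per sub-surface); the record shape stays the same.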
103
+ ### 3. Adversarial verify pass (the load-bearing step)
104
+ This is **a distinct second pass**, not the finders grading themselves. Re-order the reviewer
105
+ agents and give them *diverse* prompts/roles, then task them with peer-reviewing every round-one
106
+ assertion. This is Multi-Agent Verification (BoN-MAV): reliability scales at test time by
107
+ running multiple independent verifiers and accepting only what survives cross-validation.
108
+ <https://arxiv.org/pdf/2502.20379>
109
+
110
+ Decision rule per finding:
111
+ - **Confirmed (확정)** — survives verification, or carries irrefutable evidence (a failing test,
112
+ an exposed secret, a reproduced crash). Verifier consensus, not one voice.
113
+ - **Unverified (미검증)** — **0 adversarial-verify votes** and no hard evidence. Kept in a
114
+ separate bucket. **Never reported as fact.** This is the no-false-ship invariant
115
+ (`.claude/rules/no-false-ship.md`).
116
+ - **Rejected (기각)** — majority of verifiers refute it (rubric-match without real defect, wrong
117
+ reasoning, already handled). A majority refute *kills* the finding.
118
+
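The decision rule can be sketched as a small function. This is an illustration under stated assumptions, not the skill's actual implementation: `Votes`, `Verdict`, and `decide` are hypothetical names, hard evidence is assumed to override votes, "consensus" is read as at least two confirming verifiers forming a majority, and the `CONTESTED` outcome (votes exist but neither side wins) is an addition for the case the list does not cover.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"
    UNVERIFIED = "unverified"
    REJECTED = "rejected"
    CONTESTED = "contested"  # assumption: not one of the three verdicts named in the text


@dataclass(frozen=True)
class Votes:
    confirm: int = 0
    refute: int = 0
    hard_evidence: bool = False  # failing test, exposed secret, reproduced crash


def decide(v: Votes) -> Verdict:
    if v.hard_evidence:
        return Verdict.CONFIRMED       # assumption: irrefutable evidence overrides votes
    total = v.confirm + v.refute
    if total == 0:
        return Verdict.UNVERIFIED      # 0 votes and no hard evidence: never reported as fact
    if v.refute * 2 > total:
        return Verdict.REJECTED        # a majority refute kills the finding
    if v.confirm >= 2 and v.confirm * 2 > total:
        return Verdict.CONFIRMED       # consensus, not one voice
    return Verdict.CONTESTED           # candidate for escalation rather than more debate
```

For example, `decide(Votes(confirm=2, refute=1))` yields `Verdict.CONFIRMED`, while a single confirming vote with no other input stays `Verdict.CONTESTED` in this sketch.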
119
+ > Engineer verifier diversity deliberately. MAV names the failure mode that breaks the whole
120
+ > ensemble: **correlated-verifier collapse** — if every reviewer shares the same model, prompt,
121
+ > and blind spot, the adversarial pass rubber-stamps wrong findings and hands you false
122
+ > confidence. Vary roles, ordering, and prompts so verifiers don't share blind spots. If you
123
+ > cannot achieve independence, say so in the report and downgrade your confidence accordingly.
124
+
125
+ ### 4. Cap the loops
126
+ Verification cost scales with verifier count and debate rounds. Set a hard ceiling on
127
+ revision/debate iterations (this is a `gates-taxonomy` Revision gate — iteration cap mandatory)
128
+ and **escalate to the user rather than loop forever** on a contested finding. Unbounded debate
129
+ buys diminishing returns at runaway token cost.
130
+
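A bounded debate loop might look like the following sketch. The cap of 3 is an assumption (the text requires a ceiling but does not fix its value), and `resolve`, `run_round`, and `ESCALATE` are hypothetical names:

```python
from typing import Callable, Optional

MAX_ROUNDS = 3  # assumption: the text mandates a cap, not this particular value
ESCALATE = "escalate_to_user"


def resolve(
    finding: str,
    run_round: Callable[[str, int], Optional[str]],
    max_rounds: int = MAX_ROUNDS,
) -> str:
    """Run debate rounds until a verdict settles or the cap is hit."""
    for round_no in range(1, max_rounds + 1):
        verdict = run_round(finding, round_no)
        if verdict is not None:
            return verdict
    return ESCALATE  # cap reached: hand the contested finding to the user


calls = []


def never_settles(finding: str, round_no: int) -> Optional[str]:
    # Hypothetical round that never reaches agreement.
    calls.append(round_no)
    return None


outcome = resolve("contested scalability claim", never_settles)
```

Because the loop is a plain `for` over a fixed range, token cost per finding is bounded by `max_rounds` no matter how the verifiers behave.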
131
+ ### 5. Score surviving findings with RICE
132
+ Rank confirmed (and any carried-forward unverified) findings with RICE: **(Reach × Impact ×
133
+ Confidence) / Effort**, i.e. impact per unit of time worked.
134
+ <https://www.intercom.com/blog/rice-simple-prioritization-for-product-managers/>
135
+
136
+ The **Confidence multiplier is where the verification tier pays off** — map it directly:
137
+
138
+ | Verdict | RICE Confidence |
139
+ |---------|-----------------|
140
+ | Confirmed + hard evidence | 100% |
141
+ | Confirmed by verifier consensus | 80% |
142
+ | Unverified (carried, not dropped) | 50% or lower |
143
+
144
+ This makes weakly-evidenced findings **sink in the ranking automatically** — you carry them
145
+ honestly instead of either pretending they're certain or silently deleting them. Rank by raw
146
+ severity or gut feel and high-impact-but-unproven items jump the queue; the Confidence term
147
+ exists precisely to stop that.
148
+
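As a sketch of how the mapping feeds the score: the Reach, Impact, and Effort numbers below are made up for illustration, `Finding` and `rice` are hypothetical names, and 0.5 is used as the upper bound of the "50% or lower" tier.

```python
from dataclasses import dataclass

# Verdict tier -> RICE Confidence, following the table above.
CONFIDENCE = {
    "confirmed_hard_evidence": 1.0,
    "confirmed_consensus": 0.8,
    "unverified": 0.5,  # upper bound of "50% or lower"
}


@dataclass(frozen=True)
class Finding:
    title: str
    verdict: str
    reach: float
    impact: float
    effort: float


def rice(f: Finding) -> float:
    if f.effort <= 0:
        raise ValueError("effort must be positive")
    return f.reach * f.impact * CONFIDENCE[f.verdict] / f.effort


findings = [
    # Illustrative numbers only: the marketing gap has the larger raw impact...
    Finding("marketing gap", "unverified", reach=100, impact=3, effort=20),
    # ...but the code bug carries hard evidence.
    Finding("install crash", "confirmed_hard_evidence", reach=100, impact=2, effort=20),
]
ranked = sorted(findings, key=rice, reverse=True)
```

With these inputs the unverified item scores 7.5 and the confirmed one 10.0, so the better-evidenced finding ranks first even though its raw impact is lower.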
149
+ ### 6. Synthesize the M-numbered roadmap
150
+ Cluster surviving findings (affinity-style), trace each to the North Star or an Input (drop the
151
+ strategically immaterial), and emit a milestone roadmap with **as many milestones as the
152
+ findings warrant** — M1, M2, … however far the work runs (the real run landed at M4). There is
153
+ no fixed milestone count; severity and clustering decide it. Every roadmap item carries:
154
+ **dimension tag · verdict · RICE score · North-Star linkage · evidence pointer.** Items that
155
+ don't move the metric or an Input are flagged as nice-to-have, not milestone-blocking.
156
+
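One way to represent a roadmap item and apply the North-Star filter is sketched below. `RoadmapItem` and `split_roadmap` are hypothetical names, and the sketch only separates milestone-blocking items from nice-to-haves; how items are grouped into M1, M2, and so on is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoadmapItem:
    title: str
    dimension: str                  # dimension tag
    verdict: str                    # confirmed / unverified
    rice: float                     # RICE score
    north_star_link: Optional[str]  # metric or Input it moves; None if neither
    evidence: str                   # evidence pointer


def split_roadmap(
    items: list[RoadmapItem],
) -> tuple[list[RoadmapItem], list[RoadmapItem]]:
    """Return (milestone-blocking, nice-to-have), each sorted by RICE descending."""
    ranked = sorted(items, key=lambda i: i.rice, reverse=True)
    blocking = [i for i in ranked if i.north_star_link is not None]
    nice_to_have = [i for i in ranked if i.north_star_link is None]
    return blocking, nice_to_have


items = [
    RoadmapItem("cosmetic rename", "Code", "confirmed", 7.0, None, "review note"),
    RoadmapItem("wizard skips a category", "UX", "confirmed", 6.4,
                "wizard completion rate", "session recording"),
    RoadmapItem("install crash", "Code", "confirmed", 9.6,
                "install success rate", "failing test"),
]
blocking, nice_to_have = split_roadmap(items)
```

Note that the item without a North-Star link is set aside even though its RICE score is higher than one of the blocking items.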
157
+ ## No-false-ship evidence matrix
158
+
159
+ Because this audit *itself* can over-claim, report each dimension's verification the same way the
160
+ repo's `no-false-ship` rule demands for shipped features — per-path evidence, unverified shown as
161
+ unverified, never one path's evidence reused for another:
162
+
163
+ ```
164
+ | Dimension   | Finder evidence            | Verifier outcome        | Verdict    |
165
+ |-------------|----------------------------|-------------------------|------------|
166
+ | Code        | failing test repro'd       | 3/3 verifiers confirm   | Confirmed  |
167
+ | UX          | heuristic #4 violation     | 2/3 confirm, context ok | Confirmed  |
168
+ | Security    | suspected authz gap        | 0 verifier votes        | Unverified |
169
+ | Promotion   | README claim vs built      | majority refute         | Rejected   |
170
+ ```
171
+
172
+ A row with no verifier votes stays "Unverified" in the final report. Hiding it and declaring "audit
173
+ complete" is exactly the false-ship failure this skill exists to prevent.
174
+
175
+ ## Worked example (Input → Output)
176
+
177
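+ **Input** (user trigger, translated from Korean):
178
+ > "Let's audit the entire ultracode service: identify the problems in code, UX, scalability,
179
+ > planning, North Star metric, security, and promotion, then fix them in priority order."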
180
+
181
+ **Process:**
182
+ 1. Pre-flight: North Star = "weekly successful first-install completions"; Inputs = wizard
183
+ completion rate, CLI flag coverage, install success rate.
184
+ 2. Fan-out: 7 finders, each with its rubric. ~40 raw candidate findings.
185
+ 3. Adversarial verify (re-ordered, diverse verifiers): 29 survive, 8 refuted, several land in
186
+ Unverified with 0 votes and stay there.
187
+ 4. Loop cap hit on one contested scalability claim → escalated to user, not debated to death.
188
+ 5. RICE: a confirmed-with-failing-test code bug (Confidence 100%) outranks a plausible-but-
189
+ unverified marketing gap (Confidence 50%) even though the marketing gap *felt* bigger.
190
+ 6. Synthesize.
191
+
192
+ **Output** (abridged):
193
+ ```
194
+ Service Audit: Confirmed 29 / Unverified (carried) N / Rejected 8
195
+
196
+ M1 (now): [Code·Confirmed·RICE 9.6] install crash on --with-* flag (failing test attached)
197
+ [Security·Confirmed·RICE 8.1] secret in committed config; moves Input "install success"
198
+ M2: [UX·Confirmed·RICE 6.4] wizard step skips a category; moves Input "wizard completion"
199
+ M3: [Scale·Confirmed·RICE 4.2] category list hardcoded in 2 places → derive
200
+ M4: [Promotion·Confirmed·RICE 3.5] README over-claims a feature (Working-Backwards gap)
201
+ Parked: [Security·Unverified·conf 50%] suspected authz gap; needs reproduction before action
202
+ Rejected: 8 findings (rubric-match without real defect / already handled)
203
+ (the run stopped at M4 — milestone count follows the findings, it is not a fixed five)
204
+ ```
205
+ Every M-item: dimension-tagged, verdict-graded, North-Star-linked, RICE-ranked.
206
+
207
+ ## Cross-references (don't duplicate — hand off)
208
+
209
+ - **UX dimension** can spawn the multi-persona UX review skill (`multi-persona-review`) for
210
+ deeper persona-based heuristic inspection instead of a single UX finder.
211
+ - **Gap findings** (built vs promised, missing E2E coverage) hand off to `gap-analysis-e2e`
212
+ rather than being re-derived here.
213
+ - **The roadmap output** feeds `northstar-roadmap`, which owns milestone sequencing and
214
+ North-Star input modeling in depth.
215
+ - For repo discipline this skill enforces: `.claude/rules/no-false-ship.md` (evidence matrix,
216
+ confirmed/unverified/rejected) and `.claude/rules/gates-taxonomy.md` (Revision-loop cap,
217
+ Escalation on contested findings).
218
+
219
+ ## Progressive disclosure
220
+
221
+ This SKILL.md is the operating manual. If per-dimension rubrics need to grow (e.g. a full
222
+ Nielsen severity scale, or a language-specific code-review checklist), put them in a
223
+ `reference/` file beside this one and link it here — keep this file lean. The extensibility
224
+ contract stays: a new dimension = a named rubric + its own independent verifier, nothing else.