npm - product-playbook - Versions diffs - 1.2.9 → 1.2.10 - Mend

product-playbook 1.2.9 → 1.2.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

package/README.es.md +29 -2
package/README.ja.md +29 -2
package/README.ko.md +29 -2
package/README.md +29 -2
package/README.zh-CN.md +29 -2
package/README.zh-TW.md +29 -2
package/SKILL.md +14 -0
package/i18n/es/references/02a-persona.md +42 -0
package/i18n/es/references/02b-jtbd.md +15 -8
package/i18n/ja/references/02a-persona.md +42 -0
package/i18n/ja/references/02b-jtbd.md +15 -8
package/i18n/ko/references/02a-persona.md +42 -0
package/i18n/ko/references/02b-jtbd.md +15 -8
package/i18n/zh-CN/references/02a-persona.md +42 -0
package/i18n/zh-CN/references/02b-jtbd.md +15 -8
package/i18n/zh-TW/references/02a-persona.md +42 -0
package/i18n/zh-TW/references/02b-jtbd.md +15 -8
package/package.json +1 -1
package/references/02a-persona.md +42 -0
package/references/02b-jtbd.md +15 -8

package/README.es.md CHANGED

@@ -479,6 +479,33 @@ Una iteración de reducción de tokens. Misma semántica del contenido del skill
 **Replicado a 5 locales i18n** (zh-TW, zh-CN, ja, es, ko) preservando las traducciones existentes — el adelgazamiento estructural se aplicó de manera idéntica por idioma.
+### Iteración 7: Resiliencia del Harness de Evals (Sprint 1 + 2A, v1.2.9)
+Una iteración a nivel de harness, no a nivel de skill. La semántica del skill no cambió; lo que cambió es la *superficie que se mide*. Objetivo: hacer visible la línea base real de calidad desbloqueando 4 evals que venían produciendo veredictos 0/0 en silencio.
+**Sprint 1 — desbloquear los clusters no medibles (`d2023fb`, `cee67cb`):**
+Cuatro evals (`eval-jtbd-depth`, `eval-prfaq-output`, `eval-subagent-discovery`, `eval-subagent-premortem`) venían produciendo 0 pass / 0 fail por corrida — indistinguibles de "sin problemas" en el puntaje agregado. Tres causas:
+1. **Sub-agents faltantes en CI headless** — CI instalaba el skill en `~/.claude/skills/` pero nunca copiaba `agents/*.md` a `~/.claude/agents/`. `claude -p` por lo tanto no podía despachar vía `Task`, y el orchestrator corría inline en silencio.
+2. **El hook de specialist-dispatch silencioso bajo `claude -p`** — los `hooks/` a nivel de plugin no se cargan en modo headless; sólo se cargan los UserPromptSubmit hooks de `~/.claude/settings.json` a nivel de usuario. CI ahora registra programáticamente el dispatch hook a nivel de usuario antes de cada corrida behavioral.
+3. **Timeouts de response + judge demasiado agresivos** — 180s response / 120s judge cortaban en medio las salidas largas de Discovery y Pre-mortem; el judge entonces veía un string truncado y emitía 0/0. Subido a 600s / 240s con un reintento ante salidas no-JSON.
+También se eliminaron las expectations procedurales tipo "el orchestrator delega vía Task tool" de los evals 10/11/12 — esas son inverificables en `claude -p` (sin superficie de Task anidado) y no son la propiedad que finalmente nos importa. Las expectations restantes apuntan a la *calidad de output* que el specialist habría producido.
+**Sprint 2A — robustez del judge + techo de CI (`f973939`):**
+Dos correcciones de seguimiento del code review del PR #9:
+1. **El reintento de repair del judge preserva el contexto original** — `claude -p` es stateless, así que el repair prompt ahora vuelve a incluir el `judge_prompt` original completo (response + expectations) más la salida malformed anterior. Una nueva verificación `_judge_output_complete()` rechaza payloads que no tengan exactamente N expectations indexadas, evitando que el modelo emita un veredicto plausible-pero-fabricado cuando la salida del primer call es irrecuperable.
+2. **Timeout del job `behavioral-eval` de CI 90 → 120 min** — el peor caso = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 min, así que el techo previo de 90 min podía cancelar en silencio una corrida válida. 120 min deja ~10 min de margen para setup + artifact upload.
+**Línea base recién visible** (corrida local, 2026-05-28): **0 / 100** `at-risk`, **13 / 33** expectations pasando, **6 critical + 14 warning** failures. El puntaje agregado no regresó — lo que regresó es el puntaje *visible*, porque cuatro evals que antes contribuían 0/0 ahora producen señal real. Los 6 critical failures son el backlog explícito de Stage 2: JTBD de 3 capas (funcional / emocional / social), Jobs a nivel organizacional B2B, separación de persona buyer vs user en B2B, guardarraíles de scope de Discovery, y disciplina de leading-indicator en pre-mortem. Ver [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) para el desglose por expectation.
+**Las mejoras de harness viven en `evals/` y `.github/workflows/` — no se publican a npm.** No hace falta version bump más allá de v1.2.9 (que ya cargó el hook a nivel de usuario y las ediciones de scope a los evals 10/11/12).
+**Replicado a 5 locales i18n** (zh-TW, zh-CN, ja, es, ko).
 ---
 ## 🧪 Desarrollo y Evals
@@ -487,7 +514,7 @@ El directorio `evals/` incluye dos suites de pruebas complementarias y un scorer
 **Local (gratis, recomendado)**: ejecuta los mismos scripts con el CLI `claude` autenticado con tu suscripción Claude Pro/Max (un solo `claude login`). Sin API key, sin costo marginal. El sistema de eval está diseñado para correr localmente antes de cada release.
-**CI (opcional, pago)**: `.github/workflows/eval-gate.yml` ejecuta ambas suites en cada PR y en cada push a `main` que cambie `package.json`, y reporta el puntaje en el Job Summary del workflow. **No bloquea merge ni publish** — el maintainer decide si actuar ante regresiones. CI requiere el secret `ANTHROPIC_API_KEY` (GitHub Actions no puede usar OAuth en un contenedor headless); sin el secret, los jobs de eval **se omiten limpiamente** (gris ⏭️) en lugar de fallar en rojo.
+**CI (opcional, sin costo adicional)**: `.github/workflows/eval-gate.yml` ejecuta ambas suites en cada PR y en cada push a `main` que cambie `package.json`, y reporta el puntaje en el Job Summary del workflow. **No bloquea merge ni publish** — el maintainer decide si actuar ante regresiones. CI también usa tu suscripción Claude Pro/Max (sin API key, sin costo por token): configuración única, ejecuta `claude setup-token` localmente y agregá el token impreso como secret `CLAUDE_CODE_OAUTH_TOKEN` del repo. Sin el secret, los jobs de eval **se omiten limpiamente** (gris ⏭️) en lugar de fallar en rojo.
 ### Ejecución local
@@ -508,7 +535,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # solo reporta, sin exit 1
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-Los runs locales usan `--runs 3` por defecto (mayoría absorbe la variabilidad del LLM). El CLI `claude` usa tu sesión OAuth de Claude Pro/Max (`claude login`), sin costo por token. CI usa `--runs 1` y requiere el secret `ANTHROPIC_API_KEY`.
+Los runs locales usan `--runs 3` por defecto (mayoría absorbe la variabilidad del LLM). El CLI `claude` usa tu sesión OAuth de Claude Pro/Max (`claude login`), sin costo por token. CI usa `--runs 1` y la misma suscripción vía un secret `CLAUDE_CODE_OAUTH_TOKEN` (generado una sola vez con `claude setup-token`).
 ### Severity y scoring
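
The worst-case runtime quoted in the Sprint 2A note of this hunk (12 evals / 2 workers × (600s + 240s + 240s) ≈ 108 min) can be re-derived with a few lines of Python. This is only a worked check of the arithmetic: the figures come from the README text, while the function and constant names are illustrative assumptions and not part of the package.

```python
import math

# Figures quoted in the README; the names are hypothetical.
EVALS = 12
WORKERS = 2
RESPONSE_S = 600   # response timeout per eval
JUDGE_S = 240      # judge timeout per eval
REPAIR_S = 240     # one judge repair retry per eval


def worst_case_minutes(evals: int, workers: int, per_eval_seconds: int) -> float:
    """Worst case: every eval hits every timeout, and each worker
    processes ceil(evals / workers) evals back to back."""
    sequential_evals = math.ceil(evals / workers)
    return sequential_evals * per_eval_seconds / 60


per_eval = RESPONSE_S + JUDGE_S + REPAIR_S          # 1080 seconds
worst = worst_case_minutes(EVALS, WORKERS, per_eval)

print(worst)         # 108.0
print(worst > 90)    # True: the old 90-minute ceiling could cancel a valid run
print(120 - worst)   # 12.0 minutes left under the 120-minute ceiling
```

The computed margin is 12 minutes, which the README rounds to "~10 min" of headroom for setup and artifact upload.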

package/README.ja.md CHANGED

@@ -480,6 +480,33 @@ token 削減イテレーション。スキル内容のセマンティクスは
 **5 つの i18n ロケール（zh-TW、zh-CN、ja、es、ko）にミラー** — 既存の翻訳を保持しつつ、構造的なスリム化を言語ごとに同一に適用。
+### イテレーション7:Eval Harness のレジリエンス強化（Sprint 1 + 2A、v1.2.9）
+ハーネス層のイテレーションであって、スキル層ではない。スキルのセマンティクスは変わっていない。変わったのは**測定対象の表面積**。目標は、ずっと 0/0 verdict を黙って出していた 4 つの eval を解除し、真の品質ベースラインを浮かび上がらせること。
+**Sprint 1 — 測定不能だったクラスタの解除（`d2023fb`、`cee67cb`):**
+4 つの eval(`eval-jtbd-depth`、`eval-prfaq-output`、`eval-subagent-discovery`、`eval-subagent-premortem`)は毎回 0 pass / 0 fail を返しており、集計スコアでは「問題なし」と区別がつかなかった。3 つの原因:
+1. **headless CI で sub-agent が欠落** — CI はスキルを `~/.claude/skills/` にインストールしていたが、`agents/*.md` を `~/.claude/agents/` にコピーしていなかった。そのため `claude -p` は `Task` 経由で dispatch できず、orchestrator が黙って inline で実行していた。
+2. **`claude -p` 下で specialist-dispatch hook が無音** — plugin レベルの `hooks/` は headless モードでは読み込まれず、user レベルの `~/.claude/settings.json` の UserPromptSubmit hook のみが読み込まれる。CI は各 behavioral run の前に dispatch hook を user レベルにプログラム的に登録するようになった。
+3. **Response + judge の timeout が短すぎた** — 180s response / 120s judge では長文の Discovery / Pre-mortem 出力が途中で切れ、judge は切り詰められた文字列を見て 0/0 を吐いていた。600s / 240s に引き上げ、非 JSON 出力時には 1 回リトライ。
+また evals 10/11/12 から「orchestrator が Task ツール経由で dispatch する」という手続き的 expectation も削除した — `claude -p` には nested Task の表面がなく検証不能で、最終的に我々が気にする性質でもない。残りの expectation は specialist が産出すべき**アウトプット品質**を対象とする。
+**Sprint 2A — judge のロバストネス + CI 上限(`f973939`):**
+PR #9 のコードレビューからの 2 つのフォローアップ修正:
+1. **Judge 修復リトライがオリジナルの context を保持** — `claude -p` はステートレスなので、修復 prompt は完全な元の `judge_prompt`(response + expectations)と前回の malformed output を再投入するようになった。新しい `_judge_output_complete()` チェックは N 個の indexed expectation がぴったり揃っていないペイロードを拒否し、初回出力が回復不能なときに model が「形だけ整った捏造 verdict」を吐くのを防ぐ。
+2. **CI `behavioral-eval` ジョブの timeout 90 → 120 分** — 最悪ケース = 12 evals / 2 workers × (600s response + 240s judge + 240s repair)≈ 108 分なので、以前の 90 分上限は有効な run を黙って cancel する可能性があった。120 分は setup + artifact upload に ~10 分の余裕を残す。
+**新たに可視化されたベースライン**(ローカル run、2026-05-28):**0 / 100** `at-risk`、**13 / 33** expectation がパス、**6 critical + 14 warning** の失敗。集計スコアは退行していない — 退行したのは**可視**スコアで、これまで 0/0 を貢献していた 4 つの eval が今は実シグナルを返すようになったため。この 6 つの critical 失敗が Stage 2 の明示的な backlog:3 層 JTBD(functional / emotional / social)、B2B 組織レベルの Jobs、B2B buyer vs user persona の分離、Discovery scope のガードレール、pre-mortem の leading-indicator 規律。expectation 単位の内訳は [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) を参照。
+**ハーネス改善は `evals/` と `.github/workflows/` に住み、npm には出荷されない。** v1.2.9 を超えるバージョンバンプは不要(v1.2.9 にはすでに user-level hook と evals 10/11/12 の scope 調整が含まれている)。
+**5 つの i18n ロケール(zh-TW、zh-CN、ja、es、ko)にミラー**。
 ---
 ## 🧪 開発と評価
@@ -488,7 +515,7 @@ token 削減イテレーション。スキル内容のセマンティクスは
 **ローカル（無料、推奨）**：`claude` CLI を Claude Pro/Max サブスクリプションで認証して（一度だけ `claude login`）同じスクリプトを実行できます。API key 不要、追加コストなし。eval システムは各リリース前にローカルで実行する設計です。
-**CI（オプション、有料）**：`.github/workflows/eval-gate.yml` はすべての PR と `package.json` を変更する `main` への push で両方を実行し、スコアを workflow の Job Summary に書き込みます。**merge も publish もブロックしません** — リグレッションに対応するかどうかはメンテナが判断します。CI には `ANTHROPIC_API_KEY` secret が必要です（GitHub Actions は headless コンテナで OAuth が使えません）。secret 未設定時は eval job が**クリーンに skip**（グレー ⏭️）され、誤解を招く赤バツは出ません。
+**CI（オプション、追加課金なし）**：`.github/workflows/eval-gate.yml` はすべての PR と `package.json` を変更する `main` への push で両方を実行し、スコアを workflow の Job Summary に書き込みます。**merge も publish もブロックしません** — リグレッションに対応するかどうかはメンテナが判断します。CI もあなたの Claude Pro/Max サブスクリプションを使用します（API key 不要、トークン課金なし）：ローカルで `claude setup-token` を一度実行し、出力されたトークンを repo secret `CLAUDE_CODE_OAUTH_TOKEN` として追加してください。secret 未設定時は eval job が**クリーンに skip**（グレー ⏭️）され、誤解を招く赤バツは出ません。
 ### ローカル実行
@@ -509,7 +536,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # レポートのみ、exi
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-ローカルは `--runs 3` がデフォルト（多数決で LLM のばらつきを吸収）。`claude` CLI は Claude Pro/Max の OAuth セッション（`claude login`）を使うため、トークン課金はありません。CI は `--runs 1` で、`ANTHROPIC_API_KEY` secret が必要です。
+ローカルは `--runs 3` がデフォルト（多数決で LLM のばらつきを吸収）。`claude` CLI は Claude Pro/Max の OAuth セッション（`claude login`）を使うため、トークン課金はありません。CI は `--runs 1` で、同じサブスクリプションを `CLAUDE_CODE_OAUTH_TOKEN` secret 経由で利用します（`claude setup-token` で一度だけ生成）。
 ### Severity とスコアリング

package/README.ko.md CHANGED

@@ -479,6 +479,33 @@ v1.2.0+ 에서 도입된 3개의 전문 sub-agent (`discovery-specialist`, `stra
 **5개 i18n 로케일 (zh-TW, zh-CN, ja, es, ko) 에 미러링** — 기존 번역을 보존하며, 구조적 슬림화는 언어별로 동일하게 적용.
+### 반복 7: Eval Harness 회복탄력성 강화 (Sprint 1 + 2A, v1.2.9)
+스킬 레벨이 아니라 harness 레벨의 반복. 스킬의 의미는 바뀌지 않았고, 바뀐 것은 **측정되는 표면**. 목표는 0/0 verdict 만 조용히 내고 있던 4개 eval 의 차단을 풀어, 진짜 품질 베이스라인을 드러내는 것.
+**Sprint 1 — 측정 불능 클러스터 해제 (`d2023fb`, `cee67cb`):**
+4개 eval (`eval-jtbd-depth`, `eval-prfaq-output`, `eval-subagent-discovery`, `eval-subagent-premortem`) 이 매번 0 pass / 0 fail 을 내고 있어, 집계 점수에서 「문제 없음」과 구분이 불가능했음. 세 가지 원인:
+1. **headless CI 에서 sub-agent 누락** — CI 가 스킬을 `~/.claude/skills/` 에 설치하면서 `agents/*.md` 를 `~/.claude/agents/` 에 복사하지 않았음. 그래서 `claude -p` 는 `Task` 로 dispatch 할 수 없었고, orchestrator 가 조용히 inline 으로 실행했음.
+2. **`claude -p` 에서 specialist-dispatch hook 이 무음** — 플러그인 레벨 `hooks/` 는 headless 모드에서 로드되지 않으며, user 레벨 `~/.claude/settings.json` 의 UserPromptSubmit hook 만 로드됨. CI 는 이제 각 behavioral run 전에 dispatch hook 을 프로그램적으로 user 레벨에 등록함.
+3. **Response + judge timeout 이 너무 빡빡함** — 180s response / 120s judge 가 장문의 Discovery / Pre-mortem 출력을 중간에 잘랐고, judge 는 잘린 문자열을 보고 0/0 을 뱉었음. 600s / 240s 로 올리고, 비-JSON 출력 시 1회 재시도.
+또한 evals 10/11/12 에서 「orchestrator 가 Task tool 로 dispatch」 같은 절차적 expectation 을 제거 — `claude -p` 에 nested Task 표면이 없어 검증 불가하며, 우리가 최종적으로 신경 쓰는 성질도 아님. 남은 expectation 은 specialist 가 산출했어야 할 **출력 품질**을 대상으로 함.
+**Sprint 2A — judge 견고성 + CI 상한 (`f973939`):**
+PR #9 코드 리뷰의 두 가지 후속 수정:
+1. **Judge repair 재시도가 원본 context 보존** — `claude -p` 는 stateless 이므로, repair prompt 는 원본 `judge_prompt` (response + expectations) 전체와 이전 malformed output 을 다시 포함함. 새 `_judge_output_complete()` 체크가 정확히 N 개의 indexed expectation 이 없는 payload 를 거부하여, 첫 호출 출력이 복구 불가능할 때 model 이 「형태만 그럴듯한 위조 verdict」를 내는 것을 방지.
+2. **CI `behavioral-eval` 작업 timeout 90 → 120 분** — 최악 = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 분이므로, 이전 90분 상한은 유효한 run 을 조용히 cancel 할 수 있었음. 120분은 setup + artifact upload 에 ~10분 여유.
+**새로 가시화된 베이스라인** (로컬 run, 2026-05-28): **0 / 100** `at-risk`, **13 / 33** expectation 통과, **6 critical + 14 warning** 실패. 집계 점수가 퇴보한 것이 아니라, **가시적**인 점수가 퇴보한 것 — 이전에 0/0 을 기여하던 4개 eval 이 이제 실제 signal 을 냄. 이 6개 critical 실패가 Stage 2 의 명시적 backlog: 3계층 JTBD (functional / emotional / social), B2B 조직 수준 Jobs, B2B buyer vs user persona 분리, Discovery scope 가드레일, pre-mortem leading-indicator 규율. expectation 단위 내역은 [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) 참조.
+**Harness 개선은 `evals/` 와 `.github/workflows/` 에 거주하며, npm 으로 출하되지 않음.** v1.2.9 이상의 버전 bump 불필요 (v1.2.9 가 이미 user-level hook 과 evals 10/11/12 의 scope 조정을 포함).
+**5개 i18n 로케일 (zh-TW, zh-CN, ja, es, ko) 에 미러링**.
 ---
 ## 🧪 개발 및 평가
@@ -487,7 +514,7 @@ v1.2.0+ 에서 도입된 3개의 전문 sub-agent (`discovery-specialist`, `stra
 **로컬 (무료, 권장)**: `claude` CLI를 Claude Pro/Max 구독으로 인증해서 (한 번만 `claude login`) 같은 스크립트를 실행합니다. API key 불필요, 추가 비용 없음. eval 시스템은 각 릴리스 전에 로컬에서 실행하도록 설계되었습니다.
-**CI (선택, 유료)**: `.github/workflows/eval-gate.yml`은 모든 PR과 `package.json`을 변경하는 `main`으로의 push에서 두 가지 모두 실행하고, 점수를 workflow Job Summary에 기록합니다. **merge도 publish도 차단하지 않습니다** — 회귀에 대응할지는 유지보수자가 판단합니다. CI는 `ANTHROPIC_API_KEY` secret이 필요합니다 (GitHub Actions는 headless 컨테이너에서 OAuth를 사용할 수 없습니다). secret 미설정 시 eval job은 **깨끗하게 skip**(회색 ⏭️)되며 오해를 일으키는 빨간색 X는 표시되지 않습니다.
+**CI (선택, 추가 과금 없음)**: `.github/workflows/eval-gate.yml`은 모든 PR과 `package.json`을 변경하는 `main`으로의 push에서 두 가지 모두 실행하고, 점수를 workflow Job Summary에 기록합니다. **merge도 publish도 차단하지 않습니다** — 회귀에 대응할지는 유지보수자가 판단합니다. CI도 Claude Pro/Max 구독을 사용합니다 (API key 불필요, 토큰 과금 없음): 로컬에서 `claude setup-token`을 한 번 실행하고, 출력된 토큰을 repo secret `CLAUDE_CODE_OAUTH_TOKEN`으로 추가하세요. secret 미설정 시 eval job은 **깨끗하게 skip**(회색 ⏭️)되며 오해를 일으키는 빨간색 X는 표시되지 않습니다.
 ### 로컬 실행
@@ -508,7 +535,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # 보고만, exit 1 없음
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-로컬은 `--runs 3`이 기본값(다수결로 LLM 변동성 흡수). `claude` CLI는 Claude Pro/Max OAuth 세션(`claude login`)을 사용하므로 토큰당 비용이 없습니다. CI는 `--runs 1`을 사용하며 `ANTHROPIC_API_KEY` secret이 필요합니다.
+로컬은 `--runs 3`이 기본값(다수결로 LLM 변동성 흡수). `claude` CLI는 Claude Pro/Max OAuth 세션(`claude login`)을 사용하므로 토큰당 비용이 없습니다. CI는 `--runs 1`을 사용하며 동일한 구독을 `CLAUDE_CODE_OAUTH_TOKEN` secret을 통해 인증합니다 (`claude setup-token`으로 한 번만 생성).
 ### Severity 및 스코어링

package/README.md CHANGED

@@ -477,6 +477,33 @@ A token-reduction iteration. Same skill content semantics, smaller footprint per
 **Mirrored to 5 i18n locales** (zh-TW, zh-CN, ja, es, ko) preserving existing translations — structural slim applied identically per language.
+### Iteration 7: Eval Harness Resilience (Sprint 1 + 2A, v1.2.9)
+A harness-level iteration, not a skill-level one. No skill semantics changed; the *surface area being measured* did. Goal: surface the real quality baseline by unblocking 4 evals that had been silently producing 0/0 verdicts.
+**Sprint 1 — unblock unmeasurable clusters (`d2023fb`, `cee67cb`):**
+Four evals (`eval-jtbd-depth`, `eval-prfaq-output`, `eval-subagent-discovery`, `eval-subagent-premortem`) had been producing 0 passes / 0 fails per run — indistinguishable from "no problems" in the aggregate score. Three causes:
+1. **Sub-agents missing in headless CI** — CI installed the skill at `~/.claude/skills/` but never copied `agents/*.md` to `~/.claude/agents/`. `claude -p` therefore couldn't dispatch via `Task`, and the orchestrator silently inline-ran.
+2. **Specialist-dispatch hook silent under `claude -p`** — plugin-level `hooks/` are not loaded in headless mode; only user-level `~/.claude/settings.json` UserPromptSubmit hooks are. CI now programmatically registers the dispatch hook at the user level before each behavioral run.
+3. **Response + judge timeouts too aggressive** — 180s response / 120s judge cut off long-form Discovery and Pre-mortem outputs mid-thought; the judge then saw a truncated string and emitted 0/0. Bumped to 600s / 240s with a single retry on non-JSON output.
+Also dropped procedural "orchestrator delegates via Task tool" expectations from evals 10/11/12 — those are unverifiable in `claude -p` (no nested Task surface) and not the property we ultimately care about. The remaining expectations target the *output quality* the specialist would have produced.
+**Sprint 2A — judge robustness + CI ceiling (`f973939`):**
+Two follow-on fixes from PR #9 code review:
+1. **Judge repair retry preserves original context** — `claude -p` is stateless, so the repair prompt now re-includes the full original `judge_prompt` (response + expectations) plus the previous malformed output. A new `_judge_output_complete()` check rejects payloads that don't have exactly N indexed expectations, preventing the model from emitting a plausibly-shaped but fabricated verdict when the first call's output is unrecoverable.
+2. **CI `behavioral-eval` job timeout 90 → 120 min** — worst case = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 min, so the previous 90-min ceiling could silently cancel an otherwise valid run. 120 min leaves ~10 min headroom for setup + artifact upload.
+**Newly visible baseline** (local run, 2026-05-28): **0 / 100** `at-risk`, **13 / 33** expectations passing, **6 critical + 14 warning** failures. The aggregate score did not regress — what regressed is the *visible* score, because four evals that previously contributed 0/0 now produce real signal. The 6 critical failures are now the explicit Stage 2 backlog: 3-layer JTBD (functional / emotional / social), B2B organization-level Jobs, B2B buyer-vs-user persona separation, Discovery-scope guardrails, and pre-mortem leading-indicator discipline. See [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) for the per-expectation breakdown.
+**Harness improvements live in `evals/` and `.github/workflows/` — they do not ship to npm.** No version bump beyond v1.2.9 (which carried the user-level hook + scope edits to evals 10/11/12).
+**Mirrored to 5 i18n locales** (zh-TW, zh-CN, ja, es, ko).
 ---
 ## 🧪 Development & Evals
@@ -485,7 +512,7 @@ The `evals/` directory ships two complementary test suites and a deterministic s
 **Local (free, recommended):** run the same scripts with the `claude` CLI authenticated via your Claude Pro/Max subscription (`claude login` once). No API key, no marginal cost. The eval system is designed to be run locally before each release.
-**CI (optional, paid):** `.github/workflows/eval-gate.yml` will run both suites on every PR and on every push to `main` that changes `package.json`, then report the score to the workflow Job Summary. It **never blocks merge or publish** — the maintainer decides whether to act on regressions. CI requires an `ANTHROPIC_API_KEY` secret because GitHub Actions cannot use OAuth in a headless container; without the secret, eval jobs **skip cleanly** (gray ⏭️) instead of failing red.
+**CI (optional, no extra billing):** `.github/workflows/eval-gate.yml` will run both suites on every PR and on every push to `main` that changes `package.json`, then report the score to the workflow Job Summary. It **never blocks merge or publish** — the maintainer decides whether to act on regressions. CI runs on your Claude Pro/Max subscription (no API key, no per-token cost): one-time setup is `claude setup-token` locally, then add the printed token as repo secret `CLAUDE_CODE_OAUTH_TOKEN`. Without the secret, eval jobs **skip cleanly** (gray ⏭️) instead of failing red.
 ### Running locally
@@ -506,7 +533,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # report without exit 1
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-Local runs default to `--runs 3` (majority vote handles LLM variance); the `claude` CLI uses your Claude Pro/Max OAuth session (`claude login`), so there's no per-token cost. CI uses `--runs 1` and requires the `ANTHROPIC_API_KEY` secret.
+Local runs default to `--runs 3` (majority vote handles LLM variance); the `claude` CLI uses your Claude Pro/Max OAuth session (`claude login`), so there's no per-token cost. CI uses `--runs 1` and the same subscription via a `CLAUDE_CODE_OAUTH_TOKEN` secret (generated once with `claude setup-token`).
 ### Severity & scoring
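
The Sprint 2A note in this hunk describes two judge-side behaviors: a repair prompt that re-includes the full original `judge_prompt` (because `claude -p` is stateless), and a completeness check that rejects payloads without exactly N indexed expectations. The diff does not show the harness code, so the following is a minimal sketch under stated assumptions. The payload shape (`{"expectations": [{"index": ..., "passed": ...}]}`), the function names, and the prompt wording are all hypothetical and may differ from the real `_judge_output_complete()`.

```python
import json


def build_repair_prompt(judge_prompt: str, malformed_output: str) -> str:
    """Re-send the whole original prompt plus the bad output, since a
    stateless call has no memory of the first attempt. Wording is hypothetical."""
    return (
        f"{judge_prompt}\n\n"
        "Your previous reply could not be parsed:\n"
        f"{malformed_output}\n\n"
        "Reply again with valid JSON only."
    )


def judge_output_complete(raw: str, expected_count: int) -> bool:
    """Accept a verdict only if it parses as JSON and covers indices
    0..expected_count-1 exactly once (assumed payload shape)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    items = payload.get("expectations")
    if not isinstance(items, list) or len(items) != expected_count:
        return False
    indices = []
    for item in items:
        if not isinstance(item, dict) or not isinstance(item.get("passed"), bool):
            return False
        idx = item.get("index")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(idx, int) or isinstance(idx, bool):
            return False
        indices.append(idx)
    return sorted(indices) == list(range(expected_count))
```

Checking the index set, and not only the list length, is what would stop a reply with the right number of entries but duplicated or missing indices from being counted as a real verdict.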

package/README.zh-CN.md CHANGED

@@ -479,6 +479,33 @@ Claude Code 会自动：
 **已同步至 5 个 i18n 语系**（zh-TW、zh-CN、ja、es、ko），保留既有译文 —— 结构性瘦身按语系一致套用。
+### Iteration 7：Eval Harness 韧性强化（Sprint 1 + 2A，v1.2.9）
+Harness 层迭代，不是 skill 层。Skill 语意没变，变的是**被测量的表面**。目标：解除 4 个一直在悄悄产出 0/0 verdict 的 eval，让真实品质基线浮出水面。
+**Sprint 1 — 解锁无法测量的群集（`d2023fb`、`cee67cb`）：**
+4 个 eval（`eval-jtbd-depth`、`eval-prfaq-output`、`eval-subagent-discovery`、`eval-subagent-premortem`）每次都产出 0 pass / 0 fail，在汇总分数中与「没问题」无法区分。三个原因：
+1. **CI headless 模式缺少 sub-agent** —— CI 把 skill 装到 `~/.claude/skills/`，却没把 `agents/*.md` 复制到 `~/.claude/agents/`。`claude -p` 因此无法透过 `Task` 派发，orchestrator 只能默默 inline 执行。
+2. **Specialist-dispatch hook 在 `claude -p` 不会载入** —— plugin 层的 `hooks/` 在 headless 模式不会载入，只有 user 层 `~/.claude/settings.json` 的 UserPromptSubmit hook 会。CI 现在会在每次 behavioral run 之前以程序方式把 dispatch hook 注册到 user 层。
+3. **Response + judge timeout 太紧** —— 180s response / 120s judge 会把长篇 Discovery、Pre-mortem 输出中途切断，judge 看到截断字串就吐出 0/0。提升到 600s / 240s，且非 JSON 输出时重试一次。
+同时也从 evals 10/11/12 删掉「orchestrator 必须透过 Task 派发」这类程序性 expectation —— 在 `claude -p` 没有 nested Task 介面，无法验证，也不是我们最终在意的性质。留下的 expectation 都针对 specialist 应产出的**输出质量**。
+**Sprint 2A — judge 韧性 + CI 上限（`f973939`）：**
+PR #9 review 之后的两个跟进修正：
+1. **Judge 修复重试保留原始 context** —— `claude -p` 是无状态的，所以修复 prompt 现在会重新带入完整原始 `judge_prompt`（response + expectations）加上前一次的 malformed output。新的 `_judge_output_complete()` 检查会拒绝「没有完整 N 个 indexed expectation」的回应，避免 model 在第一次输出无法救援时凭空捏造一份看起来合理的 verdict。
+2. **CI `behavioral-eval` job timeout 90 → 120 分钟** —— 最坏情况 = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 分钟，先前 90 分钟上限可能默默 cancel 整轮 run。120 分钟给 setup + artifact upload 留 ~10 分钟余裕。
+**新可见的基线**（本机 run，2026-05-28）：**0 / 100** `at-risk`、**13 / 33** expectation 通过、**6 critical + 14 warning** 失败。汇总分数并没有退步，退的是**可见**分数 —— 四个原本贡献 0/0 的 eval 现在开始产出真实 signal。这 6 个 critical 失败就是 Stage 2 明确的待修清单：三层 JTBD（functional / emotional / social）、B2B 组织层 Jobs、B2B buyer vs user persona 分离、Discovery scope 守备、pre-mortem leading-indicator 纪律。逐项细节见 [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md)。
+**Harness 改进住在 `evals/` 与 `.github/workflows/`，不会发到 npm。** 版本不需要再往 v1.2.9 之上 bump（v1.2.9 已经包含 user-level hook 与 evals 10/11/12 的 scope 调整）。
+**已同步至 5 个 i18n 语系**（zh-TW、zh-CN、ja、es、ko）。
 ---
 ## 🧪 开发与评测
@@ -487,7 +514,7 @@ Claude Code 会自动：
 **本地（免费，推荐）**：用 `claude` CLI 搭配你的 Claude Pro/Max 订阅（先 `claude login` 一次）跑这些 script。不需要 API key、没有额外成本。整套 eval 系统就是设计来在每次发版前本地跑一遍。
-**CI（可选，付费）**：`.github/workflows/eval-gate.yml` 会在每个 PR 与每次 push 到 `main`（含 `package.json` 变动）时跑这两套，把分数写进 workflow 的 Job Summary。**不挡 merge、不挡 publish** — 看到结果后由维护者决定要不要调整。CI 需要 `ANTHROPIC_API_KEY` secret（GitHub Actions 在 headless 容器无法走 OAuth）；没设 secret 时 eval job **会干净地 skip**（灰色 ⏭️），不会出现误导的红叉。
+**CI（可选，不额外计费）**：`.github/workflows/eval-gate.yml` 会在每个 PR 与每次 push 到 `main`（含 `package.json` 变动）时跑这两套，把分数写进 workflow 的 Job Summary。**不挡 merge、不挡 publish** — 看到结果后由维护者决定要不要调整。CI 同样走你的 Claude Pro/Max 订阅（不需 API key、没有按 token 计费的成本）：一次性设置为本机 `claude setup-token` 生成长期 token，把它加进 repo secret `CLAUDE_CODE_OAUTH_TOKEN`。没设 secret 时 eval job **会干净地 skip**（灰色 ⏭️），不会出现误导的红叉。
 ### 本地执行
@@ -508,7 +535,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # 只报告，不 exit 1
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-本地默认 `--runs 3`（多数决可吸收 LLM 变异性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），没有按 token 计费的成本。CI 用 `--runs 1` 并需要 `ANTHROPIC_API_KEY` secret。
+本地默认 `--runs 3`（多数决可吸收 LLM 变异性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），没有按 token 计费的成本。CI 用 `--runs 1`，靠同一个订阅通过 `CLAUDE_CODE_OAUTH_TOKEN` secret 认证（用 `claude setup-token` 一次性生成）。
 ### Severity 与计分

package/README.zh-TW.md CHANGED

@@ -478,6 +478,33 @@ Claude Code 會自動：
 **5 個 i18n 語系同步**(zh-TW、zh-CN、ja、es、ko),保留既有翻譯——結構性瘦身在各語系等比例套用。
+### Iteration 7:Eval Harness 韌性強化(Sprint 1 + 2A,v1.2.9)
+Harness 層的迭代,不是 skill 層。Skill 語意沒變,變的是**被測量的表面**。目標:解除 4 個一直在悄悄產出 0/0 verdict 的 eval,讓真實品質基線浮出水面。
+**Sprint 1 — 解鎖無法測量的群集(`d2023fb`、`cee67cb`):**
+4 個 eval(`eval-jtbd-depth`、`eval-prfaq-output`、`eval-subagent-discovery`、`eval-subagent-premortem`)每次都產出 0 pass / 0 fail,在彙總分數中與「沒問題」無法區分。三個原因:
+1. **CI headless 模式缺少 sub-agent** — CI 把 skill 裝到 `~/.claude/skills/`,卻沒把 `agents/*.md` 複製到 `~/.claude/agents/`。`claude -p` 因此無法透過 `Task` 派發,orchestrator 只能默默 inline 執行。
+2. **Specialist-dispatch hook 在 `claude -p` 不會載入** — plugin 層的 `hooks/` 在 headless 模式不會載入,只有 user 層 `~/.claude/settings.json` 的 UserPromptSubmit hook 會。CI 現在會在每次 behavioral run 之前以程式碼方式把 dispatch hook 註冊到 user 層。
+3. **Response + judge timeout 太緊** — 180s response / 120s judge 會把長篇 Discovery、Pre-mortem 輸出中途切斷,judge 看到截斷字串就吐出 0/0。提升到 600s / 240s,且非 JSON 輸出時重試一次。
+同時也從 evals 10/11/12 刪掉「orchestrator 必須透過 Task 派發」這類程序性 expectation——在 `claude -p` 沒有 nested Task 介面,無法驗證,也不是我們最終在意的性質。留下的 expectation 都針對 specialist 應產出的**輸出品質**。
+**Sprint 2A — judge 韌性 + CI 上限(`f973939`):**
+PR #9 review 之後的兩個跟進修正:
+1. **Judge 修復重試保留原始 context** — `claude -p` 是無狀態的,所以修復 prompt 現在會重新帶入完整原始 `judge_prompt`(response + expectations)加上前一次的 malformed output。新的 `_judge_output_complete()` 檢查會拒絕「沒有完整 N 個 indexed expectation」的回應,避免 model 在第一次輸出無法救援時憑空捏造一份看起來合理的 verdict。
+2. **CI `behavioral-eval` job timeout 90 → 120 分鐘** — 最壞情況 = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 分鐘,先前 90 分鐘上限可能默默 cancel 整輪 run。120 分鐘給 setup + artifact upload 留 ~10 分鐘餘裕。
+**新可見的基線**(本機 run,2026-05-28):**0 / 100** `at-risk`、**13 / 33** expectation 通過、**6 critical + 14 warning** 失敗。彙總分數並沒有退步,退的是**可見**分數——四個原本貢獻 0/0 的 eval 現在開始產出真實 signal。這 6 個 critical 失敗就是 Stage 2 明確的待修清單:三層 JTBD(functional / emotional / social)、B2B 組織層 Jobs、B2B buyer vs user persona 分離、Discovery scope 守備、pre-mortem leading-indicator 紀律。逐項細節見 [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md)。
+**Harness 改進住在 `evals/` 與 `.github/workflows/`,不會發到 npm。** 版本不需要再往 v1.2.9 之上 bump(v1.2.9 已經包含 user-level hook 與 evals 10/11/12 的 scope 調整)。
+**5 個 i18n 語系同步**(zh-TW、zh-CN、ja、es、ko)。
 ---
 ## 🧪 開發與評測
@@ -486,7 +513,7 @@ Claude Code 會自動：
 **本地（免費，推薦）**：用 `claude` CLI 搭配你的 Claude Pro/Max 訂閱（先 `claude login` 一次）跑這些 script。不需要 API key、沒有額外成本。整套 eval 系統就是設計來在每次發版前本地跑一遍。
-**CI（選用，付費）**：`.github/workflows/eval-gate.yml` 會在每個 PR 與每次 push 到 `main`（含 `package.json` 變動）時跑這兩套，把分數寫進 workflow 的 Job Summary。**不擋 merge、不擋 publish** — 看到結果後由維護者決定要不要調整。CI 需要 `ANTHROPIC_API_KEY` secret（GitHub Actions 在 headless 容器無法走 OAuth）；沒設 secret 時 eval job **會乾淨地 skip**（灰色 ⏭️），不會出現誤導的紅叉。
+**CI（選用，不額外計費）**：`.github/workflows/eval-gate.yml` 會在每個 PR 與每次 push 到 `main`（含 `package.json` 變動）時跑這兩套，把分數寫進 workflow 的 Job Summary。**不擋 merge、不擋 publish** — 看到結果後由維護者決定要不要調整。CI 同樣走你的 Claude Pro/Max 訂閱（不需 API key、沒有按 token 計費的成本）：一次性設定為本機 `claude setup-token` 產生長期 token，把它加進 repo secret `CLAUDE_CODE_OAUTH_TOKEN`。沒設 secret 時 eval job **會乾淨地 skip**（灰色 ⏭️），不會出現誤導的紅叉。
 ### 本地執行
@@ -507,7 +534,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # 只報告，不 exit 1
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-本地預設 `--runs 3`（多數決可吸收 LLM 變異性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），沒有按 token 計費的成本。CI 用 `--runs 1` 並需要 `ANTHROPIC_API_KEY` secret。
+本地預設 `--runs 3`（多數決可吸收 LLM 變異性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），沒有按 token 計費的成本。CI 用 `--runs 1`，靠同一個訂閱透過 `CLAUDE_CODE_OAUTH_TOKEN` secret 認證（用 `claude setup-token` 一次性產生）。
 ### Severity 與計分

package/SKILL.md CHANGED

@@ -212,6 +212,20 @@ Task(
 **Genuine false-positive exception**: if the prompt has no real connection to a specialist's scope (e.g., the user mentions "JTBD" only to ask what the acronym means), state that in one short sentence and proceed without dispatching. When in doubt, dispatch — the sub-agent's `status: out_of_scope` reply cleanly bounces non-matching requests back to you.
+### Reference fallback when Task dispatch is unavailable
+Some environments cannot dispatch sub-agents (notably `claude -p` headless runs, some MCP harnesses, and certain CI eval contexts). In those environments the `Task` tool is absent or inert, so the dispatch above will silently inline-collapse. To prevent content collapse, **before producing inline output for any matched trigger row, you MUST read the corresponding reference files and treat their Hard Gates as your own**:
+| Specialist (if dispatch fails / unavailable) | Reference files to read FIRST, then satisfy Hard Gates inline |
+|---|---|
+| `discovery-specialist` | `references/02a-persona.md` (Persona structure + B2B Buyer/User Hard Gate + B2B Prioritization vocabulary) AND `references/02b-jtbd.md` (3-layer JTBD + B2B Org-Level Jobs Hard Gates) AND `references/rules-quality-review.md` (✅/❌ marker format + ≥1 ❌ Hard Gate). Add `references/02c-ost-journey.md` if the request includes OST or Journey Map. |
+| `strategy-critic` | `references/01-strategy.md` (Rumelt diagnosis + three-questions critique format) AND `references/rules-quality-review.md` |
+| `pre-mortem-runner` | `references/04-develop.md` (Pre-mortem section — 15+ scenarios across 5 categories + leading-indicator format) AND `references/rules-quality-review.md` |
+**Quality self-review is always required.** Whenever the user prompt asks for a quality self-review, checklist, or step-end critique — or whenever you are about to emit step-end output of any kind — you MUST have read `references/rules-quality-review.md` and follow its exact `✅`/`❌` marker format with at least one `❌` on a substantive content gap. This is non-negotiable regardless of whether dispatch was attempted or whether the fallback path was used.
+This is **not** a license to skip dispatch when it IS available. The order is: (1) attempt dispatch; (2) if the Task tool is unavailable or the call cannot complete, read the listed references and produce specialist-grade output inline; (3) cite that you used the inline fallback in one short note at the end ("Inline fallback used — Task dispatch unavailable in this environment."). The references above embed the same Hard Gates the specialist would have enforced, so following them faithfully closes the quality gap.
 Full per-trigger invocation templates: `references/rules-subagent-dispatch.md`. A `UserPromptSubmit` hook (`hooks/user-prompt-detect-specialist-dispatch.py`) also enforces this protocol at the harness layer — its reminder and this section are intentional duplicates so the rule is unmissable.
 ---
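The fallback order added in this hunk (attempt dispatch, otherwise read the listed references and work inline, then add a one-line note) can be sketched roughly as below. This is an illustration only, not code from the package: `FALLBACK_REFERENCES`, `RunPlan`, `plan_specialist_run`, and the `task_tool_available` flag are hypothetical names. Only the specialist names and reference paths are taken from the table above.

```python
from dataclasses import dataclass, field

QUALITY_REVIEW = "references/rules-quality-review.md"

# Paths copied from the fallback table in the diff; the dict itself is hypothetical.
FALLBACK_REFERENCES = {
    "discovery-specialist": [
        "references/02a-persona.md",
        "references/02b-jtbd.md",
        QUALITY_REVIEW,
    ],
    "strategy-critic": ["references/01-strategy.md", QUALITY_REVIEW],
    "pre-mortem-runner": ["references/04-develop.md", QUALITY_REVIEW],
}


@dataclass
class RunPlan:
    mode: str  # "dispatch" or "inline-fallback"
    references_to_read: list = field(default_factory=list)
    needs_fallback_note: bool = False


def plan_specialist_run(specialist, task_tool_available, wants_ost_or_journey=False):
    """Decide how a matched trigger row should be handled (hypothetical helper)."""
    if specialist not in FALLBACK_REFERENCES:
        raise KeyError(f"unknown specialist: {specialist}")
    if task_tool_available:
        # Step 1: dispatch stays the preferred path whenever it is available.
        # The quality-review rules are required on every path.
        return RunPlan(mode="dispatch", references_to_read=[QUALITY_REVIEW])
    # Step 2: no dispatch, so the references carrying the Hard Gates are read first.
    refs = list(FALLBACK_REFERENCES[specialist])
    if specialist == "discovery-specialist" and wants_ost_or_journey:
        refs.append("references/02c-ost-journey.md")
    # Step 3: the caller appends the short "inline fallback used" note at the end.
    return RunPlan(mode="inline-fallback", references_to_read=refs, needs_fallback_note=True)
```

For example, `plan_specialist_run("discovery-specialist", task_tool_available=False)` returns a plan in `inline-fallback` mode that lists the three Discovery references and flags that the closing note is needed.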

package/i18n/es/references/02a-persona.md CHANGED

@@ -1,5 +1,20 @@
 # Etapa 1: Descubrimiento — Construyendo Personas
+### 🚫 Alcance de Salida de Descubrimiento (Hard Gate)
+Cuando el orchestrator recibe el pedido de ejecutar trabajo de Descubrimiento (Persona, JTBD, OST, Journey Map, Descubrimiento Continuo), la salida debe **mantenerse dentro del alcance de Descubrimiento**. Descubrimiento responde "quién es el usuario" y "qué necesidad insatisfecha intenta cubrir" — nada más. Los siguientes artefactos de etapas downstream **NO deben aparecer** en un entregable de Descubrimiento, incluso si se siente natural mencionarlos:
+- **Artefactos de etapa Define**: declaración de positioning, preguntas HMW (How Might We), matrices de pain points que doblan como prompts de solución
+- **Artefactos de etapa Develop**: borradores de PR-FAQ, escenarios de pre-mortem, tablas RICE, definición de scope de MVP, secciones de PRD, listas de features
+- **Artefactos de etapa Deliver**: definición de métrica North Star, criterios de PMF, plan GTM, bloques de business-model canvas, tablas de product spec
+- **Artefactos de etapa Strategy**: Strategy Blocks, diagnosis / guiding-policy / coherent-action de Rumelt, descomposición de DHM Model, escalas OKR
+Si los hallazgos de Descubrimiento sugieren fuertemente un artefacto downstream (ej. el análisis JTBD revela un ángulo de positioning claro), regístralo como una **open question o next-step pointer de una línea** al final — pero **NO produzcas el artefacto en sí**. La siguiente etapa tiene su propio step dedicado.
+Ejemplo no aceptable: terminar un análisis JTBD con una tabla RICE poblada, una lista de scope de MVP, o un párrafo "Recommended Positioning" — incluso si todas las otras sub-secciones de Descubrimiento están correctas, esta salida FALLA este Hard Gate.
+---
 ## Hábitos de Descubrimiento Continuo (Teresa Torres)
 Construye un hábito clave: **Habla con al menos un usuario objetivo cada semana.** El descubrimiento no es un ritual único — es un sistema continuo.
@@ -10,6 +25,17 @@ Construye un hábito clave: **Habla con al menos un usuario objetivo cada semana
 Las Personas no se segmentan por edad y género, sino por **propósito / tarea / motivación** para distinguir diferentes tipos de usuarios.
+### 🏢 Hard Gate B2B — Persona Buyer ≠ Persona User
+Para cualquier producto B2B (o B2B2C), el **Buyer** (firma el contrato, controla el presupuesto, asume riesgo de vendor) y el **User diario** (toca el producto todos los días) son casi siempre roles distintos con **objetivos, pain points y criterios de decisión diferentes**. Tratarlos como una sola Persona colapsa dos Jobs distintos en un arquetipo borroso y el análisis resultante no puede impulsar decisiones de producto.
+Regla del Hard Gate:
+- Producir **dos bloques de Persona separados** etiquetados `Buyer` y `User` cuando el producto es B2B y los dos roles son distintos (suposición por defecto en B2B).
+- Si son la misma persona (raro — usualmente herramientas fundador-led o B2B de un solo dueño), explicá en una oración por qué el buyer también es el user diario en este escenario específico.
+- Cross-link entre las dos Personas: notá dónde el criterio de evaluación del Buyer depende de lo que el User realmente hace a diario (ej. "el criterio de audit-readiness del Buyer depende de que el User complete el formulario el mismo día y no en lote").
+Ejemplo no aceptable: producir una sola Persona ("HR Manager") que fusiona aprobar presupuesto Y completar formularios diarios — dos Jobs distintos forzados en un arquetipo borroso. Esa salida FALLA este Hard Gate.
 ```
 | Campo | Persona 1: [Apodo] | Persona 2: [Apodo] | Persona 3: [Apodo] |
 |---|---|---|---|
@@ -24,6 +50,22 @@ Las Personas no se segmentan por edad y género, sino por **propósito / tarea /
 Explica la lógica de segmentación; verifica MECE (mutuamente excluyente, colectivamente exhaustivo); identifica el TA primario y secundario.
+### 🎯 Reasoning de Priorización de Persona (Hard Gate)
+Decir solo "identificar TA primario" sin un reasoning explícito falla este Hard Gate. La declaración de priorización debe nombrar una Persona como primaria Y explicar por qué **en términos específicos a la dinámica go-to-market del producto** — no claims genéricos de "frecuencia de uso".
+Para **productos B2B con múltiples user personas**, el reasoning DEBE referenciar **al menos una** de estas dinámicas específicas B2B por nombre (usando estos términos o equivalentes claramente análogos):
+- **Champion vs Buyer** — quién aboga internamente por la adopción versus quién firma el contrato; la adopción champion-led suele ganar la priorización B2B incluso cuando el buyer es la persona "más senior"
+- **Adoption multiplier** — quién, al adoptar, desbloquea la adopción para el resto de la org (ej. el uso diario del HR Specialist siembra el system-of-record del que otras personas dependen después)
+- **Switching-trigger ownership** — qué persona siente el dolor que justifica cambiar de la herramienta incumbente; quien posee el switching trigger es el candidato a priorización incluso si no es el usuario más pesado
+- **Budget authority** — quién controla la línea de presupuesto; relevante cuando buyer ≠ user y los criterios del buyer dominan la decisión inicial del deal
+- **Audit / compliance pressure ownership** — el rol de quién está en juego cuando aparecen hallazgos de auditoría; las personas presionadas por compliance suelen dominar la priorización en segmentos B2B regulados
+Un reasoning puro de "Persona X la usa más" o "Persona Y tiene más usuarios" FALLA este Hard Gate para productos B2B. La frecuencia es necesaria pero nunca suficiente — el switching B2B es impulsado por presión organizacional, no por tasas de uso individual.
+Para **productos B2C**, el reasoning debe referenciar al menos uno de: switching-trigger ownership, diferencial de severidad JTBD, network-effect seeding, o diferencial de willingness-to-pay. El reasoning puramente por frecuencia también falla para B2C.
 ### 📝 Lista de Verificación de Calidad de Persona
 - ✅ ¿La segmentación está basada en "propósito/tarea/motivación" en lugar de datos demográficos?
 - ✅ ¿Las Personas son MECE (mutuamente excluyentes y colectivamente exhaustivas del mercado objetivo)?
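The Persona prioritization Hard Gate added in this hunk amounts to a vocabulary check: the reasoning must name at least one go-to-market dynamic, and frequency alone fails. A rough sketch of that check follows, assuming plain substring matching. The function names and the shortened keyword lists are hypothetical, and literal matching would miss the "clearly analogous equivalents" that the gate also accepts.

```python
# Shortened keywords for the dynamics named in the diff (hypothetical matching rule).
B2B_DYNAMICS = (
    "champion",
    "adoption multiplier",
    "switching trigger",
    "budget authority",
    "compliance pressure",
)
B2C_DYNAMICS = (
    "switching trigger",
    "severity differential",
    "network effect",
    "willingness to pay",
)


def named_dynamics(reasoning, product_type):
    """Return the dynamics that the reasoning text mentions by name."""
    vocabulary = B2B_DYNAMICS if product_type == "b2b" else B2C_DYNAMICS
    # Treat "switching-trigger" and "switching trigger" as the same term.
    text = reasoning.lower().replace("-", " ")
    return [term for term in vocabulary if term in text]


def passes_prioritization_gate(reasoning, product_type):
    """Frequency-only reasoning names no dynamic and therefore fails."""
    return bool(named_dynamics(reasoning, product_type))
```

A statement such as "Persona X uses it the most" names no dynamic and fails, while one that identifies who owns the switching trigger passes.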

package/i18n/es/references/02b-jtbd.md CHANGED

@@ -4,6 +4,10 @@
 > "La unidad de análisis no es el consumidor, sino el trabajo que el consumidor está tratando de realizar." — Clayton Christensen
+**Cobertura JTBD de Tres Capas (Hard Gate — las tres capas requeridas):**
+Cada análisis JTBD DEBE hacer aflorar **las tres capas explícitamente**: **Funcional** (la tarea que se está completando), **Emocional** (cómo el usuario quiere sentirse durante/después), y **Social** (cómo el usuario quiere ser percibido). Producir solo la capa Funcional es el fallo más común en JTBD — los Jobs Emocionales y Sociales suelen ser los verdaderos disparadores de switching, especialmente en B2B. Si una Persona dada genuinamente no tiene un Job Emocional o Social significativo para el producto, decilo explícitamente con una oración de reasoning en lugar de omitir silenciosamente la fila.
 **Forma Canónica JTBD (Hard Gate — se requiere estructura de tres cláusulas):**
 Cada declaración JTBD (Primary, Funcional, Emocional, Social — cada capa) DEBE escribirse como una oración completa de tres cláusulas en la forma canónica. Las tres cláusulas son obligatorias:
@@ -68,11 +72,13 @@ Claude debe autoevaluar después de producir el output JTBD (cada ítem debe mar
 ---
-### 🏢 Requisitos de Profundización para Productos B2B
+### 🏢 Requisitos de Profundización para Productos B2B (Hard Gate)
-**Productos B2B (incluyendo B2B2C) deben completar el siguiente análisis:**
+**Hard Gate — para cualquier producto B2B (o B2B2C), los siguientes tres sub-análisis son TODOS obligatorios. Saltarse cualquiera es una contract failure, sin importar si el usuario lo pidió explícitamente.** Si el tipo de producto es ambiguo, hacé una pregunta de clarificación; no asumas silenciosamente B2C.
-#### Análisis de Jobs a Nivel Organizacional (Obligatorio — cubrir al menos 2 niveles)
+#### Análisis de Jobs a Nivel Organizacional (Hard Gate — cubrir al menos 2 niveles)
+Un análisis JTBD B2B que se queda puramente al nivel de usuario individual FALLA este gate. Los Jobs a nivel organizacional (auditoría de cumplimiento, flujos de aprobación cross-departamentales, control de costos, alineación de políticas de headcount, integridad de pista de auditoría) son necesidades que existen más allá de la tarea diaria de cualquier usuario individual y rutinariamente dominan las decisiones de switching B2B. La tabla de abajo DEBE producirse y al menos 2 de los 3 niveles DEBEN contener Jobs específicos de B2B (no enunciados genéricos de productividad).
 | Nivel | Descripción | Ejemplos |
 |-------|-------------|----------|
@@ -80,12 +86,13 @@ Claude debe autoevaluar después de producir el output JTBD (cada ítem debe mar
 | **Job Operacional** | Necesidades de coordinación a nivel proceso/gerente de departamento | Gestión de flujo de aprobaciones, sincronización de información entre equipos |
 | **Job de Tarea** | Necesidades operativas diarias de usuarios individuales | Llenar formularios, verificar estados, exportar reportes |
-#### Análisis Comprador vs. Usuario (Obligatorio)
+#### Análisis Comprador (Buyer) vs. Usuario (User) (Hard Gate)
+El comprador de un producto B2B (firma el contrato, controla el presupuesto) y el usuario diario (toca el producto todos los días) son casi siempre dos roles, **correspondientes a Jobs diferentes**. Tratarlos como una sola Persona es el fallo más común en Descubrimiento B2B. Regla del Hard Gate:
-Si el comprador y el usuario son personas diferentes, analiza sus JTBD por separado:
-- **Job del Comprador**: Jobs que influyen la decisión de compra (justificación de ROI, reducción de riesgos, requisitos de cumplimiento)
-- **Job del Usuario**: Jobs que necesitan realizarse durante operaciones diarias (mejoras de eficiencia, reducción de errores)
-- Si son la misma persona, explica "por qué el tomador de decisiones es también el usuario en este escenario"
+- Si buyer ≠ user (suposición por defecto en B2B), producir **dos bloques separados de Persona+JTBD**: uno para el Buyer (justificación de ROI, reducción de riesgo, compliance, consolidación de vendors, audit-readiness), y uno para el User (eficiencia, reducción de errores, contexto de uso diario). Y cross-link: notá dónde el Job del Buyer depende del Job del User (ej. "el Job de compliance del Buyer depende de que el User realmente complete el reporte cada ciclo, no de que lo haga en batch al final del mes").
+- Si buyer = user (raro — usualmente herramientas fundador-led), explicá en una oración por qué en este escenario específico el tomador de decisiones también es el usuario diario — no lo asumas silenciosamente.
+- Ejemplo no aceptable: producir una sola Persona ("HR Manager") que fusiona la autoridad de aprobar presupuesto Y el llenado diario de formularios. Eso colapsa dos Jobs distintos en un rol borroso y el análisis no puede impulsar decisiones de producto.
 #### Cinco Preguntas de Profundización — Versión Mejorada B2B
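Taken together, the JTBD Hard Gates in this hunk read like a small validation routine: all three layers present, at least two organizational levels covered for B2B, and separate Buyer and User blocks unless a one-sentence reason is given. The sketch below is one way to express that, under stated assumptions: `JtbdAnalysis` and `gate_failures` are hypothetical names, and the level keys are left free-form because the diff does not show the full level table.

```python
from dataclasses import dataclass, field

REQUIRED_LAYERS = ("functional", "emotional", "social")


@dataclass
class JtbdAnalysis:
    product_type: str  # "b2b" or "b2c"
    # layer name -> job statement, or a one-sentence reason the layer does not apply
    layers: dict = field(default_factory=dict)
    # organizational level name -> list of B2B-specific jobs at that level
    org_level_jobs: dict = field(default_factory=dict)
    # labels of the Persona+JTBD blocks produced, e.g. ["Buyer", "User"]
    persona_blocks: list = field(default_factory=list)
    # filled in only when buyer and daily user are the same person
    same_person_reason: str = ""


def gate_failures(analysis):
    """List every Hard Gate the analysis misses (empty list means all pass)."""
    failures = []
    missing = [name for name in REQUIRED_LAYERS
               if not analysis.layers.get(name, "").strip()]
    if missing:
        failures.append("missing JTBD layers: " + ", ".join(missing))
    if analysis.product_type == "b2b":
        covered = [level for level, jobs in analysis.org_level_jobs.items() if jobs]
        if len(covered) < 2:
            failures.append("organizational-level jobs cover fewer than 2 levels")
        has_both = {"Buyer", "User"} <= set(analysis.persona_blocks)
        if not has_both and not analysis.same_person_reason.strip():
            failures.append("no separate Buyer/User blocks and no stated reason")
    return failures
```

A single merged "HR Manager" persona with only a Functional job would fail all three checks at once, which matches the unacceptable example in the diff.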

package/i18n/ja/references/02a-persona.md CHANGED

@@ -1,5 +1,20 @@
 # ステージ1：ディスカバリー — ペルソナ構築
+### 🚫 ディスカバリー出力スコープ(Hard Gate)
+orchestrator がディスカバリー作業(ペルソナ、JTBD、OST、Journey Map、Continuous Discovery)を実行するよう要求されたとき、出力は**ディスカバリースコープ内に留まる**必要があります。ディスカバリーは「ユーザーは誰か」「彼らが満たそうとしている未充足ニーズは何か」に答える — それ以外は対象外です。次の下流ステージの成果物は、たとえ自然に思えても**ディスカバリー成果物に含めてはなりません**:
+- **Define ステージ成果物**:ポジショニングステートメント、HMW(How Might We)質問、解法プロンプトを兼ねるペインポイントマトリクス
+- **Develop ステージ成果物**:PR-FAQ ドラフト、pre-mortem シナリオ、RICE テーブル、MVP スコープ定義、PRD セクション、機能リスト
+- **Deliver ステージ成果物**:North Star メトリック定義、PMF 基準、GTM 計画、ビジネスモデル要素、製品仕様表
+- **Strategy ステージ成果物**:Strategy Blocks、Rumelt の diagnosis / guiding-policy / coherent-action、DHM Model 分解、OKR ハイアラキー
+ディスカバリーの発見が下流の成果物を強く示唆する場合(例:JTBD 分析から明確なポジショニング角度が浮上)、文末に**一行の open question または next-step pointer** として記録 — 自分でその成果物を生成しないでください。次のステージにはそれ専用のステップがあります。
+不合格例:JTBD 分析の末尾に埋められた RICE テーブル、MVP スコープリスト、または「Recommended Positioning」段落を付ける — 他のディスカバリーサブセクションが正しくても、この出力は Hard Gate に FAIL します。
+---
 ## Continuous Discovery Habits（Teresa Torres）
 1つの重要な習慣を構築してください：**毎週少なくとも1人のターゲットユーザーと話す。** ディスカバリーは一回限りの儀式ではなく、継続的なシステムです。
@@ -10,6 +25,17 @@
 ペルソナは年齢や性別で分類するのではなく、**目的 / タスク / モチベーション**で異なるタイプのユーザーを区別します。
+### 🏢 B2B Hard Gate — Buyer ペルソナ ≠ User ペルソナ
+すべての B2B(または B2B2C)製品において、**Buyer**(契約締結、予算管理、ベンダーリスク保有)と**日常 User**(毎日製品に触れる)はほぼ常に**目標、ペインポイント、意思決定基準が異なる**別々の役割です。これらを 1 つのペルソナにまとめると、2 つの異なる Job を 1 つの曖昧なアーキタイプに押し込めることになり、分析結果は製品意思決定を駆動できません。
+Hard Gate ルール:
+- B2B では **`Buyer` と `User` というラベルの 2 つの別個のペルソナブロック**を生成するのがデフォルト(2 つの役割が明確に異なるとき = B2B のデフォルト想定)。
+- 同一人物の場合(例外的 — 通常は創業者主導ツールまたは個人事業主 B2B)、「なぜこのシナリオで Buyer が日常 User でもあるのか」を 1 文で明示してください。
+- 2 つのペルソナ間でクロスリンク:Buyer の評価基準が User の日常行動にどこで依存するかをメモ(例:「Buyer の監査準備基準は、User が当日にフォームを記入するかどうかに依存 — 後追いでまとめて記入するのではなく」)。
+不合格例:1 つのペルソナ(「HR マネージャー」)のみを生成し、「予算承認」と「毎日の休暇申請フォーム記入」という 2 つの異なる Job を 1 つの曖昧なアーキタイプに押し込める — この出力は Hard Gate に FAIL します。
 ```
 | フィールド | ペルソナ1：[ニックネーム] | ペルソナ2：[ニックネーム] | ペルソナ3：[ニックネーム] |
 |---|---|---|---|
@@ -24,6 +50,22 @@
 セグメンテーションロジックを説明し、MECE（相互排他的で網羅的）であるか確認し、プライマリTAとセカンダリTAを特定してください。
+### 🎯 ペルソナ優先順位付け reasoning(Hard Gate)
+「プライマリ TA を特定する」とだけ言って具体的な reasoning がないのは Hard Gate に不合格です。優先順位付けのステートメントは**1 つの**ペルソナをプライマリとして指定し、なぜそうなのかを**その製品の go-to-market ダイナミクスに固有の言語**で説明する必要があります — 一般的な「使用頻度が高い」というような理由ではなく。
+複数の user ペルソナを持つ **B2B 製品**では、reasoning は次の B2B 固有ダイナミクスのうち**少なくとも 1 つ**を名前で参照する必要があります(これらの用語または明らかに等価な概念を使用):
+- **Champion vs Buyer** — 組織内で誰が採用を擁護するか vs 誰が契約に署名するか;champion-led adoption は通常 B2B 優先順位付けで勝つ、buyer が「より上位」のペルソナであっても
+- **Adoption multiplier** — 誰の採用が組織全体の展開を解除するか(例:HR Specialist の毎日の使用は他のペルソナが後で依存する system-of-record の種を蒔く)
+- **Switching-trigger ownership** — どのペルソナが既存ツールからの切替えを正当化する痛みを感じているか;switching trigger を所有するペルソナは、最重使用者でなくても優先順位付け候補
+- **Budget authority** — 予算項目を誰が管理するか;buyer ≠ user の場合に関連、buyer の評価基準が初期取引決定を支配
+- **Audit / compliance pressure ownership** — 監査所見が発生した際に誰の役職がリスクにさらされるか;規制対象 B2B セグメントでは、コンプライアンスのプレッシャーを背負うペルソナが優先順位付けを支配することが多い
+純粋な「ペルソナ X は使用頻度が高い」または「ペルソナ Y はユーザー数が多い」という reasoning は B2B 製品にとってこの Hard Gate に FAIL します。頻度は必要条件であり、決して十分ではありません — B2B 切替えは組織レベルのプレッシャーによって駆動され、個別の使用率ではありません。
+**B2C 製品**の場合、reasoning は少なくとも 1 つを参照する必要があります:switching-trigger ownership、JTBD severity differential、network-effect seeding、または willingness-to-pay differential。純頻度 reasoning は B2C にも FAIL します。
 ### 📝 ペルソナ品質チェックリスト
 - ✅ セグメンテーションは「目的/タスク/モチベーション」に基づいているか？（デモグラフィックスではなく）
 - ✅ ペルソナはMECE（相互排他的でターゲット市場を網羅）か？

package/i18n/ja/references/02b-jtbd.md CHANGED

@@ -4,6 +4,10 @@
 > "分析の単位は消費者ではなく、消費者が達成しようとしているジョブである。" — Clayton Christensen
+**JTBD 三層カバレッジ(Hard Gate — 三層すべて必須):**
+すべての JTBD 分析は**三層を明示的に網羅**する必要があります:**Functional**(完了するタスク)、**Emotional**(過程中/完了後にどう感じたいか)、**Social**(他者にどう見られたいか)。Functional 層だけを生成するのは JTBD で最も一般的な失敗です — Emotional と Social Job が実際の切替えトリガーであることが多く、B2B では特にそうです。もしあるペルソナが製品に対して本当に Emotional または Social Job を持たない場合は、暗黙の省略ではなく一文で理由を明示してください。
 **JTBD 標準構文（Hard Gate — 三節構造を強制）：**
 すべての JTBD 文（Primary、Functional、Emotional、Social のいずれの層も）は、**「〜のとき、〜したい、それによって〜」** の三節構造で書く必要があります。三つの節のいずれが欠けても不可：
@@ -68,11 +72,13 @@ ClaudeはJTBD出力後にセルフチェックを行う必要があります（
 ---
-### 🏢 B2Bプロダクト深掘り要件
+### 🏢 B2Bプロダクト深掘り要件(Hard Gate)
-**B2Bプロダクト（B2B2Cを含む）は以下の分析を完了する必要があります：**
+**Hard Gate — すべての B2B(B2B2C 含む)プロダクトについて、以下の 3 つの下位分析はすべて必須。ユーザーが明示的に要求していなくても、いずれかを省略すると contract failure。** プロダクトタイプが不明確な場合は、まず 1 つ確認質問をしてください — 暗黙のうちに B2C を仮定しないこと。
-#### 組織レベルのジョブ分析（必須 — 少なくとも2レベルをカバー）
+#### 組織レベルのジョブ分析(Hard Gate — 少なくとも 2 レベルをカバー)
+B2B JTBD 分析が個別ユーザーレベルのみに留まる場合、このゲートに FAIL します。組織レベルの Job(コンプライアンス監査、部門横断承認ワークフロー、コスト管理、人材ポリシーアラインメント、監査証跡の完全性)は、いかなる個別ユーザーの日常タスクをも超えたニーズであり、B2B 切替え決定で支配的になることが日常茶飯事です。以下の表は必ず生成し、3 レベルのうち少なくとも 2 つに B2B 固有の Job(漠然とした生産性ステートメントではない)を含める必要があります。
 | レベル | 説明 | 例 |
 |-------|-------------|----------|
@@ -80,12 +86,13 @@ ClaudeはJTBD出力後にセルフチェックを行う必要があります（
 | **運用的ジョブ** | プロセス/部門マネージャーレベルの調整ニーズ | 承認ワークフロー管理、チーム間情報同期 |
 | **タスクジョブ** | 個々のユーザーの日常業務ニーズ | フォーム記入、ステータス確認、レポート出力 |
-#### 購入者 vs. ユーザー分析（必須）
+#### 購入者(Buyer)vs ユーザー(User)分析(Hard Gate)
+B2B プロダクトの購入者(契約締結、予算管理)と日常ユーザー(毎日プロダクトに触れる)は、ほぼ常に 2 つの役割、**異なる Job に対応**しています。これらを単一ペルソナとして扱うことは B2B ディスカバリーで最も一般的な失敗です。Hard Gate ルール:
-購入者とユーザーが異なる人物の場合、それぞれのJTBDを個別に分析：
-- **購入者のジョブ**：購買決定に影響するジョブ（ROI正当化、リスク低減、コンプライアンス要件）
-- **ユーザーのジョブ**：日常業務中に達成すべきジョブ（効率向上、エラー削減）
-- 同一人物の場合、「このシナリオで意思決定者が同時にユーザーである理由」を説明
+- buyer ≠ user の場合(B2B のデフォルト想定)、**2 つの別個のペルソナ+JTBD ブロック**を生成:1 つは Buyer 用(ROI 正当化、リスク低減、コンプライアンス、ベンダー統合、監査準備)、1 つは User 用(効率、エラー削減、日常使用コンテキスト)。さらにクロスリンク:Buyer の Job が User の Job にどこで依存するかをメモ(例:「Buyer のコンプライアンス Job は、月末まとめ記入ではなく、User が毎週本当に記入することに依存」)。
+- buyer = user の場合(例外的 — 創業者主導のツールなど)、なぜこの特定のシナリオで意思決定者が日常ユーザーでもあるのかを一文で明示 — 暗黙的に仮定しないこと。
+- 不合格例:1 つのペルソナ(「HR マネージャー」)のみを生成し、予算決定権と日常的なフォーム記入を 1 つの曖昧なアーキタイプに押し込める。これは 2 つの異なる Job を 1 つの漠然とした役割に潰し、分析が製品決定を駆動できなくなる。
 #### 深掘り5つの質問 — B2B強化版

package/i18n/ko/references/02a-persona.md CHANGED

@@ -1,5 +1,20 @@
 # 1단계: 디스커버리 — Persona 구축
+### 🚫 디스커버리 출력 범위 (Hard Gate)
+orchestrator 가 디스커버리 작업(Persona, JTBD, OST, Journey Map, Continuous Discovery)을 수행하도록 요청받을 때, 출력은 **디스커버리 범위 안에 머물러야 합니다**. 디스커버리는 "사용자가 누구인가", "그들이 충족하려는 미충족 니즈는 무엇인가"에 답합니다 — 그 외에는 다루지 않습니다. 다음 하류 단계의 산출물은 자연스러워 보이더라도 **디스커버리 결과물에 절대 등장해서는 안 됩니다**:
+- **Define 단계 산출물**: positioning statement, HMW (How Might We) 질문, 솔루션 프롬프트 역할을 겸하는 페인포인트 매트릭스
+- **Develop 단계 산출물**: PR-FAQ 초안, pre-mortem 시나리오, RICE 테이블, MVP scope 정의, PRD 섹션, 기능 목록
+- **Deliver 단계 산출물**: North Star 메트릭 정의, PMF 기준, GTM 계획, 비즈니스 모델 블록, 제품 사양 표
+- **Strategy 단계 산출물**: Strategy Blocks, Rumelt 의 diagnosis / guiding-policy / coherent-action, DHM Model 분해, OKR 계층
+디스커버리 발견이 하류 산출물을 강력히 시사하는 경우(예: JTBD 분석에서 명확한 포지셔닝 각도 부상), 끝에 **한 줄짜리 open question 또는 next-step pointer**로 기록 — 직접 그 산출물을 만들지 마세요. 다음 단계에 전용 스텝이 있습니다.
+불합격 예: JTBD 분석 끝에 채워진 RICE 테이블, MVP scope 목록, 또는 "Recommended Positioning" 단락 추가 — 다른 디스커버리 하위 섹션이 모두 올바르더라도 이 출력은 Hard Gate 에 FAIL.
+---
 ## Continuous Discovery Habits (Teresa Torres)
 하나의 핵심 습관을 구축하세요: **매주 최소 한 명의 타겟 사용자와 대화하기.** 디스커버리는 일회성 의식이 아니라 지속적인 시스템입니다.
@@ -10,6 +25,17 @@
 Persona는 나이와 성별이 아닌 **목적 / 과업 / 동기**로 구분하여 다양한 유형의 사용자를 식별합니다.
+### 🏢 B2B Hard Gate — Buyer Persona ≠ User Persona
+모든 B2B (또는 B2B2C) 제품에서, **Buyer** (계약 체결, 예산 관리, 벤더 리스크 보유) 와 **일상 User** (매일 제품 사용) 는 거의 항상 **목표, 페인포인트, 의사결정 기준이 다른** 별개의 역할입니다. 이들을 하나의 Persona 로 합치는 것은 두 개의 다른 Job 을 하나의 모호한 아키타입에 강제로 욱여넣는 것이며, 분석 결과는 제품 의사결정을 이끌어낼 수 없습니다.
+Hard Gate 규칙:
+- B2B 에서는 두 역할이 명확히 다를 때 (B2B 기본 가정) `Buyer` 와 `User` 라벨이 붙은 **두 개의 별개 Persona 블록**을 생성하는 것이 기본.
+- 동일 인물인 경우 (드문 예외 — 보통 창업자 주도 도구 또는 1인 B2B), "왜 이 시나리오에서 Buyer 가 일상 User 이기도 한지" 한 문장으로 명시.
+- 두 Persona 간 크로스링크: Buyer 의 평가 기준이 User 의 일상 행동에 어디서 의존하는지 표기 (예: "Buyer 의 감사 준비 기준은 User 가 당일에 양식을 작성하느냐에 달려 있음 — 사후 일괄 작성이 아니라").
+불합격 예: 하나의 Persona ("HR 매니저") 만 생성하여 "예산 승인" 과 "매일 휴가 양식 작성" 이라는 두 개의 다른 Job 을 하나의 모호한 아키타입에 욱여넣음 — 이 출력은 Hard Gate 에 FAIL.
 ```
 | 항목 | Persona 1: [별칭] | Persona 2: [별칭] | Persona 3: [별칭] |
 |------|---|---|---|
@@ -24,6 +50,22 @@ Persona는 나이와 성별이 아닌 **목적 / 과업 / 동기**로 구분하
 세분화 논리를 설명하고; MECE (상호 배타적이고 전체를 아우르는지) 확인; 1차 TA와 2차 TA를 식별하세요.
+### 🎯 Persona 우선순위 reasoning (Hard Gate)
+"1차 TA 를 식별한다"고만 말하고 구체적 reasoning 이 없으면 이 Hard Gate 에 불합격입니다. 우선순위 진술은 **한 개의** Persona 를 1차로 지정하고, 왜 그런지 **해당 제품의 go-to-market 다이내믹스에 특화된 언어**로 설명해야 합니다 — 일반적인 "사용 빈도가 높다" 같은 이유가 아니라.
+여러 user persona 가 있는 **B2B 제품**에서, reasoning 은 다음 B2B 특유 다이내믹스 중 **최소 하나**를 이름으로 참조해야 합니다 (이 용어 또는 명확히 동등한 개념 사용):
+- **Champion vs Buyer** — 조직 내부에서 누가 채택을 옹호하느냐 vs 누가 계약에 서명하느냐; champion-led adoption 은 B2B 우선순위에서 보통 이기며, buyer 가 "더 시니어" 한 persona 라도 그렇다
+- **Adoption multiplier** — 누구의 채택이 조직 전체 확산을 잠금 해제하는가 (예: HR Specialist 의 일일 사용은 다른 persona 가 나중에 의존하는 system-of-record 의 씨앗을 뿌림)
+- **Switching-trigger ownership** — 어떤 persona 가 기존 도구에서 전환을 정당화하는 고통을 느끼는가; switching trigger 를 소유한 persona 는 최다 사용자가 아니어도 우선순위 후보
+- **Budget authority** — 누가 예산 항목을 통제하는가; buyer ≠ user 일 때 관련, buyer 의 평가 기준이 초기 거래 결정을 지배
+- **Audit / compliance pressure ownership** — 감사 발견이 발생할 때 누구의 역할이 위험에 처하는가; 규제 B2B segment 에서는 컴플라이언스 압력을 받는 persona 가 우선순위를 지배하는 경우가 많음
+순수한 "Persona X 는 사용 빈도가 높다" 또는 "Persona Y 는 사용자 수가 많다" 식 reasoning 은 B2B 제품에 대해 이 Hard Gate 에 FAIL. 빈도는 필요조건이지만 결코 충분조건이 아닙니다 — B2B 전환은 조직 수준의 압력에 의해 구동되지, 개별 사용률에 의해 구동되지 않습니다.
+**B2C 제품**의 경우, reasoning 은 다음 중 최소 하나를 참조해야 합니다: switching-trigger ownership, JTBD severity differential, network-effect seeding, willingness-to-pay differential. 순수 빈도 reasoning 은 B2C 에도 FAIL.
 ### 📝 Persona 품질 체크리스트
 - ✅ 세분화가 인구통계가 아닌 "목적/과업/동기" 기반인가?
 - ✅ Persona가 MECE인가 (타겟 시장에서 상호 배타적이고 전체를 아우르는가)?

package/i18n/ko/references/02b-jtbd.md CHANGED

@@ -4,6 +4,10 @@
 > "분석의 단위는 소비자가 아니라, 소비자가 완수하려는 과업(Job)입니다." — Clayton Christensen
+**JTBD 3계층 커버리지(Hard Gate — 3계층 모두 필수):**
+모든 JTBD 분석은 **3계층을 명시적으로 표면화**해야 합니다: **Functional**(완수할 과업), **Emotional**(과정 중/완료 후 어떻게 느끼고 싶은지), **Social**(타인에게 어떻게 인식되고 싶은지). Functional 계층만 생성하는 것이 JTBD 에서 가장 흔한 실패입니다 — Emotional 과 Social Job 이 실제 전환 트리거인 경우가 많고, B2B 에서는 특히 그렇습니다. 만약 어떤 Persona 가 제품에 대해 정말로 의미 있는 Emotional 또는 Social Job 이 없다면, 침묵하며 행을 누락하지 말고 한 문장의 reasoning 으로 명시하세요.
 **JTBD 표준 구문(Hard Gate — 3절 구조 강제):**
 모든 JTBD 문(Primary, Functional, Emotional, Social 어떤 계층이든)은 **"~할 때, ~하고 싶다, 그래서 ~"** 3절 구조의 완전한 문장으로 작성해야 합니다. 세 절 중 어느 하나라도 빠지면 안 됩니다:
@@ -68,11 +72,13 @@ Claude는 JTBD 산출물 작성 후 자체 점검해야 합니다 (각 항목은
 ---
-### 🏢 B2B 제품 심층 분석 요구사항
+### 🏢 B2B 제품 심층 분석 요구사항 (Hard Gate)
-**B2B 제품(B2B2C 포함)은 다음 분석을 완료해야 합니다:**
+**Hard Gate — 모든 B2B (B2B2C 포함) 제품에 대해, 아래 3개 하위 분석은 모두 필수입니다. 사용자가 명시적으로 요청하지 않았더라도 어느 하나라도 빠지면 contract failure 입니다.** 제품 유형이 불명확한 경우 먼저 한 가지 명료화 질문을 하세요 — 묵시적으로 B2C 로 가정하지 마세요.
-#### 조직 수준 Job 분석 (필수 — 최소 2개 수준 포함)
+#### 조직 수준 Job 분석 (Hard Gate — 최소 2개 수준 포함)
+B2B JTBD 분석이 개별 사용자 수준에만 머무르면 이 gate 에 FAIL. 조직 수준의 Job (규정 준수 감사, 부서 간 결재 워크플로우, 비용 통제, 인력 정책 정렬, 감사 추적 완전성) 은 개별 사용자의 일상 과업을 초월하는 니즈이며, B2B 전환 결정에서 일상적으로 지배적입니다. 아래 표는 반드시 생성되어야 하며, 3개 수준 중 최소 2개는 B2B 특유의 Job (모호한 생산성 진술이 아닌) 을 포함해야 합니다.
 | 수준 | 설명 | 예시 |
 |------|------|------|
@@ -80,12 +86,13 @@ Claude는 JTBD 산출물 작성 후 자체 점검해야 합니다 (각 항목은
 | **운영적 Job** | 프로세스/부서 관리자 수준의 조율 니즈 | 결재 워크플로우 관리, 팀 간 정보 동기화 |
 | **실무적 Job** | 개별 사용자의 일상 업무 니즈 | 양식 작성, 상태 확인, 보고서 내보내기 |
-#### 구매자 vs. 사용자 분석 (필수)
+#### 구매자(Buyer) vs 사용자(User) 분석 (Hard Gate)
+B2B 제품의 구매자 (계약 체결, 예산 관리) 와 일상 사용자 (매일 제품 사용) 는 거의 항상 두 개의 역할이며, **서로 다른 Job 에 대응**합니다. 이들을 단일 Persona 로 취급하는 것은 B2B 디스커버리에서 가장 흔한 실패입니다. Hard Gate 규칙:
-구매자와 사용자가 다른 사람인 경우, JTBD를 별도로 분석하세요:
-- **구매자 Job**: 구매 결정에 영향을 미치는 Job (ROI 정당화, 리스크 감소, 규정 준수 요구사항)
-- **사용자 Job**: 일상 업무 중 완수해야 하는 Job (효율성 향상, 오류 감소)
-- 동일 인물인 경우, "왜 이 시나리오에서 의사결정자가 사용자이기도 한지" 설명
+- buyer ≠ user 인 경우 (B2B 기본 가정), **두 개의 별개 Persona+JTBD 블록**을 생성: 하나는 Buyer 용 (ROI 정당화, 리스크 감소, 컴플라이언스, 벤더 통합, 감사 준비), 하나는 User 용 (효율, 오류 감소, 일상 사용 맥락). 그리고 크로스링크: Buyer 의 Job 이 User 의 Job 에 어디서 의존하는지 표기 (예: "Buyer 의 컴플라이언스 Job 은 월말 일괄 작성이 아닌, User 가 매주 실제로 기록하느냐에 달려 있음").
+- buyer = user 인 경우 (예외적 — 보통 창업자 주도 도구), 왜 이 특정 시나리오에서 의사결정자가 일상 사용자이기도 한지 한 문장으로 명시 — 묵시적으로 가정하지 마세요.
+- 불합격 예: 하나의 Persona ("HR 매니저") 만 생성하여 예산 의사결정 권한과 일상 양식 작성을 하나의 모호한 아키타입에 욱여넣음. 이는 두 개의 다른 Job 을 하나의 흐릿한 역할로 압축하며, 분석이 제품 결정을 이끌어낼 수 없게 됩니다.
 #### 심층 5가지 질문 — B2B 강화 버전

package/i18n/zh-CN/references/02a-persona.md CHANGED

@@ -1,5 +1,20 @@
 # 阶段一：Discovery — Persona 建立
+### 🚫 Discovery 输出范畴（Hard Gate）
+当 orchestrator 被要求执行 Discovery 工作(Persona、JTBD、OST、Journey Map、Continuous Discovery)时,输出必须**停留在 Discovery 范畴内**。Discovery 回答「用户是谁」「他们想满足的未被满足需求」——仅此而已。下列下游阶段的产出绝对**不可**出现在 Discovery 交付物中,即使顺手感觉合理:
+- **Define 阶段产出**:positioning statement、HMW(How Might We)问题、可被解读为解法提示的痛点矩阵
+- **Develop 阶段产出**:PR-FAQ 草稿、pre-mortem 情境、RICE 表格、MVP scope 定义、PRD 段落、功能列表
+- **Deliver 阶段产出**:North Star 指标定义、PMF 判准、GTM 计划、商业模式区块、产品规格表
+- **Strategy 阶段产出**:Strategy Blocks、Rumelt 的 diagnosis / guiding-policy / coherent-action、DHM Model 拆解、OKR 阶层
+若 Discovery 发现明显指向某个下游产出(例如 JTBD 浮出清晰的 positioning 角度),可在文末以**一行 open question 或 next-step pointer** 记录——但**不要**自己产出该产出。下个阶段在规划流程中有专属步骤负责它。
+不合格范例:JTBD 分析结尾附上填好的 RICE 表、MVP scope 列表、或一段「Recommended Positioning」——即使前面 Discovery 子段都正确,这个输出仍 FAIL 此 Hard Gate。
+---
 ## Continuous Discovery 习惯（Teresa Torres）
 建立一个关键习惯：**每周至少接触一位目标用户**。Discovery 不是一次性仪式，而是持续系统。
@@ -10,6 +25,17 @@
 Persona 不是用年龄性别来分群，而是用「用途 / 任务 / 动机」来区分不同类型的用户。
+### 🏢 B2B Hard Gate — 买方 Persona ≠ 使用者 Persona
+对于任何 B2B(或 B2B2C)产品,**买方(Buyer)**(签合约、掌预算、扛供应商风险)与**日常使用者(User)**(每天接触产品)几乎都是**目标、痛点、决策准则完全不同**的两个角色。把他们合并成同一个 Persona 等于把两个不同的 Job 强塞进一个模糊原型,分析结果无法驱动产品决策。
+Hard Gate 规则:
+- B2B 预设要产出**两个独立的 Persona 区块**,标示为 `Buyer` 和 `User`,当两个角色明显不同(B2B 的预设假设)。
+- 若为同一个人(少数例外——通常是创办人主导的工具或独资 B2B),请用一句话明确说明「为何此场景中买方就是日常使用者」。
+- 两个 Persona 之间要交叉连结:标出买方的评估准则何处依赖使用者的日常行为(例如「买方的稽核就绪准则取决于使用者是否当天填假单,而不是事后补登」)。
+不合格范例:只产出一个 Persona(「HR 经理」),把「核定预算」和「每天填假单」这两个不同的 Job 硬塞进同一个模糊原型——这个输出 FAIL 此 Hard Gate。
 ```
 | 栏位 | Persona 1: [暱称] | Persona 2: [暱称] | Persona 3: [暱称] |
 |---|---|---|---|
@@ -24,6 +50,22 @@ Persona 不是用年龄性别来分群，而是用「用途 / 任务 / 动机」
 说明切分逻辑；检查是否 MECE（互斥且完整覆盖）；指出核心 TA 和次要 TA。
+### 🎯 Persona 优先排序 reasoning（Hard Gate）
+只说「指出核心 TA」而没有具体 reasoning 不符合此 Hard Gate。优先排序的陈述必须指出**一个** Persona 为核心,并用**该产品 go-to-market 动态的具体语言**解释为什么——而不是泛泛的「使用频率高」这类理由。
+**B2B 产品**若有多个 user persona,reasoning 必须**至少引用一个**下列 B2B 专属动态(用这些词汇或显然等价的概念):
+- **Champion vs Buyer** — 谁在组织内倡导采用 vs 谁签合约;champion-led adoption 通常在 B2B 优先序中胜出,即使 buyer 是「更资深」的 persona
+- **Adoption multiplier** — 谁的采用会解锁整个组织的扩散(例如 HR Specialist 每日使用会种下其他 persona 后续依赖的 system-of-record)
+- **Switching-trigger ownership** — 哪个 persona 感受到让组织从既有工具切换的痛;拥有 switching trigger 的 persona 即使不是最重使用者,也是优先序候选
+- **Budget authority** — 谁掌控预算项目;当 buyer ≠ user 时相关,买方的评估准则主导初始成交决策
+- **Audit / compliance pressure ownership** — 稽核发现出事时谁的角色受影响;在 regulated B2B segment 中,承受合规压力的 persona 通常主导优先序
+纯粹「Persona X 使用频率更高」或「Persona Y 用户数更多」的 reasoning 对 B2B 产品 FAIL 此 Hard Gate。频率是必要条件、非充分条件——B2B 切换由组织压力驱动,不是个别使用率。
+**B2C 产品**的 reasoning 至少引用一个:switching-trigger ownership、JTBD severity differential、network-effect seeding、willingness-to-pay differential。纯频率 reasoning 对 B2C 也 FAIL。
 ### 📝 Persona 品质自检清单
 - ✅ 切分是否基于「用途/任务/动机」而非人口统计？
 - ✅ 各 Persona 之间是否 MECE（互斥且完整覆盖目标市场）？

package/i18n/zh-CN/references/02b-jtbd.md CHANGED

@@ -4,6 +4,10 @@
 > 「分析的单位不是消费者，而是消费者试图完成的那件工作。」— Clayton Christensen
+**JTBD 三层覆盖(Hard Gate — 三层皆必填):**
+每一份 JTBD 分析都必须**明确呈现三层**:**Functional**(要完成的任务)、**Emotional**(过程中/完成后想要感受的情绪)、**Social**(希望被他人看见的形象)。只产出 Functional 层是 JTBD 最常见的失误——Emotional 与 Social Job 往往才是真正的切换触发点,B2B 尤其。如果某 Persona 真的对该产品没有明确的 Emotional 或 Social Job,请用一句话明说理由——不要静默省略整列。
 **JTBD 标准句型（Hard Gate — 强制三段式）：**
 每一条 JTBD（不论是 Primary、Functional、Emotional、Social 任何一层）都必须写成完整的「**当 ... 我想要 ... 以便 ...**」三段式语句，三个子句缺一不可：
@@ -68,11 +72,13 @@ Claude 产出 JTBD 后必须自我检查（每项必须标记 ✅ 或 ❌，❌
 ---
-### 🏢 B2B 产品专用深度要求
+### 🏢 B2B 产品专用深度要求(Hard Gate)
-**B2B 产品（含 B2B2C）必须完成以下分析：**
+**Hard Gate — 任何 B2B(含 B2B2C)产品,下列三个子分析皆为必填。漏任一个即算 contract failure,不论用户有没有明确要求。** 若产品类型不明确,请先问一个 clarification 问题;不要默默假设为 B2C。
-#### 组织层级 Job 分析（必填，至少覆盖 2 层）
+#### 组织层级 Job 分析(Hard Gate — 至少覆盖 2 层)
+B2B JTBD 分析若全停留在个人使用者层级,FAIL 此 gate。组织层级的 Job(合规稽核、跨部门审批流程、成本控制、人力政策对齐、稽核轨迹完整性)是超越任何单一使用者日常任务的需求,在 B2B 切换决策中经常占主导。下表必须产出,且 3 层中至少 2 层必须含具 B2B 特性的 Job(不是空泛的生产力陈述)。
 | 层级 | 说明 | 范例 |
 |------|------|------|
@@ -80,12 +86,13 @@ Claude 产出 JTBD 后必须自我检查（每项必须标记 ✅ 或 ❌，❌
 | **Operational Job** | 流程/部门主管的协调需求 | 审批流程管理、跨部门资讯同步 |
 | **Task Job** | 个别使用者的日常操作需求 | 填写表单、查询状态、汇出报表 |
-#### 买方（Buyer）vs 使用者（User）分析（必填）
+#### 买方(Buyer)vs 使用者(User)分析(Hard Gate)
+B2B 产品的买方(签合约、掌预算)和日常使用者(每天接触产品)几乎都是两个角色、**对应不同的 Job**。把他们当成单一 Persona 是 B2B Discovery 最常见的失误。Hard Gate 规则:
-若买方与使用者不同人，必须分别分析其 JTBD：
-- **买方 Job**：影响采购决策的工作（ROI 说明、风险降低、合规要求）
-- **使用者 Job**：日常操作时需要完成的工作（效率提升、错误减少）
-- 若同一人，说明「为何此场景中决策者即使用者」
+- 若 buyer ≠ user(B2B 预设假设),请产出**两个独立的 Persona+JTBD 区块**:一个给 Buyer(ROI 说明、风险降低、合规、供应商整并、稽核就绪),一个给 User(效率、减错、日常使用情境)。并交叉连结:标出买方的 Job 何处依赖使用者的 Job(例如「买方的合规 Job 取决于使用者真的每周填报,而不是月底补登」)。
+- 若 buyer = user(少数例外,例如创办人主导工具),请用一句话明确说明为何此特定场景下决策者即日常使用者——不要默默假设。
+- 不合格范例:只产出一个 Persona(「HR 经理」),把预算决策权与日常填表硬塞进同一个模糊原型。这把两个不同的 Job 压成一个含糊角色,分析无法驱动产品决策。
 #### 深挖五问 B2B 强化版

package/i18n/zh-TW/references/02a-persona.md CHANGED

@@ -1,5 +1,20 @@
 # 階段一：Discovery — Persona 建立
+### 🚫 Discovery 輸出範疇（Hard Gate）
+當 orchestrator 被要求執行 Discovery 工作（Persona、JTBD、OST、Journey Map、Continuous Discovery）時,輸出必須**停留在 Discovery 範疇內**。Discovery 回答「用戶是誰」「他們想滿足的未被滿足需求」——僅此而已。下列下游階段的產出絕對**不可**出現在 Discovery 交付物中,即使順手感覺合理:
+- **Define 階段產出**:positioning statement、HMW(How Might We)問題、可被解讀為解法提示的痛點矩陣
+- **Develop 階段產出**:PR-FAQ 草稿、pre-mortem 情境、RICE 表格、MVP scope 定義、PRD 段落、功能列表
+- **Deliver 階段產出**:North Star 指標定義、PMF 判準、GTM 計畫、商業模式區塊、產品規格表
+- **Strategy 階段產出**:Strategy Blocks、Rumelt 的 diagnosis / guiding-policy / coherent-action、DHM Model 拆解、OKR 階層
+若 Discovery 發現明顯指向某個下游產出(例如 JTBD 浮出清晰的 positioning 角度),可在文末以**一行 open question 或 next-step pointer** 紀錄——但**不要**自己產出該產出。下個階段在規劃流程中有專屬步驟負責它。
+不合格範例:JTBD 分析結尾附上填好的 RICE 表、MVP scope 列表、或一段「Recommended Positioning」——即使前面 Discovery 子段都正確,這個輸出仍 FAIL 此 Hard Gate。
+---
 ## Continuous Discovery 習慣（Teresa Torres）
 建立一個關鍵習慣：**每週至少接觸一位目標用戶**。Discovery 不是一次性儀式，而是持續系統。
@@ -10,6 +25,17 @@
 Persona 不是用年齡性別來分群，而是用「用途 / 任務 / 動機」來區分不同類型的用戶。
+### 🏢 B2B Hard Gate — 買方 Persona ≠ 使用者 Persona
+對於任何 B2B（或 B2B2C）產品,**買方(Buyer)**(簽合約、掌預算、扛供應商風險)與**日常使用者(User)**(每天接觸產品)幾乎都是**目標、痛點、決策準則完全不同**的兩個角色。把他們合併成同一個 Persona 等於把兩個不同的 Job 強塞進一個模糊原型,分析結果無法驅動產品決策。
+Hard Gate 規則:
+- B2B 預設要產出**兩個獨立的 Persona 區塊**,標示為 `Buyer` 和 `User`,當兩個角色明顯不同(B2B 的預設假設)。
+- 若為同一個人(少數例外——通常是創辦人主導的工具或獨資 B2B),請用一句話明確說明「為何此場景中買方就是日常使用者」。
+- 兩個 Persona 之間要交叉連結:標出買方的評估準則何處依賴使用者的日常行為(例如「買方的稽核就緒準則取決於使用者是否當天填假單,而不是事後補登」)。
+不合格範例:只產出一個 Persona(「HR 經理」),把「核定預算」和「每天填假單」這兩個不同的 Job 硬塞進同一個模糊原型——這個輸出 FAIL 此 Hard Gate。
 ```
 | 欄位 | Persona 1: [暱稱] | Persona 2: [暱稱] | Persona 3: [暱稱] |
 |---|---|---|---|
@@ -24,6 +50,22 @@ Persona 不是用年齡性別來分群，而是用「用途 / 任務 / 動機」
 說明切分邏輯；檢查是否 MECE（互斥且完整覆蓋）；指出核心 TA 和次要 TA。
+### 🎯 Persona 優先排序 reasoning（Hard Gate）
+只說「指出核心 TA」而沒有具體 reasoning 不符合此 Hard Gate。優先排序的陳述必須指出**一個** Persona 為核心,並用**該產品 go-to-market 動態的具體語言**解釋為什麼——而不是泛泛的「使用頻率高」這類理由。
+**B2B 產品**若有多個 user persona,reasoning 必須**至少引用一個**下列 B2B 專屬動態(用這些詞彙或顯然等價的概念):
+- **Champion vs Buyer** — 誰在組織內倡導採用 vs 誰簽合約;champion-led adoption 通常在 B2B 優先序中勝出,即使 buyer 是「更資深」的 persona
+- **Adoption multiplier** — 誰的採用會解鎖整個組織的擴散(例如 HR Specialist 每日使用會種下其他 persona 後續依賴的 system-of-record)
+- **Switching-trigger ownership** — 哪個 persona 感受到讓組織從既有工具切換的痛;擁有 switching trigger 的 persona 即使不是最重使用者,也是優先序候選
+- **Budget authority** — 誰掌控預算項目;當 buyer ≠ user 時相關,買方的評估準則主導初始成交決策
+- **Audit / compliance pressure ownership** — 稽核發現出事時誰的角色受影響;在 regulated B2B segment 中,承受合規壓力的 persona 通常主導優先序
+純粹「Persona X 使用頻率更高」或「Persona Y 用戶數更多」的 reasoning 對 B2B 產品 FAIL 此 Hard Gate。頻率是必要條件、非充分條件——B2B 切換由組織壓力驅動,不是個別使用率。
+**B2C 產品**的 reasoning 至少引用一個:switching-trigger ownership、JTBD severity differential、network-effect seeding、willingness-to-pay differential。純頻率 reasoning 對 B2C 也 FAIL。
 ### 📝 Persona 品質自檢清單
 - ✅ 切分是否基於「用途/任務/動機」而非人口統計？
 - ✅ 各 Persona 之間是否 MECE（互斥且完整覆蓋目標市場）？

package/i18n/zh-TW/references/02b-jtbd.md CHANGED

@@ -4,6 +4,10 @@
 > 「分析的單位不是消費者，而是消費者試圖完成的那件工作。」— Clayton Christensen
+**JTBD 三層覆蓋(Hard Gate — 三層皆必填):**
+每一份 JTBD 分析都必須**明確呈現三層**:**Functional**(要完成的任務)、**Emotional**(過程中/完成後想要感受的情緒)、**Social**(希望被他人看見的形象)。只產出 Functional 層是 JTBD 最常見的失誤——Emotional 與 Social Job 往往才是真正的切換觸發點,B2B 尤其。如果某 Persona 真的對該產品沒有明確的 Emotional 或 Social Job,請用一句話明說理由——不要靜默省略整列。
 **JTBD 標準句型（Hard Gate — 強制三段式）：**
 每一筆 JTBD（不論是 Primary、Functional、Emotional、Social 任何一層）都必須寫成完整的「**當 ... 我想要 ... 以便 ...**」三段式語句，三個子句缺一不可：
@@ -68,11 +72,13 @@ Claude 產出 JTBD 後必須自我檢查（每項必須標記 ✅ 或 ❌，❌
 ---
-### 🏢 B2B 產品專用深度要求
+### 🏢 B2B 產品專用深度要求(Hard Gate)
-**B2B 產品（含 B2B2C）必須完成以下分析：**
+**Hard Gate — 任何 B2B(含 B2B2C)產品,下列三個子分析皆為必填。漏任一個即算 contract failure,不論用戶有沒有明確要求。** 若產品類型不明確,請先問一個 clarification 問題;不要默默假設為 B2C。
-#### 組織層級 Job 分析（必填，至少覆蓋 2 層）
+#### 組織層級 Job 分析(Hard Gate — 至少覆蓋 2 層)
+B2B JTBD 分析若全停留在個人使用者層級,FAIL 此 gate。組織層級的 Job(合規稽核、跨部門審批流程、成本控制、人力政策對齊、稽核軌跡完整性)是超越任何單一使用者日常任務的需求,在 B2B 切換決策中經常占主導。下表必須產出,且 3 層中至少 2 層必須含具 B2B 特性的 Job(不是空泛的生產力陳述)。
 | 層級 | 說明 | 範例 |
 |------|------|------|
@@ -80,12 +86,13 @@ Claude 產出 JTBD 後必須自我檢查（每項必須標記 ✅ 或 ❌，❌
 | **Operational Job** | 流程/部門主管的協調需求 | 審批流程管理、跨部門資訊同步 |
 | **Task Job** | 個別使用者的日常操作需求 | 填寫表單、查詢狀態、匯出報表 |
-#### 買方（Buyer）vs 使用者（User）分析（必填）
+#### 買方(Buyer)vs 使用者(User)分析(Hard Gate)
+B2B 產品的買方(簽合約、掌預算)和日常使用者(每天接觸產品)幾乎都是兩個角色、**對應不同的 Job**。把他們當成單一 Persona 是 B2B Discovery 最常見的失誤。Hard Gate 規則:
-若買方與使用者不同人，必須分別分析其 JTBD：
-- **買方 Job**：影響採購決策的工作（ROI 說明、風險降低、合規要求）
-- **使用者 Job**：日常操作時需要完成的工作（效率提升、錯誤減少）
-- 若同一人，說明「為何此場景中決策者即使用者」
+- 若 buyer ≠ user(B2B 預設假設),請產出**兩個獨立的 Persona+JTBD 區塊**:一個給 Buyer(ROI 說明、風險降低、合規、供應商整併、稽核就緒),一個給 User(效率、減錯、日常使用情境)。並交叉連結:標出買方的 Job 何處依賴使用者的 Job(例如「買方的合規 Job 取決於使用者真的每週填報,而不是月底補登」)。
+- 若 buyer = user(少數例外,例如創辦人主導工具),請用一句話明確說明為何此特定場景下決策者即日常使用者——不要默默假設。
+- 不合格範例:只產出一個 Persona(「HR 經理」),把預算決策權與日常填表硬塞進同一個模糊原型。這把兩個不同的 Job 壓成一個含糊角色,分析無法驅動產品決策。
 #### 深挖五問 B2B 強化版

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "product-playbook",
-  "version": "1.2.9",
+  "version": "1.2.10",
   "description": "MUST use when user wants to plan or strategize a product/feature. 22 PM frameworks, 6 modes, from idea to dev handoff",
   "bin": {
     "product-playbook": "./install.sh"

package/references/02a-persona.md CHANGED

@@ -1,5 +1,20 @@
 # Stage 1: Discovery — Building Personas
+### 🚫 Discovery Output Scope (Hard Gate)
+When the orchestrator is asked to perform Discovery work (Persona, JTBD, OST, Journey Map, Continuous Discovery), the output MUST stay **inside the Discovery scope**. Discovery answers _who the users are_ and _what unmet need they are trying to satisfy_ — nothing else. The following downstream artifacts must NOT appear in a Discovery deliverable, even if they feel natural to mention:
+- **Define-stage artifacts**: positioning statements, HMW (How Might We) questions, named pain-point matrices that double as solution prompts
+- **Develop-stage artifacts**: PR-FAQ drafts, pre-mortem scenarios, RICE tables, MVP scope definitions, PRD sections, feature lists
+- **Deliver-stage artifacts**: North Star metric definitions, PMF criteria, GTM plans, business-model canvas blocks, product-spec tables
+- **Strategy-stage artifacts**: Strategy Blocks, Rumelt diagnosis/guiding-policy/coherent-action, DHM Model breakdowns, OKR ladders
+If the Discovery findings strongly suggest a downstream artifact (e.g., the JTBD analysis surfaces a clear positioning angle), note it as a one-line *open question* or *next-step pointer* at the very end — never produce the artifact itself. The next stage in the planning flow has its own dedicated step for it.
+Failing example: ending a JTBD analysis with a populated RICE table, an MVP scope list, or a "Recommended Positioning" paragraph. That output FAILS this Hard Gate even if all the other Discovery sub-sections are correct.
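The scope gate added above could be approximated with a small checker. The following is a minimal sketch, not the package's eval logic: the marker phrases are drawn from the artifact names listed in the gate, while `DOWNSTREAM_MARKERS`, `find_scope_violations`, and the choice to scan headings only are assumptions made for illustration. Scanning headings only is what lets a one-line open question at the end pass, as the gate allows.

```python
import re

# Hypothetical marker phrases per downstream stage, based on the artifact
# names listed in the gate. Not part of product-playbook itself.
DOWNSTREAM_MARKERS = {
    "Define": ["positioning statement", "recommended positioning", "how might we", "hmw"],
    "Develop": ["pr-faq", "pre-mortem", "rice", "mvp scope", "prd", "feature list"],
    "Deliver": ["north star", "pmf criteria", "gtm plan", "business-model canvas", "product spec"],
    "Strategy": ["strategy block", "guiding policy", "dhm model", "okr"],
}


def find_scope_violations(deliverable: str) -> list[tuple[str, str]]:
    """Return (stage, heading) pairs for Markdown headings that name a downstream artifact."""
    violations = []
    for line in deliverable.splitlines():
        # Assumption: a produced artifact gets its own heading, while an
        # allowed one-line pointer at the end is plain text.
        if not line.lstrip().startswith("#"):
            continue
        heading = line.lower()
        for stage, markers in DOWNSTREAM_MARKERS.items():
            # Word boundaries keep "rice" from matching "price".
            if any(re.search(rf"\b{re.escape(m)}\b", heading) for m in markers):
                violations.append((stage, line.strip()))
                break
    return violations


failing = "## JTBD\nWhen ... I want ... so that ...\n## Recommended Positioning\nFor HR teams ..."
passing = "## JTBD\nWhen ... I want ... so that ...\nOpen question: is a RICE table the right next step?"
print(find_scope_violations(failing))
print(find_scope_violations(passing))
```

A heading scan will miss artifacts embedded in running prose, so a check like this is at best a first filter ahead of a human or model-graded review.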
+---
 ## Continuous Discovery Habits (Teresa Torres)
 Build one key habit: **Talk to at least one target user every week.** Discovery is not a one-time ritual — it's an ongoing system.
@@ -10,6 +25,17 @@ Build one key habit: **Talk to at least one target user every week.** Discovery
 Personas are not segmented by age and gender, but by **purpose / task / motivation** to distinguish different types of users.
+### 🏢 B2B Hard Gate — Buyer Persona ≠ User Persona
+For any B2B (or B2B2C) product, the **buyer** (signs the contract, controls budget, owns vendor risk) and the **daily user** (touches the product every day) are almost always different roles with **different goals, pain points, and decision criteria**. Treating them as one persona conflates two distinct Jobs and produces analysis that cannot drive product decisions.
+Hard Gate rule:
+- Produce **two separate Persona blocks** labeled `Buyer` and `User` whenever the product is B2B and the two roles are distinct (the default assumption).
+- If they are the same person (rare — usually founder-led tools or sole-proprietor B2B), state explicitly in one sentence WHY the buyer is also the daily user in this specific scenario.
+- Cross-link the two personas: note where the Buyer's evaluation criteria depend on what the User actually does daily (e.g., "Buyer's audit-readiness criterion depends on User filing leave-request forms on the same day rather than batching them").
+Failing example: producing only one persona ("HR Manager") that conflates approving budget AND filing daily leave forms — two different Jobs forced into one fuzzy archetype. That output FAILS this Hard Gate.
 ```
 | Field | Persona 1: [Nickname] | Persona 2: [Nickname] | Persona 3: [Nickname] |
 |---|---|---|---|
@@ -24,6 +50,22 @@ Personas are not segmented by age and gender, but by **purpose / task / motivati
 Explain the segmentation logic; check for MECE (mutually exclusive, collectively exhaustive); identify the primary TA and secondary TA.
+### 🎯 Persona Prioritization Reasoning (Hard Gate)
+Identifying "primary TA" without explicit reasoning fails this gate. The prioritization statement MUST name one Persona as primary AND explain why **in terms specific to the product's go-to-market dynamics**, not generic frequency-of-use claims.
+For **B2B products with multiple user personas**, the reasoning MUST reference **at least one** of these B2B-specific dynamics by name (using these or clearly equivalent terms):
+- **Champion vs Buyer** — who internally advocates for adoption versus who signs the contract; champion-led adoption usually wins B2B prioritization even when buyer is the "more senior" persona
+- **Adoption multiplier** — who, by adopting, unlocks adoption for the rest of the org (e.g., HR Specialist's daily use seeds the system-of-record other personas later depend on)
+- **Switching-trigger ownership** — which persona feels the pain that justifies switching from the incumbent tool; whoever owns the switching trigger is the prioritization candidate even if they aren't the heaviest user
+- **Budget authority** — who controls the line item; relevant when buyer ≠ user and the buyer's evaluation criteria dominate the initial-deal decision
+- **Audit / compliance pressure ownership** — whose role is on the line when audit findings hit; compliance-pressured personas often dominate prioritization in regulated B2B segments
+A pure "Persona X uses it more often" or "Persona Y has more users" reasoning FAILS this Hard Gate for B2B products. Frequency is necessary, never sufficient — B2B switching is driven by org-level pressure, not individual usage rates.
+For **B2C products**, the reasoning MUST reference at least one of: switching-trigger ownership, JTBD severity differential, network-effect seeding, or willingness-to-pay differential. Pure frequency-of-use reasoning also fails for B2C.
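The prioritization gate above amounts to "name one primary Persona and cite at least one listed dynamic". Below is a minimal sketch of that rule under stated assumptions: the keyword lists, `referenced_dynamics`, and `passes_prioritization_gate` are hypothetical, and plain keyword matching cannot recognise the "clearly equivalent terms" the gate also accepts.

```python
# Dynamic names come from the gate text; the keyword lists are illustrative guesses.
B2B_DYNAMICS = {
    "champion vs buyer": ["champion"],
    "adoption multiplier": ["adoption multiplier"],
    "switching-trigger ownership": ["switching trigger", "switching-trigger"],
    "budget authority": ["budget authority"],
    "audit / compliance pressure ownership": ["audit", "compliance"],
}
B2C_DYNAMICS = {
    "switching-trigger ownership": ["switching trigger", "switching-trigger"],
    "JTBD severity differential": ["severity"],
    "network-effect seeding": ["network effect", "network-effect"],
    "willingness-to-pay differential": ["willingness to pay", "willingness-to-pay"],
}


def referenced_dynamics(reasoning: str, product_type: str) -> list[str]:
    """List the named dynamics whose keywords appear in the reasoning text."""
    table = B2B_DYNAMICS if product_type == "b2b" else B2C_DYNAMICS
    text = reasoning.lower()
    return [name for name, keywords in table.items() if any(k in text for k in keywords)]


def passes_prioritization_gate(primary_persona: str, reasoning: str, product_type: str) -> bool:
    """Pass only if one primary Persona is named and at least one dynamic is cited."""
    return bool(primary_persona.strip()) and bool(referenced_dynamics(reasoning, product_type))


frequency_only = "The HR Specialist uses the product more often than anyone else."
with_dynamics = (
    "The HR Specialist is the internal champion, and her daily use is the "
    "adoption multiplier that seeds the system-of-record."
)
print(passes_prioritization_gate("HR Specialist", frequency_only, "b2b"))
print(referenced_dynamics(with_dynamics, "b2b"))
```

The frequency-only statement fails because it cites no dynamic, which mirrors the rule that frequency alone is never sufficient.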
 ### 📝 Persona Quality Checklist
 - ✅ Is the segmentation based on "purpose/task/motivation" rather than demographics?
 - ✅ Are Personas MECE (mutually exclusive and collectively exhaustive of the target market)?

package/references/02b-jtbd.md CHANGED

@@ -4,6 +4,10 @@
 > "The unit of analysis is not the consumer, but the job the consumer is trying to get done." — Clayton Christensen
+**JTBD Three-Layer Coverage (Hard Gate — all three layers required):**
+Every JTBD analysis MUST surface **all three layers explicitly**: **Functional** (the task being completed), **Emotional** (how the user wants to feel during/after), and **Social** (how the user wants to be perceived). Producing only the Functional layer is the most common JTBD failure — Emotional and Social Jobs are routinely the real switching triggers, especially in B2B. If a single Persona genuinely has no meaningful Emotional or Social Job for the product, state that explicitly with one sentence of reasoning rather than silently omitting the row.
 **JTBD Canonical Form (Hard Gate — three-clause structure required):**
 Every JTBD statement (Primary, Functional, Emotional, Social — every layer) MUST be written as a complete three-clause sentence in the canonical form. All three clauses are required:
@@ -74,11 +78,13 @@ Claude must self-check after producing JTBD output (each item must be marked ✅
 ---
-### 🏢 B2B Product Deep-Dive Requirements
+### 🏢 B2B Product Deep-Dive Requirements (Hard Gate)
-**B2B products (including B2B2C) must complete the following analysis:**
+**Hard Gate — for any B2B (or B2B2C) product, the following three sub-analyses are all REQUIRED. Skipping any of them is a contract failure regardless of whether the user explicitly asked.** If the product type is ambiguous, ask one clarification question; do not silently default to B2C.
-#### Organizational-Level Job Analysis (Required — cover at least 2 levels)
+#### Organizational-Level Job Analysis (Hard Gate — cover at least 2 levels)
+A B2B JTBD analysis that stays purely at the individual-user level FAILS this gate. Organizational-level Jobs (compliance auditing, cross-department approval workflows, cost control, headcount-policy alignment, audit-trail integrity) are needs that exist beyond any single user's daily task and routinely dominate B2B switching decisions. The table below MUST be produced and at least 2 of the 3 levels MUST contain non-empty B2B-specific Jobs (not generic productivity statements).
 | Level | Description | Examples |
 |-------|-------------|----------|
@@ -86,12 +92,13 @@ Claude must self-check after producing JTBD output (each item must be marked ✅
 | **Operational Job** | Coordination needs at the process/department manager level | Approval workflow management, cross-team information sync |
 | **Task Job** | Day-to-day operational needs of individual users | Filling out forms, checking status, exporting reports |
-#### Buyer vs. User Analysis (Required)
+#### Buyer vs. User Analysis (Hard Gate)
+For B2B products, the buyer (signs the contract, controls budget) and the daily user (touches the product every day) are almost always different roles with **different Jobs**. Treating them as one persona is the single most common B2B Discovery failure. Hard Gate rule:
-If the buyer and user are different people, analyze their JTBD separately:
-- **Buyer Job**: Jobs that influence the purchasing decision (ROI justification, risk reduction, compliance requirements)
-- **User Job**: Jobs that need to get done during daily operations (efficiency gains, error reduction)
-- If they are the same person, explain "why the decision-maker is also the user in this scenario"
+- If buyer ≠ user (default assumption for B2B), produce **two separate Persona+JTBD blocks**: one for the Buyer (ROI justification, risk reduction, compliance, vendor-consolidation, audit-readiness) and one for the User (efficiency, error reduction, day-in-the-life context). Cross-link them: note where the buyer's Job depends on the user's Job (e.g., "buyer's compliance Job depends on user actually filing the report each cycle").
+- If buyer = user (exceptional, e.g., founder-led tools), state explicitly in one sentence WHY the decision-maker is also the daily user in this specific scenario — do not assume.
+- Failing example: producing only one persona ("HR Manager") that conflates budget authority with daily form-filling. That collapses two distinct Jobs into one fuzzy persona, so the analysis cannot drive product decisions.
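Taken together, the JTBD gates in this file (three-layer coverage, organizational-level coverage, and the Buyer vs. User split) can be pictured as structural checks on an analysis record. The sketch below is an assumption-laden illustration: `JtbdAnalysis`, its fields, and `gate_failures` are hypothetical names. It does not model the third required sub-analysis (the Deep-Dive Five Questions, whose content is not shown in this hunk), and it cannot judge whether a Job is B2B-specific rather than a generic productivity statement.

```python
from dataclasses import dataclass, field

LAYERS = ("functional", "emotional", "social")


@dataclass
class JtbdAnalysis:
    product_type: str                      # "b2b" or "b2c"
    layers: dict[str, str]                 # layer -> Job statement, or a one-sentence reason it is absent
    org_level_jobs: dict[str, list[str]] = field(default_factory=dict)  # level name -> Jobs
    persona_blocks: set[str] = field(default_factory=set)               # labels of Persona+JTBD blocks
    buyer_is_user_reason: str = ""         # required only when buyer = user


def gate_failures(analysis: JtbdAnalysis) -> list[str]:
    """Return a human-readable list of Hard Gate failures (empty list means pass)."""
    failures = []
    # Three-layer coverage: each layer needs a statement or an explicit reason.
    for layer in LAYERS:
        if not analysis.layers.get(layer, "").strip():
            failures.append(f"missing {layer} layer")
    if analysis.product_type == "b2b":
        # Organizational-level coverage: at least 2 levels with non-empty Jobs.
        covered = [lvl for lvl, jobs in analysis.org_level_jobs.items()
                   if any(j.strip() for j in jobs)]
        if len(covered) < 2:
            failures.append("organizational-level jobs cover fewer than 2 levels")
        # Buyer vs. User: two separate blocks, or one sentence explaining why not.
        separated = {"Buyer", "User"} <= analysis.persona_blocks
        if not separated and not analysis.buyer_is_user_reason.strip():
            failures.append("buyer and user not separated, and no reason given")
    return failures


good = JtbdAnalysis(
    product_type="b2b",
    layers={"functional": "When ...", "emotional": "When ...", "social": "When ..."},
    org_level_jobs={"Operational Job": ["Approval workflow management"],
                    "Task Job": ["Exporting reports"]},
    persona_blocks={"Buyer", "User"},
)
bad = JtbdAnalysis(
    product_type="b2b",
    layers={"functional": "When ..."},
    org_level_jobs={"Task Job": ["Filling out forms"]},
    persona_blocks={"HR Manager"},
)
print(gate_failures(good))
print(gate_failures(bad))
```

Returning a list of failures instead of a single boolean keeps each gate separately reportable, which matters when a skipped sub-analysis counts as a contract failure on its own.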
 #### Deep-Dive Five Questions — B2B Enhanced Version