npm - product-playbook - Versions diffs - 1.2.9 → 1.2.11 - Mend

product-playbook 1.2.9 → 1.2.11


Files changed (20)

package/README.es.md +29 -2
package/README.ja.md +29 -2
package/README.ko.md +29 -2
package/README.md +29 -2
package/README.zh-CN.md +29 -2
package/README.zh-TW.md +29 -2
package/SKILL.md +14 -0
package/i18n/es/references/02a-persona.md +42 -0
package/i18n/es/references/02b-jtbd.md +34 -15
package/i18n/ja/references/02a-persona.md +42 -0
package/i18n/ja/references/02b-jtbd.md +34 -15
package/i18n/ko/references/02a-persona.md +42 -0
package/i18n/ko/references/02b-jtbd.md +34 -15
package/i18n/zh-CN/references/02a-persona.md +42 -0
package/i18n/zh-CN/references/02b-jtbd.md +34 -15
package/i18n/zh-TW/references/02a-persona.md +42 -0
package/i18n/zh-TW/references/02b-jtbd.md +34 -15
package/package.json +1 -1
package/references/02a-persona.md +42 -0
package/references/02b-jtbd.md +34 -15

package/README.es.md CHANGED

@@ -479,6 +479,33 @@ Una iteración de reducción de tokens. Misma semántica del contenido del skill
 **Replicado a 5 locales i18n** (zh-TW, zh-CN, ja, es, ko) preservando las traducciones existentes — el adelgazamiento estructural se aplicó de manera idéntica por idioma.
+### Iteración 7: Resiliencia del Harness de Evals (Sprint 1 + 2A, v1.2.9)
+Una iteración a nivel de harness, no a nivel de skill. La semántica del skill no cambió; lo que cambió es la *superficie que se mide*. Objetivo: hacer visible la línea base real de calidad desbloqueando 4 evals que venían produciendo veredictos 0/0 en silencio.
+**Sprint 1 — desbloquear los clusters no medibles (`d2023fb`, `cee67cb`):**
+Cuatro evals (`eval-jtbd-depth`, `eval-prfaq-output`, `eval-subagent-discovery`, `eval-subagent-premortem`) venían produciendo 0 pass / 0 fail por corrida — indistinguibles de "sin problemas" en el puntaje agregado. Tres causas:
+1. **Sub-agents faltantes en CI headless** — CI instalaba el skill en `~/.claude/skills/` pero nunca copiaba `agents/*.md` a `~/.claude/agents/`. `claude -p` por lo tanto no podía despachar vía `Task`, y el orchestrator corría inline en silencio.
+2. **El hook de specialist-dispatch silencioso bajo `claude -p`** — los `hooks/` a nivel de plugin no se cargan en modo headless; sólo se cargan los UserPromptSubmit hooks de `~/.claude/settings.json` a nivel de usuario. CI ahora registra programáticamente el dispatch hook a nivel de usuario antes de cada corrida behavioral.
+3. **Timeouts de response + judge demasiado agresivos** — 180s response / 120s judge cortaban en medio las salidas largas de Discovery y Pre-mortem; el judge entonces veía un string truncado y emitía 0/0. Subido a 600s / 240s con un reintento ante salidas no-JSON.
+También se eliminaron las expectations procedurales tipo "el orchestrator delega vía Task tool" de los evals 10/11/12 — esas son inverificables en `claude -p` (sin superficie de Task anidado) y no son la propiedad que finalmente nos importa. Las expectations restantes apuntan a la *calidad de output* que el specialist habría producido.
+**Sprint 2A — robustez del judge + techo de CI (`f973939`):**
+Dos correcciones de seguimiento del code review del PR #9:
+1. **El reintento de repair del judge preserva el contexto original** — `claude -p` es stateless, así que el repair prompt ahora vuelve a incluir el `judge_prompt` original completo (response + expectations) más la salida malformed anterior. Una nueva verificación `_judge_output_complete()` rechaza payloads que no tengan exactamente N expectations indexadas, evitando que el modelo emita un veredicto plausible-pero-fabricado cuando la salida del primer call es irrecuperable.
+2. **Timeout del job `behavioral-eval` de CI 90 → 120 min** — el peor caso = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 min, así que el techo previo de 90 min podía cancelar en silencio una corrida válida. 120 min deja ~10 min de margen para setup + artifact upload.
+**Línea base recién visible** (corrida local, 2026-05-28): **0 / 100** `at-risk`, **13 / 33** expectations pasando, **6 critical + 14 warning** failures. El puntaje agregado no regresó — lo que regresó es el puntaje *visible*, porque cuatro evals que antes contribuían 0/0 ahora producen señal real. Los 6 critical failures son el backlog explícito de Stage 2: JTBD de 3 capas (funcional / emocional / social), Jobs a nivel organizacional B2B, separación de persona buyer vs user en B2B, guardarraíles de scope de Discovery, y disciplina de leading-indicator en pre-mortem. Ver [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) para el desglose por expectation.
+**Las mejoras de harness viven en `evals/` y `.github/workflows/` — no se publican a npm.** No hace falta version bump más allá de v1.2.9 (que ya cargó el hook a nivel de usuario y las ediciones de scope a los evals 10/11/12).
+**Replicado a 5 locales i18n** (zh-TW, zh-CN, ja, es, ko).
 ---
 ## 🧪 Desarrollo y Evals
@@ -487,7 +514,7 @@ El directorio `evals/` incluye dos suites de pruebas complementarias y un scorer
 **Local (gratis, recomendado)**: ejecuta los mismos scripts con el CLI `claude` autenticado con tu suscripción Claude Pro/Max (un solo `claude login`). Sin API key, sin costo marginal. El sistema de eval está diseñado para correr localmente antes de cada release.
-**CI (opcional, pago)**: `.github/workflows/eval-gate.yml` ejecuta ambas suites en cada PR y en cada push a `main` que cambie `package.json`, y reporta el puntaje en el Job Summary del workflow. **No bloquea merge ni publish** — el maintainer decide si actuar ante regresiones. CI requiere el secret `ANTHROPIC_API_KEY` (GitHub Actions no puede usar OAuth en un contenedor headless); sin el secret, los jobs de eval **se omiten limpiamente** (gris ⏭️) en lugar de fallar en rojo.
+**CI (opcional, sin costo adicional)**: `.github/workflows/eval-gate.yml` ejecuta ambas suites en cada PR y en cada push a `main` que cambie `package.json`, y reporta el puntaje en el Job Summary del workflow. **No bloquea merge ni publish** — el maintainer decide si actuar ante regresiones. CI también usa tu suscripción Claude Pro/Max (sin API key, sin costo por token): configuración única, ejecuta `claude setup-token` localmente y agregá el token impreso como secret `CLAUDE_CODE_OAUTH_TOKEN` del repo. Sin el secret, los jobs de eval **se omiten limpiamente** (gris ⏭️) en lugar de fallar en rojo.
 ### Ejecución local
@@ -508,7 +535,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # solo reporta, sin exit 1
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-Los runs locales usan `--runs 3` por defecto (mayoría absorbe la variabilidad del LLM). El CLI `claude` usa tu sesión OAuth de Claude Pro/Max (`claude login`), sin costo por token. CI usa `--runs 1` y requiere el secret `ANTHROPIC_API_KEY`.
+Los runs locales usan `--runs 3` por defecto (mayoría absorbe la variabilidad del LLM). El CLI `claude` usa tu sesión OAuth de Claude Pro/Max (`claude login`), sin costo por token. CI usa `--runs 1` y la misma suscripción vía un secret `CLAUDE_CODE_OAUTH_TOKEN` (generado una sola vez con `claude setup-token`).
 ### Severity y scoring

package/README.ja.md CHANGED

@@ -480,6 +480,33 @@ token 削減イテレーション。スキル内容のセマンティクスは
 **5 つの i18n ロケール（zh-TW、zh-CN、ja、es、ko）にミラー** — 既存の翻訳を保持しつつ、構造的なスリム化を言語ごとに同一に適用。
+### イテレーション7:Eval Harness のレジリエンス強化（Sprint 1 + 2A、v1.2.9）
+ハーネス層のイテレーションであって、スキル層ではない。スキルのセマンティクスは変わっていない。変わったのは**測定対象の表面積**。目標は、ずっと 0/0 verdict を黙って出していた 4 つの eval を解除し、真の品質ベースラインを浮かび上がらせること。
+**Sprint 1 — 測定不能だったクラスタの解除（`d2023fb`、`cee67cb`):**
+4 つの eval(`eval-jtbd-depth`、`eval-prfaq-output`、`eval-subagent-discovery`、`eval-subagent-premortem`)は毎回 0 pass / 0 fail を返しており、集計スコアでは「問題なし」と区別がつかなかった。3 つの原因:
+1. **headless CI で sub-agent が欠落** — CI はスキルを `~/.claude/skills/` にインストールしていたが、`agents/*.md` を `~/.claude/agents/` にコピーしていなかった。そのため `claude -p` は `Task` 経由で dispatch できず、orchestrator が黙って inline で実行していた。
+2. **`claude -p` 下で specialist-dispatch hook が無音** — plugin レベルの `hooks/` は headless モードでは読み込まれず、user レベルの `~/.claude/settings.json` の UserPromptSubmit hook のみが読み込まれる。CI は各 behavioral run の前に dispatch hook を user レベルにプログラム的に登録するようになった。
+3. **Response + judge の timeout が短すぎた** — 180s response / 120s judge では長文の Discovery / Pre-mortem 出力が途中で切れ、judge は切り詰められた文字列を見て 0/0 を吐いていた。600s / 240s に引き上げ、非 JSON 出力時には 1 回リトライ。
+また evals 10/11/12 から「orchestrator が Task ツール経由で dispatch する」という手続き的 expectation も削除した — `claude -p` には nested Task の表面がなく検証不能で、最終的に我々が気にする性質でもない。残りの expectation は specialist が産出すべき**アウトプット品質**を対象とする。
+**Sprint 2A — judge のロバストネス + CI 上限(`f973939`):**
+PR #9 のコードレビューからの 2 つのフォローアップ修正:
+1. **Judge 修復リトライがオリジナルの context を保持** — `claude -p` はステートレスなので、修復 prompt は完全な元の `judge_prompt`(response + expectations)と前回の malformed output を再投入するようになった。新しい `_judge_output_complete()` チェックは N 個の indexed expectation がぴったり揃っていないペイロードを拒否し、初回出力が回復不能なときに model が「形だけ整った捏造 verdict」を吐くのを防ぐ。
+2. **CI `behavioral-eval` ジョブの timeout 90 → 120 分** — 最悪ケース = 12 evals / 2 workers × (600s response + 240s judge + 240s repair)≈ 108 分なので、以前の 90 分上限は有効な run を黙って cancel する可能性があった。120 分は setup + artifact upload に ~10 分の余裕を残す。
+**新たに可視化されたベースライン**(ローカル run、2026-05-28):**0 / 100** `at-risk`、**13 / 33** expectation がパス、**6 critical + 14 warning** の失敗。集計スコアは退行していない — 退行したのは**可視**スコアで、これまで 0/0 を貢献していた 4 つの eval が今は実シグナルを返すようになったため。この 6 つの critical 失敗が Stage 2 の明示的な backlog:3 層 JTBD(functional / emotional / social)、B2B 組織レベルの Jobs、B2B buyer vs user persona の分離、Discovery scope のガードレール、pre-mortem の leading-indicator 規律。expectation 単位の内訳は [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) を参照。
+**ハーネス改善は `evals/` と `.github/workflows/` に住み、npm には出荷されない。** v1.2.9 を超えるバージョンバンプは不要(v1.2.9 にはすでに user-level hook と evals 10/11/12 の scope 調整が含まれている)。
+**5 つの i18n ロケール(zh-TW、zh-CN、ja、es、ko)にミラー**。
 ---
 ## 🧪 開発と評価
@@ -488,7 +515,7 @@ token 削減イテレーション。スキル内容のセマンティクスは
 **ローカル（無料、推奨）**：`claude` CLI を Claude Pro/Max サブスクリプションで認証して（一度だけ `claude login`）同じスクリプトを実行できます。API key 不要、追加コストなし。eval システムは各リリース前にローカルで実行する設計です。
-**CI（オプション、有料）**：`.github/workflows/eval-gate.yml` はすべての PR と `package.json` を変更する `main` への push で両方を実行し、スコアを workflow の Job Summary に書き込みます。**merge も publish もブロックしません** — リグレッションに対応するかどうかはメンテナが判断します。CI には `ANTHROPIC_API_KEY` secret が必要です（GitHub Actions は headless コンテナで OAuth が使えません）。secret 未設定時は eval job が**クリーンに skip**（グレー ⏭️）され、誤解を招く赤バツは出ません。
+**CI（オプション、追加課金なし）**：`.github/workflows/eval-gate.yml` はすべての PR と `package.json` を変更する `main` への push で両方を実行し、スコアを workflow の Job Summary に書き込みます。**merge も publish もブロックしません** — リグレッションに対応するかどうかはメンテナが判断します。CI もあなたの Claude Pro/Max サブスクリプションを使用します（API key 不要、トークン課金なし）：ローカルで `claude setup-token` を一度実行し、出力されたトークンを repo secret `CLAUDE_CODE_OAUTH_TOKEN` として追加してください。secret 未設定時は eval job が**クリーンに skip**（グレー ⏭️）され、誤解を招く赤バツは出ません。
 ### ローカル実行
@@ -509,7 +536,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # レポートのみ、exi
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-ローカルは `--runs 3` がデフォルト（多数決で LLM のばらつきを吸収）。`claude` CLI は Claude Pro/Max の OAuth セッション（`claude login`）を使うため、トークン課金はありません。CI は `--runs 1` で、`ANTHROPIC_API_KEY` secret が必要です。
+ローカルは `--runs 3` がデフォルト（多数決で LLM のばらつきを吸収）。`claude` CLI は Claude Pro/Max の OAuth セッション（`claude login`）を使うため、トークン課金はありません。CI は `--runs 1` で、同じサブスクリプションを `CLAUDE_CODE_OAUTH_TOKEN` secret 経由で利用します（`claude setup-token` で一度だけ生成）。
 ### Severity とスコアリング

package/README.ko.md CHANGED

@@ -479,6 +479,33 @@ v1.2.0+ 에서 도입된 3개의 전문 sub-agent (`discovery-specialist`, `stra
 **5개 i18n 로케일 (zh-TW, zh-CN, ja, es, ko) 에 미러링** — 기존 번역을 보존하며, 구조적 슬림화는 언어별로 동일하게 적용.
+### 반복 7: Eval Harness 회복탄력성 강화 (Sprint 1 + 2A, v1.2.9)
+스킬 레벨이 아니라 harness 레벨의 반복. 스킬의 의미는 바뀌지 않았고, 바뀐 것은 **측정되는 표면**. 목표는 0/0 verdict 만 조용히 내고 있던 4개 eval 의 차단을 풀어, 진짜 품질 베이스라인을 드러내는 것.
+**Sprint 1 — 측정 불능 클러스터 해제 (`d2023fb`, `cee67cb`):**
+4개 eval (`eval-jtbd-depth`, `eval-prfaq-output`, `eval-subagent-discovery`, `eval-subagent-premortem`) 이 매번 0 pass / 0 fail 을 내고 있어, 집계 점수에서 「문제 없음」과 구분이 불가능했음. 세 가지 원인:
+1. **headless CI 에서 sub-agent 누락** — CI 가 스킬을 `~/.claude/skills/` 에 설치하면서 `agents/*.md` 를 `~/.claude/agents/` 에 복사하지 않았음. 그래서 `claude -p` 는 `Task` 로 dispatch 할 수 없었고, orchestrator 가 조용히 inline 으로 실행했음.
+2. **`claude -p` 에서 specialist-dispatch hook 이 무음** — 플러그인 레벨 `hooks/` 는 headless 모드에서 로드되지 않으며, user 레벨 `~/.claude/settings.json` 의 UserPromptSubmit hook 만 로드됨. CI 는 이제 각 behavioral run 전에 dispatch hook 을 프로그램적으로 user 레벨에 등록함.
+3. **Response + judge timeout 이 너무 빡빡함** — 180s response / 120s judge 가 장문의 Discovery / Pre-mortem 출력을 중간에 잘랐고, judge 는 잘린 문자열을 보고 0/0 을 뱉었음. 600s / 240s 로 올리고, 비-JSON 출력 시 1회 재시도.
+또한 evals 10/11/12 에서 「orchestrator 가 Task tool 로 dispatch」 같은 절차적 expectation 을 제거 — `claude -p` 에 nested Task 표면이 없어 검증 불가하며, 우리가 최종적으로 신경 쓰는 성질도 아님. 남은 expectation 은 specialist 가 산출했어야 할 **출력 품질**을 대상으로 함.
+**Sprint 2A — judge 견고성 + CI 상한 (`f973939`):**
+PR #9 코드 리뷰의 두 가지 후속 수정:
+1. **Judge repair 재시도가 원본 context 보존** — `claude -p` 는 stateless 이므로, repair prompt 는 원본 `judge_prompt` (response + expectations) 전체와 이전 malformed output 을 다시 포함함. 새 `_judge_output_complete()` 체크가 정확히 N 개의 indexed expectation 이 없는 payload 를 거부하여, 첫 호출 출력이 복구 불가능할 때 model 이 「형태만 그럴듯한 위조 verdict」를 내는 것을 방지.
+2. **CI `behavioral-eval` 작업 timeout 90 → 120 분** — 최악 = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 분이므로, 이전 90분 상한은 유효한 run 을 조용히 cancel 할 수 있었음. 120분은 setup + artifact upload 에 ~10분 여유.
+**새로 가시화된 베이스라인** (로컬 run, 2026-05-28): **0 / 100** `at-risk`, **13 / 33** expectation 통과, **6 critical + 14 warning** 실패. 집계 점수가 퇴보한 것이 아니라, **가시적**인 점수가 퇴보한 것 — 이전에 0/0 을 기여하던 4개 eval 이 이제 실제 signal 을 냄. 이 6개 critical 실패가 Stage 2 의 명시적 backlog: 3계층 JTBD (functional / emotional / social), B2B 조직 수준 Jobs, B2B buyer vs user persona 분리, Discovery scope 가드레일, pre-mortem leading-indicator 규율. expectation 단위 내역은 [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) 참조.
+**Harness 개선은 `evals/` 와 `.github/workflows/` 에 거주하며, npm 으로 출하되지 않음.** v1.2.9 이상의 버전 bump 불필요 (v1.2.9 가 이미 user-level hook 과 evals 10/11/12 의 scope 조정을 포함).
+**5개 i18n 로케일 (zh-TW, zh-CN, ja, es, ko) 에 미러링**.
 ---
 ## 🧪 개발 및 평가
@@ -487,7 +514,7 @@ v1.2.0+ 에서 도입된 3개의 전문 sub-agent (`discovery-specialist`, `stra
 **로컬 (무료, 권장)**: `claude` CLI를 Claude Pro/Max 구독으로 인증해서 (한 번만 `claude login`) 같은 스크립트를 실행합니다. API key 불필요, 추가 비용 없음. eval 시스템은 각 릴리스 전에 로컬에서 실행하도록 설계되었습니다.
-**CI (선택, 유료)**: `.github/workflows/eval-gate.yml`은 모든 PR과 `package.json`을 변경하는 `main`으로의 push에서 두 가지 모두 실행하고, 점수를 workflow Job Summary에 기록합니다. **merge도 publish도 차단하지 않습니다** — 회귀에 대응할지는 유지보수자가 판단합니다. CI는 `ANTHROPIC_API_KEY` secret이 필요합니다 (GitHub Actions는 headless 컨테이너에서 OAuth를 사용할 수 없습니다). secret 미설정 시 eval job은 **깨끗하게 skip**(회색 ⏭️)되며 오해를 일으키는 빨간색 X는 표시되지 않습니다.
+**CI (선택, 추가 과금 없음)**: `.github/workflows/eval-gate.yml`은 모든 PR과 `package.json`을 변경하는 `main`으로의 push에서 두 가지 모두 실행하고, 점수를 workflow Job Summary에 기록합니다. **merge도 publish도 차단하지 않습니다** — 회귀에 대응할지는 유지보수자가 판단합니다. CI도 Claude Pro/Max 구독을 사용합니다 (API key 불필요, 토큰 과금 없음): 로컬에서 `claude setup-token`을 한 번 실행하고, 출력된 토큰을 repo secret `CLAUDE_CODE_OAUTH_TOKEN`으로 추가하세요. secret 미설정 시 eval job은 **깨끗하게 skip**(회색 ⏭️)되며 오해를 일으키는 빨간색 X는 표시되지 않습니다.
 ### 로컬 실행
@@ -508,7 +535,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # 보고만, exit 1 없음
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-로컬은 `--runs 3`이 기본값(다수결로 LLM 변동성 흡수). `claude` CLI는 Claude Pro/Max OAuth 세션(`claude login`)을 사용하므로 토큰당 비용이 없습니다. CI는 `--runs 1`을 사용하며 `ANTHROPIC_API_KEY` secret이 필요합니다.
+로컬은 `--runs 3`이 기본값(다수결로 LLM 변동성 흡수). `claude` CLI는 Claude Pro/Max OAuth 세션(`claude login`)을 사용하므로 토큰당 비용이 없습니다. CI는 `--runs 1`을 사용하며 동일한 구독을 `CLAUDE_CODE_OAUTH_TOKEN` secret을 통해 인증합니다 (`claude setup-token`으로 한 번만 생성).
 ### Severity 및 스코어링

package/README.md CHANGED

@@ -477,6 +477,33 @@ A token-reduction iteration. Same skill content semantics, smaller footprint per
 **Mirrored to 5 i18n locales** (zh-TW, zh-CN, ja, es, ko) preserving existing translations — structural slim applied identically per language.
+### Iteration 7: Eval Harness Resilience (Sprint 1 + 2A, v1.2.9)
+A harness-level iteration, not a skill-level one. No skill semantics changed; the *surface area being measured* did. Goal: surface the real quality baseline by unblocking 4 evals that had been silently producing 0/0 verdicts.
+**Sprint 1 — unblock unmeasurable clusters (`d2023fb`, `cee67cb`):**
+Four evals (`eval-jtbd-depth`, `eval-prfaq-output`, `eval-subagent-discovery`, `eval-subagent-premortem`) had been producing 0 passes / 0 fails per run — indistinguishable from "no problems" in the aggregate score. Three causes:
+1. **Sub-agents missing in headless CI** — CI installed the skill at `~/.claude/skills/` but never copied `agents/*.md` to `~/.claude/agents/`. `claude -p` therefore couldn't dispatch via `Task`, and the orchestrator silently inline-ran.
+2. **Specialist-dispatch hook silent under `claude -p`** — plugin-level `hooks/` are not loaded in headless mode; only user-level `~/.claude/settings.json` UserPromptSubmit hooks are. CI now programmatically registers the dispatch hook at the user level before each behavioral run.
+3. **Response + judge timeouts too aggressive** — 180s response / 120s judge cut off long-form Discovery and Pre-mortem outputs mid-thought; the judge then saw a truncated string and emitted 0/0. Bumped to 600s / 240s with a single retry on non-JSON output.
+Also dropped procedural "orchestrator delegates via Task tool" expectations from evals 10/11/12 — those are unverifiable in `claude -p` (no nested Task surface) and not the property we ultimately care about. The remaining expectations target the *output quality* the specialist would have produced.
+**Sprint 2A — judge robustness + CI ceiling (`f973939`):**
+Two follow-on fixes from PR #9 code review:
+1. **Judge repair retry preserves original context** — `claude -p` is stateless, so the repair prompt now re-includes the full original `judge_prompt` (response + expectations) plus the previous malformed output. A new `_judge_output_complete()` check rejects payloads that don't have exactly N indexed expectations, preventing the model from emitting a plausibly-shaped but fabricated verdict when the first call's output is unrecoverable.
+2. **CI `behavioral-eval` job timeout 90 → 120 min** — worst case = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 min, so the previous 90-min ceiling could silently cancel an otherwise valid run. 120 min leaves ~10 min headroom for setup + artifact upload.
+**Newly visible baseline** (local run, 2026-05-28): **0 / 100** `at-risk`, **13 / 33** expectations passing, **6 critical + 14 warning** failures. The aggregate score did not regress — what regressed is the *visible* score, because four evals that previously contributed 0/0 now produce real signal. The 6 critical failures are now the explicit Stage 2 backlog: 3-layer JTBD (functional / emotional / social), B2B organization-level Jobs, B2B buyer-vs-user persona separation, Discovery-scope guardrails, and pre-mortem leading-indicator discipline. See [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) for the per-expectation breakdown.
+**Harness improvements live in `evals/` and `.github/workflows/` — they do not ship to npm.** No version bump beyond v1.2.9 (which carried the user-level hook + scope edits to evals 10/11/12).
+**Mirrored to 5 i18n locales** (zh-TW, zh-CN, ja, es, ko).
 ---
 ## 🧪 Development & Evals
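
The `_judge_output_complete()` check described under Sprint 2A above lives in `evals/`, which does not ship to npm, so its source is not part of this diff. The following is only a minimal sketch of the idea, assuming a hypothetical judge payload shaped like `{"results": [{"index": 0, "passed": true}, ...]}`. The function name `judge_output_complete`, the field names, and the JSON shape are all illustrative assumptions, not the project's actual code.

```python
import json


def judge_output_complete(raw: str, n_expectations: int) -> bool:
    """Return True only if `raw` is JSON carrying exactly one verdict per expectation.

    Hypothetical payload shape (an assumption for this sketch):
        {"results": [{"index": 0, "passed": true}, {"index": 1, "passed": false}]}
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False  # non-JSON output: caller should trigger the repair retry

    if not isinstance(payload, dict):
        return False

    results = payload.get("results")
    if not isinstance(results, list) or len(results) != n_expectations:
        return False  # too few or too many verdicts

    indices = []
    for item in results:
        if not isinstance(item, dict) or not isinstance(item.get("passed"), bool):
            return False
        index = item.get("index")
        if type(index) is not int:
            return False
        indices.append(index)

    # Every expectation index 0..N-1 must appear exactly once.
    return sorted(indices) == list(range(n_expectations))
```

A payload that fails this check would be treated the same as malformed output, which is what keeps a truncated or partially invented verdict from being scored.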
@@ -485,7 +512,7 @@ The `evals/` directory ships two complementary test suites and a deterministic s
 **Local (free, recommended):** run the same scripts with the `claude` CLI authenticated via your Claude Pro/Max subscription (`claude login` once). No API key, no marginal cost. The eval system is designed to be run locally before each release.
-**CI (optional, paid):** `.github/workflows/eval-gate.yml` will run both suites on every PR and on every push to `main` that changes `package.json`, then report the score to the workflow Job Summary. It **never blocks merge or publish** — the maintainer decides whether to act on regressions. CI requires an `ANTHROPIC_API_KEY` secret because GitHub Actions cannot use OAuth in a headless container; without the secret, eval jobs **skip cleanly** (gray ⏭️) instead of failing red.
+**CI (optional, no extra billing):** `.github/workflows/eval-gate.yml` will run both suites on every PR and on every push to `main` that changes `package.json`, then report the score to the workflow Job Summary. It **never blocks merge or publish** — the maintainer decides whether to act on regressions. CI runs on your Claude Pro/Max subscription (no API key, no per-token cost): one-time setup is `claude setup-token` locally, then add the printed token as repo secret `CLAUDE_CODE_OAUTH_TOKEN`. Without the secret, eval jobs **skip cleanly** (gray ⏭️) instead of failing red.
 ### Running locally
@@ -506,7 +533,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # report without exit 1
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-Local runs default to `--runs 3` (majority vote handles LLM variance); the `claude` CLI uses your Claude Pro/Max OAuth session (`claude login`), so there's no per-token cost. CI uses `--runs 1` and requires the `ANTHROPIC_API_KEY` secret.
+Local runs default to `--runs 3` (majority vote handles LLM variance); the `claude` CLI uses your Claude Pro/Max OAuth session (`claude login`), so there's no per-token cost. CI uses `--runs 1` and the same subscription via a `CLAUDE_CODE_OAUTH_TOKEN` secret (generated once with `claude setup-token`).
 ### Severity & scoring
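
The CI timeout change in Sprint 2A rests on a worst-case estimate: 12 evals spread over 2 workers, each eval possibly exhausting the response, judge, and repair timeouts. A small sketch of that arithmetic follows. The helper name `worst_case_minutes` is hypothetical, and it assumes evals run in batches of `workers` with every call hitting its timeout.

```python
import math


def worst_case_minutes(n_evals: int, workers: int,
                       response_s: int, judge_s: int, repair_s: int) -> float:
    """Upper bound on wall-clock minutes if every eval hits every timeout."""
    batches = math.ceil(n_evals / workers)        # evals run `workers` at a time
    per_eval_s = response_s + judge_s + repair_s  # response + judge + one repair retry
    return batches * per_eval_s / 60


budget = worst_case_minutes(12, 2, 600, 240, 240)
print(budget)        # 108.0
print(budget > 90)   # True: the old 90-minute ceiling could cancel a valid run
print(120 - budget)  # 12.0 minutes left for setup and artifact upload
```

Under these assumptions the new 120-minute ceiling leaves about 12 minutes of slack, in line with the "~10 min headroom" quoted in the README.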

package/README.zh-CN.md CHANGED

@@ -479,6 +479,33 @@ Claude Code 会自动：
 **已同步至 5 个 i18n 语系**（zh-TW、zh-CN、ja、es、ko），保留既有译文 —— 结构性瘦身按语系一致套用。
+### Iteration 7：Eval Harness 韧性强化（Sprint 1 + 2A，v1.2.9）
+Harness 层迭代，不是 skill 层。Skill 语意没变，变的是**被测量的表面**。目标：解除 4 个一直在悄悄产出 0/0 verdict 的 eval，让真实品质基线浮出水面。
+**Sprint 1 — 解锁无法测量的群集（`d2023fb`、`cee67cb`）：**
+4 个 eval（`eval-jtbd-depth`、`eval-prfaq-output`、`eval-subagent-discovery`、`eval-subagent-premortem`）每次都产出 0 pass / 0 fail，在汇总分数中与「没问题」无法区分。三个原因：
+1. **CI headless 模式缺少 sub-agent** —— CI 把 skill 装到 `~/.claude/skills/`，却没把 `agents/*.md` 复制到 `~/.claude/agents/`。`claude -p` 因此无法透过 `Task` 派发，orchestrator 只能默默 inline 执行。
+2. **Specialist-dispatch hook 在 `claude -p` 不会载入** —— plugin 层的 `hooks/` 在 headless 模式不会载入，只有 user 层 `~/.claude/settings.json` 的 UserPromptSubmit hook 会。CI 现在会在每次 behavioral run 之前以程序方式把 dispatch hook 注册到 user 层。
+3. **Response + judge timeout 太紧** —— 180s response / 120s judge 会把长篇 Discovery、Pre-mortem 输出中途切断，judge 看到截断字串就吐出 0/0。提升到 600s / 240s，且非 JSON 输出时重试一次。
+同时也从 evals 10/11/12 删掉「orchestrator 必须透过 Task 派发」这类程序性 expectation —— 在 `claude -p` 没有 nested Task 介面，无法验证，也不是我们最终在意的性质。留下的 expectation 都针对 specialist 应产出的**输出质量**。
+**Sprint 2A — judge 韧性 + CI 上限（`f973939`）：**
+PR #9 review 之后的两个跟进修正：
+1. **Judge 修复重试保留原始 context** —— `claude -p` 是无状态的，所以修复 prompt 现在会重新带入完整原始 `judge_prompt`（response + expectations）加上前一次的 malformed output。新的 `_judge_output_complete()` 检查会拒绝「没有完整 N 个 indexed expectation」的回应，避免 model 在第一次输出无法救援时凭空捏造一份看起来合理的 verdict。
+2. **CI `behavioral-eval` job timeout 90 → 120 分钟** —— 最坏情况 = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 分钟，先前 90 分钟上限可能默默 cancel 整轮 run。120 分钟给 setup + artifact upload 留 ~10 分钟余裕。
+**新可见的基线**（本机 run，2026-05-28）：**0 / 100** `at-risk`、**13 / 33** expectation 通过、**6 critical + 14 warning** 失败。汇总分数并没有退步，退的是**可见**分数 —— 四个原本贡献 0/0 的 eval 现在开始产出真实 signal。这 6 个 critical 失败就是 Stage 2 明确的待修清单：三层 JTBD（functional / emotional / social）、B2B 组织层 Jobs、B2B buyer vs user persona 分离、Discovery scope 守备、pre-mortem leading-indicator 纪律。逐项细节见 [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md)。
+**Harness 改进住在 `evals/` 与 `.github/workflows/`，不会发到 npm。** 版本不需要再往 v1.2.9 之上 bump（v1.2.9 已经包含 user-level hook 与 evals 10/11/12 的 scope 调整）。
+**已同步至 5 个 i18n 语系**（zh-TW、zh-CN、ja、es、ko）。
 ---
 ## 🧪 开发与评测
@@ -487,7 +514,7 @@ Claude Code 会自动：
 **本地（免费，推荐）**：用 `claude` CLI 搭配你的 Claude Pro/Max 订阅（先 `claude login` 一次）跑这些 script。不需要 API key、没有额外成本。整套 eval 系统就是设计来在每次发版前本地跑一遍。
-**CI（可选，付费）**：`.github/workflows/eval-gate.yml` 会在每个 PR 与每次 push 到 `main`（含 `package.json` 变动）时跑这两套，把分数写进 workflow 的 Job Summary。**不挡 merge、不挡 publish** — 看到结果后由维护者决定要不要调整。CI 需要 `ANTHROPIC_API_KEY` secret（GitHub Actions 在 headless 容器无法走 OAuth）；没设 secret 时 eval job **会干净地 skip**（灰色 ⏭️），不会出现误导的红叉。
+**CI（可选，不额外计费）**：`.github/workflows/eval-gate.yml` 会在每个 PR 与每次 push 到 `main`（含 `package.json` 变动）时跑这两套，把分数写进 workflow 的 Job Summary。**不挡 merge、不挡 publish** — 看到结果后由维护者决定要不要调整。CI 同样走你的 Claude Pro/Max 订阅（不需 API key、没有按 token 计费的成本）：一次性设置为本机 `claude setup-token` 生成长期 token，把它加进 repo secret `CLAUDE_CODE_OAUTH_TOKEN`。没设 secret 时 eval job **会干净地 skip**（灰色 ⏭️），不会出现误导的红叉。
 ### 本地执行
@@ -508,7 +535,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # 只报告，不 exit 1
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-本地默认 `--runs 3`（多数决可吸收 LLM 变异性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），没有按 token 计费的成本。CI 用 `--runs 1` 并需要 `ANTHROPIC_API_KEY` secret。
+本地默认 `--runs 3`（多数决可吸收 LLM 变异性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），没有按 token 计费的成本。CI 用 `--runs 1`，靠同一个订阅通过 `CLAUDE_CODE_OAUTH_TOKEN` secret 认证（用 `claude setup-token` 一次性生成）。
 ### Severity 与计分

package/README.zh-TW.md CHANGED

@@ -478,6 +478,33 @@ Claude Code 會自動：
 **5 個 i18n 語系同步**(zh-TW、zh-CN、ja、es、ko),保留既有翻譯——結構性瘦身在各語系等比例套用。
+### Iteration 7:Eval Harness 韌性強化(Sprint 1 + 2A,v1.2.9)
+Harness 層的迭代,不是 skill 層。Skill 語意沒變,變的是**被測量的表面**。目標:解除 4 個一直在悄悄產出 0/0 verdict 的 eval,讓真實品質基線浮出水面。
+**Sprint 1 — 解鎖無法測量的群集(`d2023fb`、`cee67cb`):**
+4 個 eval(`eval-jtbd-depth`、`eval-prfaq-output`、`eval-subagent-discovery`、`eval-subagent-premortem`)每次都產出 0 pass / 0 fail,在彙總分數中與「沒問題」無法區分。三個原因:
+1. **CI headless 模式缺少 sub-agent** — CI 把 skill 裝到 `~/.claude/skills/`,卻沒把 `agents/*.md` 複製到 `~/.claude/agents/`。`claude -p` 因此無法透過 `Task` 派發,orchestrator 只能默默 inline 執行。
+2. **Specialist-dispatch hook 在 `claude -p` 不會載入** — plugin 層的 `hooks/` 在 headless 模式不會載入,只有 user 層 `~/.claude/settings.json` 的 UserPromptSubmit hook 會。CI 現在會在每次 behavioral run 之前以程式碼方式把 dispatch hook 註冊到 user 層。
+3. **Response + judge timeout 太緊** — 180s response / 120s judge 會把長篇 Discovery、Pre-mortem 輸出中途切斷,judge 看到截斷字串就吐出 0/0。提升到 600s / 240s,且非 JSON 輸出時重試一次。
+同時也從 evals 10/11/12 刪掉「orchestrator 必須透過 Task 派發」這類程序性 expectation——在 `claude -p` 沒有 nested Task 介面,無法驗證,也不是我們最終在意的性質。留下的 expectation 都針對 specialist 應產出的**輸出品質**。
+**Sprint 2A — judge 韌性 + CI 上限(`f973939`):**
+PR #9 review 之後的兩個跟進修正:
+1. **Judge 修復重試保留原始 context** — `claude -p` 是無狀態的,所以修復 prompt 現在會重新帶入完整原始 `judge_prompt`(response + expectations)加上前一次的 malformed output。新的 `_judge_output_complete()` 檢查會拒絕「沒有完整 N 個 indexed expectation」的回應,避免 model 在第一次輸出無法救援時憑空捏造一份看起來合理的 verdict。
+2. **CI `behavioral-eval` job timeout 90 → 120 分鐘** — 最壞情況 = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 分鐘,先前 90 分鐘上限可能默默 cancel 整輪 run。120 分鐘給 setup + artifact upload 留 ~10 分鐘餘裕。
+**新可見的基線**(本機 run,2026-05-28):**0 / 100** `at-risk`、**13 / 33** expectation 通過、**6 critical + 14 warning** 失敗。彙總分數並沒有退步,退的是**可見**分數——四個原本貢獻 0/0 的 eval 現在開始產出真實 signal。這 6 個 critical 失敗就是 Stage 2 明確的待修清單:三層 JTBD(functional / emotional / social)、B2B 組織層 Jobs、B2B buyer vs user persona 分離、Discovery scope 守備、pre-mortem leading-indicator 紀律。逐項細節見 [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md)。
+**Harness 改進住在 `evals/` 與 `.github/workflows/`,不會發到 npm。** 版本不需要再往 v1.2.9 之上 bump(v1.2.9 已經包含 user-level hook 與 evals 10/11/12 的 scope 調整)。
+**5 個 i18n 語系同步**(zh-TW、zh-CN、ja、es、ko)。
 ---
 ## 🧪 開發與評測
@@ -486,7 +513,7 @@ Claude Code 會自動：
 **本地（免費，推薦）**：用 `claude` CLI 搭配你的 Claude Pro/Max 訂閱（先 `claude login` 一次）跑這些 script。不需要 API key、沒有額外成本。整套 eval 系統就是設計來在每次發版前本地跑一遍。
-**CI（選用，付費）**：`.github/workflows/eval-gate.yml` 會在每個 PR 與每次 push 到 `main`（含 `package.json` 變動）時跑這兩套，把分數寫進 workflow 的 Job Summary。**不擋 merge、不擋 publish** — 看到結果後由維護者決定要不要調整。CI 需要 `ANTHROPIC_API_KEY` secret（GitHub Actions 在 headless 容器無法走 OAuth）；沒設 secret 時 eval job **會乾淨地 skip**（灰色 ⏭️），不會出現誤導的紅叉。
+**CI（選用，不額外計費）**：`.github/workflows/eval-gate.yml` 會在每個 PR 與每次 push 到 `main`（含 `package.json` 變動）時跑這兩套，把分數寫進 workflow 的 Job Summary。**不擋 merge、不擋 publish** — 看到結果後由維護者決定要不要調整。CI 同樣走你的 Claude Pro/Max 訂閱（不需 API key、沒有按 token 計費的成本）：一次性設定為本機 `claude setup-token` 產生長期 token，把它加進 repo secret `CLAUDE_CODE_OAUTH_TOKEN`。沒設 secret 時 eval job **會乾淨地 skip**（灰色 ⏭️），不會出現誤導的紅叉。
 ### 本地執行
@@ -507,7 +534,7 @@ python3 evals/run_behavioral_eval.py --fail-on none   # 只報告，不 exit 1
 python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
 ```
-本地預設 `--runs 3`（多數決可吸收 LLM 變異性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），沒有按 token 計費的成本。CI 用 `--runs 1` 並需要 `ANTHROPIC_API_KEY` secret。
+本地預設 `--runs 3`（多數決可吸收 LLM 變異性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），沒有按 token 計費的成本。CI 用 `--runs 1`，靠同一個訂閱透過 `CLAUDE_CODE_OAUTH_TOKEN` secret 認證（用 `claude setup-token` 一次性產生）。
 ### Severity 與計分

package/SKILL.md CHANGED

@@ -212,6 +212,20 @@ Task(
 **Genuine false-positive exception**: if the prompt has no real connection to a specialist's scope (e.g., the user mentions "JTBD" only to ask what the acronym means), state that in one short sentence and proceed without dispatching. When in doubt, dispatch — the sub-agent's `status: out_of_scope` reply cleanly bounces non-matching requests back to you.
+### Reference fallback when Task dispatch is unavailable
+Some environments cannot dispatch sub-agents (notably `claude -p` headless runs, some MCP harnesses, and certain CI eval contexts). In those environments the `Task` tool is absent or inert, so the dispatch above will silently inline-collapse. To prevent content collapse, **before producing inline output for any matched trigger row, you MUST read the corresponding reference files and treat their Hard Gates as your own**:
+| Specialist (if dispatch fails / unavailable) | Reference files to read FIRST, then satisfy Hard Gates inline |
+|---|---|
+| `discovery-specialist` | `references/02a-persona.md` (Persona structure + B2B Buyer/User Hard Gate + B2B Prioritization vocabulary) AND `references/02b-jtbd.md` (3-layer JTBD + B2B Org-Level Jobs Hard Gates) AND `references/rules-quality-review.md` (✅/❌ marker format + ≥1 ❌ Hard Gate). Add `references/02c-ost-journey.md` if the request includes OST or Journey Map. |
+| `strategy-critic` | `references/01-strategy.md` (Rumelt diagnosis + three-questions critique format) AND `references/rules-quality-review.md` |
+| `pre-mortem-runner` | `references/04-develop.md` (Pre-mortem section — 15+ scenarios across 5 categories + leading-indicator format) AND `references/rules-quality-review.md` |
+**Quality self-review is always required.** Whenever the user prompt asks for a quality self-review, checklist, or step-end critique — or whenever you are about to emit step-end output of any kind — you MUST have read `references/rules-quality-review.md` and follow its exact `✅`/`❌` marker format with at least one `❌` on a substantive content gap. This is non-negotiable regardless of whether dispatch was attempted or whether the fallback path was used.
+This is **not** a license to skip dispatch when it IS available. The order is: (1) attempt dispatch; (2) if the Task tool is unavailable or the call cannot complete, read the listed references and produce specialist-grade output inline; (3) cite that you used the inline fallback in one short note at the end ("Inline fallback used — Task dispatch unavailable in this environment."). The references above embed the same Hard Gates the specialist would have enforced, so following them faithfully closes the quality gap.
 Full per-trigger invocation templates: `references/rules-subagent-dispatch.md`. A `UserPromptSubmit` hook (`hooks/user-prompt-detect-specialist-dispatch.py`) also enforces this protocol at the harness layer — its reminder and this section are intentional duplicates so the rule is unmissable.
 ---
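
The dispatch-then-fallback order described in the SKILL.md hunk above can be sketched in Python. This is a minimal illustration, not code from the package: `references_for`, `run_specialist`, and the `dispatch` / `inline` callables are hypothetical names, and the convention that `dispatch` returns `None` when the Task tool is unavailable is an assumption. Only the specialist names and reference paths come from the fallback table.

```python
from typing import Callable, Optional

# Reference files per specialist, copied from the fallback table in the diff.
FALLBACK_REFERENCES = {
    "discovery-specialist": [
        "references/02a-persona.md",
        "references/02b-jtbd.md",
        "references/rules-quality-review.md",
    ],
    "strategy-critic": [
        "references/01-strategy.md",
        "references/rules-quality-review.md",
    ],
    "pre-mortem-runner": [
        "references/04-develop.md",
        "references/rules-quality-review.md",
    ],
}


def references_for(specialist: str, includes_ost_or_journey: bool = False) -> list[str]:
    """Return the files to read FIRST when producing inline fallback output."""
    refs = list(FALLBACK_REFERENCES[specialist])
    # The table adds one more file for discovery requests with OST or Journey Map.
    if specialist == "discovery-specialist" and includes_ost_or_journey:
        refs.append("references/02c-ost-journey.md")
    return refs


def run_specialist(
    specialist: str,
    dispatch: Callable[[str], Optional[str]],   # hypothetical: None means "Task unavailable"
    inline: Callable[[str, list[str]], str],    # hypothetical: produces output from the refs
    includes_ost_or_journey: bool = False,
) -> tuple[str, bool]:
    """Step 1: attempt dispatch. Step 2: fall back inline. Returns (output, used_fallback)."""
    result = dispatch(specialist)
    if result is not None:
        return result, False
    refs = references_for(specialist, includes_ost_or_journey)
    return inline(specialist, refs), True
```

When `used_fallback` is true, the caller would append the short one-line fallback note the section asks for (step 3). Dispatch is always tried first, which matches the statement that the fallback is not a license to skip dispatch.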

package/i18n/es/references/02a-persona.md CHANGED

@@ -1,5 +1,20 @@
 # Etapa 1: Descubrimiento — Construyendo Personas
+### 🚫 Alcance de Salida de Descubrimiento (Hard Gate)
+Cuando el orchestrator recibe el pedido de ejecutar trabajo de Descubrimiento (Persona, JTBD, OST, Journey Map, Descubrimiento Continuo), la salida debe **mantenerse dentro del alcance de Descubrimiento**. Descubrimiento responde "quién es el usuario" y "qué necesidad insatisfecha intenta cubrir" — nada más. Los siguientes artefactos de etapas downstream **NO deben aparecer** en un entregable de Descubrimiento, incluso si se siente natural mencionarlos:
+- **Artefactos de etapa Define**: declaración de positioning, preguntas HMW (How Might We), matrices de pain points que doblan como prompts de solución
+- **Artefactos de etapa Develop**: borradores de PR-FAQ, escenarios de pre-mortem, tablas RICE, definición de scope de MVP, secciones de PRD, listas de features
+- **Artefactos de etapa Deliver**: definición de métrica North Star, criterios de PMF, plan GTM, bloques de business-model canvas, tablas de product spec
+- **Artefactos de etapa Strategy**: Strategy Blocks, diagnosis / guiding-policy / coherent-action de Rumelt, descomposición de DHM Model, escalas OKR
+Si los hallazgos de Descubrimiento sugieren fuertemente un artefacto downstream (ej. el análisis JTBD revela un ángulo de positioning claro), regístralo como una **open question o next-step pointer de una línea** al final — pero **NO produzcas el artefacto en sí**. La siguiente etapa tiene su propio step dedicado.
+Ejemplo no aceptable: terminar un análisis JTBD con una tabla RICE poblada, una lista de scope de MVP, o un párrafo "Recommended Positioning" — incluso si todas las otras sub-secciones de Descubrimiento están correctas, esta salida FALLA este Hard Gate.
+---
 ## Hábitos de Descubrimiento Continuo (Teresa Torres)
 Construye un hábito clave: **Habla con al menos un usuario objetivo cada semana.** El descubrimiento no es un ritual único — es un sistema continuo.
@@ -10,6 +25,17 @@ Construye un hábito clave: **Habla con al menos un usuario objetivo cada semana
 Las Personas no se segmentan por edad y género, sino por **propósito / tarea / motivación** para distinguir diferentes tipos de usuarios.
+### 🏢 Hard Gate B2B — Persona Buyer ≠ Persona User
+Para cualquier producto B2B (o B2B2C), el **Buyer** (firma el contrato, controla el presupuesto, asume riesgo de vendor) y el **User diario** (toca el producto todos los días) son casi siempre roles distintos con **objetivos, pain points y criterios de decisión diferentes**. Tratarlos como una sola Persona colapsa dos Jobs distintos en un arquetipo borroso y el análisis resultante no puede impulsar decisiones de producto.
+Regla del Hard Gate:
+- Producir **dos bloques de Persona separados** etiquetados `Buyer` y `User` cuando el producto es B2B y los dos roles son distintos (suposición por defecto en B2B).
+- Si son la misma persona (raro; usualmente herramientas fundador-led o B2B de un solo dueño), explica en una oración por qué el buyer también es el user diario en este escenario específico.
+- Cross-link entre las dos Personas: nota dónde el criterio de evaluación del Buyer depende de lo que el User realmente hace a diario (ej. "el criterio de audit-readiness del Buyer depende de que el User complete el formulario el mismo día y no en lote").
+Ejemplo no aceptable: producir una sola Persona ("HR Manager") que fusiona aprobar presupuesto Y completar formularios diarios — dos Jobs distintos forzados en un arquetipo borroso. Esa salida FALLA este Hard Gate.
 ```
 | Campo | Persona 1: [Apodo] | Persona 2: [Apodo] | Persona 3: [Apodo] |
 |---|---|---|---|
@@ -24,6 +50,22 @@ Las Personas no se segmentan por edad y género, sino por **propósito / tarea /
 Explica la lógica de segmentación; verifica MECE (mutuamente excluyente, colectivamente exhaustivo); identifica el TA primario y secundario.
+### 🎯 Reasoning de Priorización de Persona (Hard Gate)
+Decir solo "identificar TA primario" sin un reasoning explícito falla este Hard Gate. La declaración de priorización debe nombrar una Persona como primaria Y explicar por qué **en términos específicos a la dinámica go-to-market del producto** — no claims genéricos de "frecuencia de uso".
+Para **productos B2B con múltiples user personas**, el reasoning DEBE referenciar **al menos una** de estas dinámicas específicas B2B por nombre (usando estos términos o equivalentes claramente análogos):
+- **Champion vs Buyer** — quién aboga internamente por la adopción versus quién firma el contrato; la adopción champion-led suele ganar la priorización B2B incluso cuando el buyer es la persona "más senior"
+- **Adoption multiplier** — quién, al adoptar, desbloquea la adopción para el resto de la org (ej. el uso diario del HR Specialist siembra el system-of-record del que otras personas dependen después)
+- **Switching-trigger ownership** — qué persona siente el dolor que justifica cambiar de la herramienta incumbente; quien posee el switching trigger es el candidato a priorización incluso si no es el usuario más pesado
+- **Budget authority** — quién controla la línea de presupuesto; relevante cuando buyer ≠ user y los criterios del buyer dominan la decisión inicial del deal
+- **Audit / compliance pressure ownership** — el rol de quién está en juego cuando aparecen hallazgos de auditoría; las personas presionadas por compliance suelen dominar la priorización en segmentos B2B regulados
+Un reasoning puro de "Persona X la usa más" o "Persona Y tiene más usuarios" FALLA este Hard Gate para productos B2B. La frecuencia es necesaria pero nunca suficiente — el switching B2B es impulsado por presión organizacional, no por tasas de uso individual.
+Para **productos B2C**, el reasoning debe referenciar al menos uno de: switching-trigger ownership, diferencial de severidad JTBD, network-effect seeding, o diferencial de willingness-to-pay. El reasoning puramente por frecuencia también falla para B2C.
 ### 📝 Lista de Verificación de Calidad de Persona
 - ✅ ¿La segmentación está basada en "propósito/tarea/motivación" en lugar de datos demográficos?
 - ✅ ¿Las Personas son MECE (mutuamente excluyentes y colectivamente exhaustivas del mercado objetivo)?

package/i18n/es/references/02b-jtbd.md CHANGED

@@ -4,6 +4,10 @@
 > "La unidad de análisis no es el consumidor, sino el trabajo que el consumidor está tratando de realizar." — Clayton Christensen
+**Cobertura JTBD de Tres Capas (Hard Gate — las tres capas requeridas):**
+Cada análisis JTBD DEBE hacer aflorar **las tres capas explícitamente**: **Funcional** (la tarea que se está completando), **Emocional** (cómo el usuario quiere sentirse durante/después), y **Social** (cómo el usuario quiere ser percibido). Producir solo la capa Funcional es el fallo más común en JTBD: los Jobs Emocionales y Sociales suelen ser los verdaderos disparadores de switching, especialmente en B2B. Si una Persona dada genuinamente no tiene un Job Emocional o Social significativo para el producto, dilo explícitamente con una oración de reasoning en lugar de omitir silenciosamente la fila.
 **Forma Canónica JTBD (Hard Gate — se requiere estructura de tres cláusulas):**
 Cada declaración JTBD (Primary, Funcional, Emocional, Social — cada capa) DEBE escribirse como una oración completa de tres cláusulas en la forma canónica. Las tres cláusulas son obligatorias:
@@ -59,7 +63,7 @@ Claude debe autoevaluar después de producir el output JTBD (cada ítem debe mar
 - [ ] ¿Se enfoca en un solo trabajo central? (No tres trabajos metidos en una sola oración)
 - [ ] ¿Puede usarse para evaluar "¿Esta solución realmente aborda este trabajo?"
 - [ ] ¿Incluye "soluciones alternativas actuales" y "brecha"? (Brecha = oportunidad)
-- [ ] ¿La P5 de la Profundización alcanza motivación emocional / identidad profesional / miedo psicológico? (No solo descripciones funcionales)
+- [ ] ¿La P5 de la Profundización **usa explícitamente al menos una palabra del vocabulario canónico** (`fear`, `anxiety`, `shame`, `worry`, `dread`, `self-doubt`, `sense of loss`, `threat to identity`, `embarrassment`, `guilt`)? Paráfrasis consecuenciales como "credibilidad en riesgo" o "reputación dañada" FALLAN este ítem — son consecuencias, no la emoción sentida.
 **Reglas de Ejecución (Hard Gate):**
 - Debe marcar cada ítem ✅ o ❌ — listas [ ] en blanco o ✅ sin explicación no están permitidas
@@ -68,11 +72,13 @@ Claude debe autoevaluar después de producir el output JTBD (cada ítem debe mar
 ---
-### 🏢 Requisitos de Profundización para Productos B2B
+### 🏢 Requisitos de Profundización para Productos B2B (Hard Gate)
-**Productos B2B (incluyendo B2B2C) deben completar el siguiente análisis:**
+**Hard Gate: para cualquier producto B2B (o B2B2C), los siguientes tres sub-análisis son TODOS obligatorios. Saltarse cualquiera es una contract failure, sin importar si el usuario lo pidió explícitamente.** Si el tipo de producto es ambiguo, haz una pregunta de clarificación; no asumas silenciosamente B2C.
-#### Análisis de Jobs a Nivel Organizacional (Obligatorio — cubrir al menos 2 niveles)
+#### Análisis de Jobs a Nivel Organizacional (Hard Gate — cubrir al menos 2 niveles)
+Un análisis JTBD B2B que se queda puramente al nivel de usuario individual FALLA este gate. Los Jobs a nivel organizacional (auditoría de cumplimiento, flujos de aprobación cross-departamentales, control de costos, alineación de políticas de headcount, integridad de pista de auditoría) son necesidades que existen más allá de la tarea diaria de cualquier usuario individual y rutinariamente dominan las decisiones de switching B2B. La tabla de abajo DEBE producirse y al menos 2 de los 3 niveles DEBEN contener Jobs específicos de B2B (no enunciados genéricos de productividad).
 | Nivel | Descripción | Ejemplos |
 |-------|-------------|----------|
@@ -80,20 +86,33 @@ Claude debe autoevaluar después de producir el output JTBD (cada ítem debe mar
 | **Job Operacional** | Necesidades de coordinación a nivel proceso/gerente de departamento | Gestión de flujo de aprobaciones, sincronización de información entre equipos |
 | **Job de Tarea** | Necesidades operativas diarias de usuarios individuales | Llenar formularios, verificar estados, exportar reportes |
-#### Análisis Comprador vs. Usuario (Obligatorio)
+#### Análisis Comprador (Buyer) vs. Usuario (User) (Hard Gate)
+El comprador de un producto B2B (firma el contrato, controla el presupuesto) y el usuario diario (toca el producto todos los días) son casi siempre dos roles, **correspondientes a Jobs diferentes**. Tratarlos como una sola Persona es el fallo más común en Descubrimiento B2B. Regla del Hard Gate:
+- Si buyer ≠ user (suposición por defecto en B2B), producir **dos bloques separados de Persona+JTBD**: uno para el Buyer (justificación de ROI, reducción de riesgo, compliance, consolidación de vendors, audit-readiness), y uno para el User (eficiencia, reducción de errores, contexto de uso diario). Y cross-link: nota dónde el Job del Buyer depende del Job del User (ej. "el Job de compliance del Buyer depende de que el User realmente complete el reporte cada ciclo, no de que lo haga en batch al final del mes").
+- Si buyer = user (raro; usualmente herramientas fundador-led), explica en una oración por qué en este escenario específico el tomador de decisiones también es el usuario diario; no lo asumas silenciosamente.
+- Ejemplo no aceptable: producir una sola Persona ("HR Manager") que fusiona la autoridad de aprobar presupuesto Y el llenado diario de formularios. Eso colapsa dos Jobs distintos en un rol borroso y el análisis no puede impulsar decisiones de producto.
+#### Cinco Preguntas de Profundización — Versión Mejorada B2B (Hard Gate)
+**Hard Gate — la P5 DEBE usar explícitamente al menos una palabra del siguiente vocabulario canónico**: `fear` (miedo), `anxiety` (ansiedad), `shame` (vergüenza), `worry` (preocupación), `dread` (pavor), `self-doubt` (auto-duda), `sense of loss` (sensación de pérdida), `threat to identity` (amenaza a la identidad), `embarrassment` (bochorno), `guilt` (culpa). Paráfrasis funcionales/consecuenciales como "credibilidad en riesgo", "reputación dañada", "métrica cae", "usuarios churnean", "pierde confianza", "impacto en la carrera" FALLAN este gate aunque describan stakes B2B reales — son *consecuencias*, no la *emoción sentida* de la cual el persona quiere *alejarse*.
+Una P5 que describe la motivación más profunda del persona solo en lenguaje funcional FALLA el contrato de Descubrimiento: el propósito entero de la P5 es hacer aflorar el *miedo/ansiedad sentido* que impulsa el switching — los resultados puramente funcionales pueden resolverse con herramientas incrementalmente mejores, pero es el fear/anxiety sentido lo que hace que un buyer B2B supere la inercia organizacional y firme un nuevo contrato.
-Si el comprador y el usuario son personas diferentes, analiza sus JTBD por separado:
-- **Job del Comprador**: Jobs que influyen la decisión de compra (justificación de ROI, reducción de riesgos, requisitos de cumplimiento)
-- **Job del Usuario**: Jobs que necesitan realizarse durante operaciones diarias (mejoras de eficiencia, reducción de errores)
-- Si son la misma persona, explica "por qué el tomador de decisiones es también el usuario en este escenario"
+**Ejemplos válidos** (cada uno contiene una palabra del vocabulario canónico):
+- ✅ Identidad profesional: "Ella **teme (fears)** verse incompetente frente al liderazgo cuando este reporte representa la credibilidad de su departamento"
+- ✅ Motivación emocional: "Él carga una **ansiedad (anxiety)** silenciosa de que sus reportes directos descubran que en realidad no tiene un control firme de los números"
+- ✅ Miedo psicológico: "Su mayor **pavor (dread)** es que el auditor encuentre una brecha en el proceso — ya le llamaron la atención una vez, y la **vergüenza (shame)** de un segundo incidente marcaría su expediente para siempre"
+- ✅ Amenaza a la identidad: "Él siente una **amenaza a la identidad (threat to identity)** cuando consultores externos explican mejor que él las métricas de su propio equipo frente a la junta"
-#### Cinco Preguntas de Profundización — Versión Mejorada B2B
+**Ejemplos no válidos** (funcional/consecuencial, sin vocabulario canónico):
+- ❌ "Necesita una mejor herramienta para mejorar la eficiencia" (funcional)
+- ❌ "Su credibilidad con el liderazgo está en riesgo" (consecuencia, no emoción sentida)
+- ❌ "Podría perder su trabajo si este reporte está mal" (resultado, no emoción — ¿qué *siente* sobre esa posibilidad?)
+- ❌ "Su reputación en la organización se vería afectada" (consecuencia — reemplazar con `embarrassment`, `shame`, o `dread`)
-**La P5 debe alcanzar al menos uno de los siguientes niveles** (ejemplos):
-- ✅ Identidad profesional: "Tiene miedo de verse incompetente frente al liderazgo, porque este reporte representa la credibilidad de su departamento"
-- ✅ Motivación emocional: "Quiere demostrar a sus reportes directos que tiene un control firme de los números"
-- ✅ Miedo psicológico: "Su mayor miedo es que el auditor encuentre una brecha en el proceso — ya le llamaron la atención una vez"
-- ❌ Ejemplo fallido: "Necesita una mejor herramienta para mejorar la eficiencia" (se queda a nivel funcional)
+Si la motivación más profunda del persona genuinamente no mapea a ninguna palabra del vocabulario canónico después de un análisis honesto, marca el ítem P5 de la Checklist de Calidad JTBD como `❌` con la explicación "La P5 actualmente vive a nivel de consecuencia; se necesita una pregunta más de entrevista que sondee la emoción sentida". No parafrasees la lista de vocabulario para hacer aparecer una marca de verificación.
 #### Análisis de Alternativas Competitivas (Obligatorio)
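
The P5 (fifth deep-dive question) vocabulary Hard Gate in the hunk above is specific enough to pre-check mechanically. The sketch below is an illustration only, not part of the package: `canonical_terms` and `passes_p5_gate` are hypothetical names. Matching on a leading word boundary, so that inflected forms such as "fears" count, is an assumption drawn from the diff's own valid example "teme (fears)".

```python
import re

# Canonical vocabulary copied from the P5 Hard Gate in the diff.
CANONICAL_VOCAB = (
    "fear", "anxiety", "shame", "worry", "dread", "self-doubt",
    "sense of loss", "threat to identity", "embarrassment", "guilt",
)

# Leading word boundary only (assumption), so "fears" still matches "fear".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in CANONICAL_VOCAB) + r")",
    re.IGNORECASE,
)


def canonical_terms(p5_statement: str) -> list[str]:
    """Return the canonical terms found in a P5 statement, lowercased, in order."""
    return [match.group(1).lower() for match in _PATTERN.finditer(p5_statement)]


def passes_p5_gate(p5_statement: str) -> bool:
    """True when the statement uses at least one canonical vocabulary word."""
    return bool(canonical_terms(p5_statement))
```

A check like this only catches the missing-vocabulary failure. It cannot tell whether the word names the persona's actual felt emotion, so the honest `❌` self-review the gate describes still has to be done by reading the statement.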

package/i18n/ja/references/02a-persona.md CHANGED

@@ -1,5 +1,20 @@
 # ステージ1：ディスカバリー — ペルソナ構築
+### 🚫 ディスカバリー出力スコープ(Hard Gate)
+orchestrator がディスカバリー作業(ペルソナ、JTBD、OST、Journey Map、Continuous Discovery)を実行するよう要求されたとき、出力は**ディスカバリースコープ内に留まる**必要があります。ディスカバリーは「ユーザーは誰か」「彼らが満たそうとしている未充足ニーズは何か」に答える — それ以外は対象外です。次の下流ステージの成果物は、たとえ自然に思えても**ディスカバリー成果物に含めてはなりません**:
+- **Define ステージ成果物**:ポジショニングステートメント、HMW(How Might We)質問、解法プロンプトを兼ねるペインポイントマトリクス
+- **Develop ステージ成果物**:PR-FAQ ドラフト、pre-mortem シナリオ、RICE テーブル、MVP スコープ定義、PRD セクション、機能リスト
+- **Deliver ステージ成果物**:North Star メトリック定義、PMF 基準、GTM 計画、ビジネスモデル要素、製品仕様表
+- **Strategy ステージ成果物**:Strategy Blocks、Rumelt の diagnosis / guiding-policy / coherent-action、DHM Model 分解、OKR ハイアラキー
+ディスカバリーの発見が下流の成果物を強く示唆する場合(例:JTBD 分析から明確なポジショニング角度が浮上)、文末に**一行の open question または next-step pointer** として記録 — 自分でその成果物を生成しないでください。次のステージにはそれ専用のステップがあります。
+不合格例:JTBD 分析の末尾に埋められた RICE テーブル、MVP スコープリスト、または「Recommended Positioning」段落を付ける — 他のディスカバリーサブセクションが正しくても、この出力は Hard Gate に FAIL します。
+---
 ## Continuous Discovery Habits（Teresa Torres）
 1つの重要な習慣を構築してください：**毎週少なくとも1人のターゲットユーザーと話す。** ディスカバリーは一回限りの儀式ではなく、継続的なシステムです。
@@ -10,6 +25,17 @@
 ペルソナは年齢や性別で分類するのではなく、**目的 / タスク / モチベーション**で異なるタイプのユーザーを区別します。
+### 🏢 B2B Hard Gate — Buyer ペルソナ ≠ User ペルソナ
+すべての B2B(または B2B2C)製品において、**Buyer**(契約締結、予算管理、ベンダーリスク保有)と**日常 User**(毎日製品に触れる)はほぼ常に**目標、ペインポイント、意思決定基準が異なる**別々の役割です。これらを 1 つのペルソナにまとめると、2 つの異なる Job を 1 つの曖昧なアーキタイプに押し込めることになり、分析結果は製品意思決定を駆動できません。
+Hard Gate ルール:
+- B2B では **`Buyer` と `User` というラベルの 2 つの別個のペルソナブロック**を生成するのがデフォルト(2 つの役割が明確に異なるとき = B2B のデフォルト想定)。
+- 同一人物の場合(例外的 — 通常は創業者主導ツールまたは個人事業主 B2B)、「なぜこのシナリオで Buyer が日常 User でもあるのか」を 1 文で明示してください。
+- 2 つのペルソナ間でクロスリンク:Buyer の評価基準が User の日常行動にどこで依存するかをメモ(例:「Buyer の監査準備基準は、User が当日にフォームを記入するかどうかに依存 — 後追いでまとめて記入するのではなく」)。
+不合格例:1 つのペルソナ(「HR マネージャー」)のみを生成し、「予算承認」と「毎日の休暇申請フォーム記入」という 2 つの異なる Job を 1 つの曖昧なアーキタイプに押し込める — この出力は Hard Gate に FAIL します。
 ```
 | フィールド | ペルソナ1：[ニックネーム] | ペルソナ2：[ニックネーム] | ペルソナ3：[ニックネーム] |
 |---|---|---|---|
@@ -24,6 +50,22 @@
 セグメンテーションロジックを説明し、MECE（相互排他的で網羅的）であるか確認し、プライマリTAとセカンダリTAを特定してください。
+### 🎯 ペルソナ優先順位付け reasoning(Hard Gate)
+「プライマリ TA を特定する」とだけ言って具体的な reasoning がないのは Hard Gate に不合格です。優先順位付けのステートメントは**1 つの**ペルソナをプライマリとして指定し、なぜそうなのかを**その製品の go-to-market ダイナミクスに固有の言語**で説明する必要があります — 一般的な「使用頻度が高い」というような理由ではなく。
+複数の user ペルソナを持つ **B2B 製品**では、reasoning は次の B2B 固有ダイナミクスのうち**少なくとも 1 つ**を名前で参照する必要があります(これらの用語または明らかに等価な概念を使用):
+- **Champion vs Buyer** — 組織内で誰が採用を擁護するか vs 誰が契約に署名するか;champion-led adoption は通常 B2B 優先順位付けで勝つ、buyer が「より上位」のペルソナであっても
+- **Adoption multiplier** — 誰の採用が組織全体の展開を解除するか(例:HR Specialist の毎日の使用は他のペルソナが後で依存する system-of-record の種を蒔く)
+- **Switching-trigger ownership** — どのペルソナが既存ツールからの切替えを正当化する痛みを感じているか;switching trigger を所有するペルソナは、最重使用者でなくても優先順位付け候補
+- **Budget authority** — 予算項目を誰が管理するか;buyer ≠ user の場合に関連、buyer の評価基準が初期取引決定を支配
+- **Audit / compliance pressure ownership** — 監査所見が発生した際に誰の役職がリスクにさらされるか;規制対象 B2B セグメントでは、コンプライアンスのプレッシャーを背負うペルソナが優先順位付けを支配することが多い
+純粋な「ペルソナ X は使用頻度が高い」または「ペルソナ Y はユーザー数が多い」という reasoning は B2B 製品にとってこの Hard Gate に FAIL します。頻度は必要条件であり、決して十分ではありません — B2B 切替えは組織レベルのプレッシャーによって駆動され、個別の使用率ではありません。
+**B2C 製品**の場合、reasoning は少なくとも 1 つを参照する必要があります:switching-trigger ownership、JTBD severity differential、network-effect seeding、または willingness-to-pay differential。純頻度 reasoning は B2C にも FAIL します。
 ### 📝 ペルソナ品質チェックリスト
 - ✅ セグメンテーションは「目的/タスク/モチベーション」に基づいているか？（デモグラフィックスではなく）
 - ✅ ペルソナはMECE（相互排他的でターゲット市場を網羅）か？