product-playbook 1.2.8 → 1.2.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (102)
  1. package/.claude-plugin/marketplace.json +1 -1
  2. package/README.es.md +29 -2
  3. package/README.ja.md +29 -2
  4. package/README.ko.md +29 -2
  5. package/README.md +29 -2
  6. package/README.zh-CN.md +29 -2
  7. package/README.zh-TW.md +29 -2
  8. package/SKILL.md +82 -10
  9. package/agents/strategy-critic.md +61 -2
  10. package/commands/product-feature.md +1 -1
  11. package/hooks/hooks.json +5 -0
  12. package/hooks/user-prompt-detect-specialist-dispatch.py +141 -0
  13. package/i18n/es/SKILL.md +1 -1
  14. package/i18n/es/references/02a-persona.md +42 -0
  15. package/i18n/es/references/02b-jtbd.md +15 -8
  16. package/i18n/ja/SKILL.md +1 -1
  17. package/i18n/ja/references/02a-persona.md +42 -0
  18. package/i18n/ja/references/02b-jtbd.md +15 -8
  19. package/i18n/ko/SKILL.md +1 -1
  20. package/i18n/ko/references/02a-persona.md +42 -0
  21. package/i18n/ko/references/02b-jtbd.md +15 -8
  22. package/i18n/zh-CN/SKILL.md +1 -1
  23. package/i18n/zh-CN/references/02a-persona.md +42 -0
  24. package/i18n/zh-CN/references/02b-jtbd.md +15 -8
  25. package/i18n/zh-TW/SKILL.md +1 -1
  26. package/i18n/zh-TW/references/02a-persona.md +42 -0
  27. package/i18n/zh-TW/references/02b-jtbd.md +15 -8
  28. package/install.sh +20 -24
  29. package/package.json +2 -1
  30. package/references/00-opportunity-check.md +29 -29
  31. package/references/01-strategy.md +56 -56
  32. package/references/02a-persona.md +79 -37
  33. package/references/02b-jtbd.md +116 -103
  34. package/references/02c-ost-journey.md +38 -38
  35. package/references/03-define.md +73 -73
  36. package/references/04a-prfaq.md +111 -69
  37. package/references/04b-solutions.md +134 -134
  38. package/references/04c-mvp.md +10 -10
  39. package/references/05a-northstar-aha.md +64 -64
  40. package/references/05b-pmf-gtm.md +70 -70
  41. package/references/05c-validation-spec.md +69 -69
  42. package/references/06-html-report.md +96 -91
  43. package/references/07a-handoff-core.md +93 -93
  44. package/references/07b-tasks-tickets.md +149 -149
  45. package/references/07c-architecture-setup.md +98 -98
  46. package/references/08-security-checklist.md +138 -138
  47. package/references/rules-build.md +84 -84
  48. package/references/rules-change-propagation.md +48 -48
  49. package/references/rules-commands.md +89 -89
  50. package/references/rules-context.md +6 -0
  51. package/references/rules-end-of-flow.md +112 -112
  52. package/references/rules-product-type.md +12 -12
  53. package/references/rules-progress.md +39 -39
  54. package/references/rules-quality-review.md +22 -5
  55. package/references/rules-quick.md +13 -13
  56. package/references/rules-revision.md +26 -0
  57. package/references/rules-subagent-dispatch.md +6 -1
  58. package/i18n/en/SKILL.md +0 -195
  59. package/i18n/en/commands/product-build.md +0 -13
  60. package/i18n/en/commands/product-dev.md +0 -24
  61. package/i18n/en/commands/product-feature.md +0 -15
  62. package/i18n/en/commands/product-full.md +0 -13
  63. package/i18n/en/commands/product-prd.md +0 -14
  64. package/i18n/en/commands/product-quick.md +0 -13
  65. package/i18n/en/commands/product-report.md +0 -12
  66. package/i18n/en/commands/product-revision.md +0 -13
  67. package/i18n/en/references/00-opportunity-check.md +0 -44
  68. package/i18n/en/references/01-strategy.md +0 -90
  69. package/i18n/en/references/02a-persona.md +0 -57
  70. package/i18n/en/references/02b-jtbd.md +0 -137
  71. package/i18n/en/references/02c-ost-journey.md +0 -65
  72. package/i18n/en/references/03-define.md +0 -118
  73. package/i18n/en/references/04a-prfaq.md +0 -112
  74. package/i18n/en/references/04b-solutions.md +0 -269
  75. package/i18n/en/references/04c-mvp.md +0 -21
  76. package/i18n/en/references/05a-northstar-aha.md +0 -93
  77. package/i18n/en/references/05b-pmf-gtm.md +0 -102
  78. package/i18n/en/references/05c-validation-spec.md +0 -117
  79. package/i18n/en/references/06-html-report.md +0 -128
  80. package/i18n/en/references/07a-handoff-core.md +0 -152
  81. package/i18n/en/references/07b-tasks-tickets.md +0 -215
  82. package/i18n/en/references/07c-architecture-setup.md +0 -199
  83. package/i18n/en/references/08-security-checklist.md +0 -221
  84. package/i18n/en/references/rules-build.md +0 -152
  85. package/i18n/en/references/rules-change-propagation.md +0 -74
  86. package/i18n/en/references/rules-commands.md +0 -98
  87. package/i18n/en/references/rules-context-template.md +0 -177
  88. package/i18n/en/references/rules-context.md +0 -123
  89. package/i18n/en/references/rules-custom.md +0 -77
  90. package/i18n/en/references/rules-document-tools.md +0 -126
  91. package/i18n/en/references/rules-end-of-flow.md +0 -150
  92. package/i18n/en/references/rules-export-document.md +0 -346
  93. package/i18n/en/references/rules-file-integration.md +0 -65
  94. package/i18n/en/references/rules-full.md +0 -102
  95. package/i18n/en/references/rules-import-document.md +0 -261
  96. package/i18n/en/references/rules-optional-trigger.md +0 -118
  97. package/i18n/en/references/rules-product-type.md +0 -14
  98. package/i18n/en/references/rules-progress.md +0 -60
  99. package/i18n/en/references/rules-quality-review.md +0 -48
  100. package/i18n/en/references/rules-quick.md +0 -29
  101. package/i18n/en/references/rules-revision.md +0 -95
  102. package/i18n/en/references/rules-subagent-dispatch.md +0 -61
package/.claude-plugin/marketplace.json CHANGED
@@ -7,7 +7,7 @@
7
7
  {
8
8
  "name": "product-playbook",
9
9
  "description": "MUST use when user wants to plan or strategize a product/feature. 22 PM frameworks, 6 modes, multi-language, from idea to dev handoff",
10
- "version": "1.2.5",
10
+ "version": "1.2.9",
11
11
  "source": "./."
12
12
  }
13
13
  ]
package/README.es.md CHANGED
@@ -479,6 +479,33 @@ Una iteración de reducción de tokens. Misma semántica del contenido del skill
479
479
 
480
480
  **Replicado a 5 locales i18n** (zh-TW, zh-CN, ja, es, ko) preservando las traducciones existentes — el adelgazamiento estructural se aplicó de manera idéntica por idioma.
481
481
 
482
+ ### Iteración 7: Resiliencia del Harness de Evals (Sprint 1 + 2A, v1.2.9)
483
+
484
+ Una iteración a nivel de harness, no a nivel de skill. La semántica del skill no cambió; lo que cambió es la *superficie que se mide*. Objetivo: hacer visible la línea base real de calidad desbloqueando 4 evals que venían produciendo veredictos 0/0 en silencio.
485
+
486
+ **Sprint 1 — desbloquear los clusters no medibles (`d2023fb`, `cee67cb`):**
487
+
488
+ Cuatro evals (`eval-jtbd-depth`, `eval-prfaq-output`, `eval-subagent-discovery`, `eval-subagent-premortem`) venían produciendo 0 pass / 0 fail por corrida — indistinguibles de "sin problemas" en el puntaje agregado. Tres causas:
489
+
490
+ 1. **Sub-agents faltantes en CI headless** — CI instalaba el skill en `~/.claude/skills/` pero nunca copiaba `agents/*.md` a `~/.claude/agents/`. `claude -p` por lo tanto no podía despachar vía `Task`, y el orchestrator corría inline en silencio.
491
+ 2. **El hook de specialist-dispatch silencioso bajo `claude -p`** — los `hooks/` a nivel de plugin no se cargan en modo headless; sólo se cargan los UserPromptSubmit hooks de `~/.claude/settings.json` a nivel de usuario. CI ahora registra programáticamente el dispatch hook a nivel de usuario antes de cada corrida behavioral.
492
+ 3. **Timeouts de response + judge demasiado agresivos** — 180s response / 120s judge cortaban en medio las salidas largas de Discovery y Pre-mortem; el judge entonces veía un string truncado y emitía 0/0. Subido a 600s / 240s con un reintento ante salidas no-JSON.
493
+
494
+ También se eliminaron las expectations procedurales tipo "el orchestrator delega vía Task tool" de los evals 10/11/12 — esas son inverificables en `claude -p` (sin superficie de Task anidado) y no son la propiedad que finalmente nos importa. Las expectations restantes apuntan a la *calidad de output* que el specialist habría producido.
495
+
496
+ **Sprint 2A — robustez del judge + techo de CI (`f973939`):**
497
+
498
+ Dos correcciones de seguimiento del code review del PR #9:
499
+
500
+ 1. **El reintento de repair del judge preserva el contexto original** — `claude -p` es stateless, así que el repair prompt ahora vuelve a incluir el `judge_prompt` original completo (response + expectations) más la salida malformed anterior. Una nueva verificación `_judge_output_complete()` rechaza payloads que no tengan exactamente N expectations indexadas, evitando que el modelo emita un veredicto plausible-pero-fabricado cuando la salida del primer call es irrecuperable.
501
+ 2. **Timeout del job `behavioral-eval` de CI 90 → 120 min** — el peor caso = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 min, así que el techo previo de 90 min podía cancelar en silencio una corrida válida. 120 min deja ~10 min de margen para setup + artifact upload.
502
+
503
+ **Línea base recién visible** (corrida local, 2026-05-28): **0 / 100** `at-risk`, **13 / 33** expectations pasando, **6 critical + 14 warning** failures. El puntaje agregado no regresó — lo que regresó es el puntaje *visible*, porque cuatro evals que antes contribuían 0/0 ahora producen señal real. Los 6 critical failures son el backlog explícito de Stage 2: JTBD de 3 capas (funcional / emocional / social), Jobs a nivel organizacional B2B, separación de persona buyer vs user en B2B, guardarraíles de scope de Discovery, y disciplina de leading-indicator en pre-mortem. Ver [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) para el desglose por expectation.
504
+
505
+ **Las mejoras de harness viven en `evals/` y `.github/workflows/` — no se publican a npm.** No hace falta version bump más allá de v1.2.9 (que ya cargó el hook a nivel de usuario y las ediciones de scope a los evals 10/11/12).
506
+
507
+ **Replicado a 5 locales i18n** (zh-TW, zh-CN, ja, es, ko).
508
+
482
509
  ---
483
510
 
484
511
  ## 🧪 Desarrollo y Evals
@@ -487,7 +514,7 @@ El directorio `evals/` incluye dos suites de pruebas complementarias y un scorer
487
514
 
488
515
  **Local (gratis, recomendado)**: ejecuta los mismos scripts con el CLI `claude` autenticado con tu suscripción Claude Pro/Max (un solo `claude login`). Sin API key, sin costo marginal. El sistema de eval está diseñado para correr localmente antes de cada release.
489
516
 
490
- **CI (opcional, pago)**: `.github/workflows/eval-gate.yml` ejecuta ambas suites en cada PR y en cada push a `main` que cambie `package.json`, y reporta el puntaje en el Job Summary del workflow. **No bloquea merge ni publish** — el maintainer decide si actuar ante regresiones. CI requiere el secret `ANTHROPIC_API_KEY` (GitHub Actions no puede usar OAuth en un contenedor headless); sin el secret, los jobs de eval **se omiten limpiamente** (gris ⏭️) en lugar de fallar en rojo.
517
+ **CI (opcional, sin costo adicional)**: `.github/workflows/eval-gate.yml` ejecuta ambas suites en cada PR y en cada push a `main` que cambie `package.json`, y reporta el puntaje en el Job Summary del workflow. **No bloquea merge ni publish** — el maintainer decide si actuar ante regresiones. CI también usa tu suscripción Claude Pro/Max (sin API key, sin costo por token): configuración única, ejecuta `claude setup-token` localmente y agregá el token impreso como secret `CLAUDE_CODE_OAUTH_TOKEN` del repo. Sin el secret, los jobs de eval **se omiten limpiamente** (gris ⏭️) en lugar de fallar en rojo.
491
518
 
492
519
  ### Ejecución local
493
520
 
@@ -508,7 +535,7 @@ python3 evals/run_behavioral_eval.py --fail-on none # solo reporta, sin exit 1
508
535
  python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
509
536
  ```
510
537
 
511
- Los runs locales usan `--runs 3` por defecto (mayoría absorbe la variabilidad del LLM). El CLI `claude` usa tu sesión OAuth de Claude Pro/Max (`claude login`), sin costo por token. CI usa `--runs 1` y requiere el secret `ANTHROPIC_API_KEY`.
538
+ Los runs locales usan `--runs 3` por defecto (mayoría absorbe la variabilidad del LLM). El CLI `claude` usa tu sesión OAuth de Claude Pro/Max (`claude login`), sin costo por token. CI usa `--runs 1` y la misma suscripción vía un secret `CLAUDE_CODE_OAUTH_TOKEN` (generado una sola vez con `claude setup-token`).
512
539
 
513
540
  ### Severity y scoring
514
541
 
package/README.ja.md CHANGED
@@ -480,6 +480,33 @@ token 削減イテレーション。スキル内容のセマンティクスは
480
480
 
481
481
  **5 つの i18n ロケール(zh-TW、zh-CN、ja、es、ko)にミラー** — 既存の翻訳を保持しつつ、構造的なスリム化を言語ごとに同一に適用。
482
482
 
483
+ ### イテレーション7:Eval Harness のレジリエンス強化(Sprint 1 + 2A、v1.2.9)
484
+
485
+ ハーネス層のイテレーションであって、スキル層ではない。スキルのセマンティクスは変わっていない。変わったのは**測定対象の表面積**。目標は、ずっと 0/0 verdict を黙って出していた 4 つの eval を解除し、真の品質ベースラインを浮かび上がらせること。
486
+
487
+ **Sprint 1 — 測定不能だったクラスタの解除(`d2023fb`、`cee67cb`):**
488
+
489
+ 4 つの eval(`eval-jtbd-depth`、`eval-prfaq-output`、`eval-subagent-discovery`、`eval-subagent-premortem`)は毎回 0 pass / 0 fail を返しており、集計スコアでは「問題なし」と区別がつかなかった。3 つの原因:
490
+
491
+ 1. **headless CI で sub-agent が欠落** — CI はスキルを `~/.claude/skills/` にインストールしていたが、`agents/*.md` を `~/.claude/agents/` にコピーしていなかった。そのため `claude -p` は `Task` 経由で dispatch できず、orchestrator が黙って inline で実行していた。
492
+ 2. **`claude -p` 下で specialist-dispatch hook が無音** — plugin レベルの `hooks/` は headless モードでは読み込まれず、user レベルの `~/.claude/settings.json` の UserPromptSubmit hook のみが読み込まれる。CI は各 behavioral run の前に dispatch hook を user レベルにプログラム的に登録するようになった。
493
+ 3. **Response + judge の timeout が短すぎた** — 180s response / 120s judge では長文の Discovery / Pre-mortem 出力が途中で切れ、judge は切り詰められた文字列を見て 0/0 を吐いていた。600s / 240s に引き上げ、非 JSON 出力時には 1 回リトライ。
494
+
495
+ また evals 10/11/12 から「orchestrator が Task ツール経由で dispatch する」という手続き的 expectation も削除した — `claude -p` には nested Task の表面がなく検証不能で、最終的に我々が気にする性質でもない。残りの expectation は specialist が産出すべき**アウトプット品質**を対象とする。
496
+
497
+ **Sprint 2A — judge のロバストネス + CI 上限(`f973939`):**
498
+
499
+ PR #9 のコードレビューからの 2 つのフォローアップ修正:
500
+
501
+ 1. **Judge 修復リトライがオリジナルの context を保持** — `claude -p` はステートレスなので、修復 prompt は完全な元の `judge_prompt`(response + expectations)と前回の malformed output を再投入するようになった。新しい `_judge_output_complete()` チェックは N 個の indexed expectation がぴったり揃っていないペイロードを拒否し、初回出力が回復不能なときに model が「形だけ整った捏造 verdict」を吐くのを防ぐ。
502
+ 2. **CI `behavioral-eval` ジョブの timeout 90 → 120 分** — 最悪ケース = 12 evals / 2 workers × (600s response + 240s judge + 240s repair)≈ 108 分なので、以前の 90 分上限は有効な run を黙って cancel する可能性があった。120 分は setup + artifact upload に ~10 分の余裕を残す。
503
+
504
+ **新たに可視化されたベースライン**(ローカル run、2026-05-28):**0 / 100** `at-risk`、**13 / 33** expectation がパス、**6 critical + 14 warning** の失敗。集計スコアは退行していない — 退行したのは**可視**スコアで、これまで 0/0 を貢献していた 4 つの eval が今は実シグナルを返すようになったため。この 6 つの critical 失敗が Stage 2 の明示的な backlog:3 層 JTBD(functional / emotional / social)、B2B 組織レベルの Jobs、B2B buyer vs user persona の分離、Discovery scope のガードレール、pre-mortem の leading-indicator 規律。expectation 単位の内訳は [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) を参照。
505
+
506
+ **ハーネス改善は `evals/` と `.github/workflows/` に住み、npm には出荷されない。** v1.2.9 を超えるバージョンバンプは不要(v1.2.9 にはすでに user-level hook と evals 10/11/12 の scope 調整が含まれている)。
507
+
508
+ **5 つの i18n ロケール(zh-TW、zh-CN、ja、es、ko)にミラー**。
509
+
483
510
  ---
484
511
 
485
512
  ## 🧪 開発と評価
@@ -488,7 +515,7 @@ token 削減イテレーション。スキル内容のセマンティクスは
488
515
 
489
516
  **ローカル(無料、推奨)**:`claude` CLI を Claude Pro/Max サブスクリプションで認証して(一度だけ `claude login`)同じスクリプトを実行できます。API key 不要、追加コストなし。eval システムは各リリース前にローカルで実行する設計です。
490
517
 
491
- **CI(オプション、有料)**:`.github/workflows/eval-gate.yml` はすべての PR と `package.json` を変更する `main` への push で両方を実行し、スコアを workflow の Job Summary に書き込みます。**merge も publish もブロックしません** — リグレッションに対応するかどうかはメンテナが判断します。CI には `ANTHROPIC_API_KEY` secret が必要です(GitHub Actions headless コンテナで OAuth が使えません)。secret 未設定時は eval job が**クリーンに skip**(グレー ⏭️)され、誤解を招く赤バツは出ません。
518
+ **CI(オプション、追加課金なし)**:`.github/workflows/eval-gate.yml` はすべての PR と `package.json` を変更する `main` への push で両方を実行し、スコアを workflow の Job Summary に書き込みます。**merge も publish もブロックしません** — リグレッションに対応するかどうかはメンテナが判断します。CI もあなたの Claude Pro/Max サブスクリプションを使用します(API key 不要、トークン課金なし):ローカルで `claude setup-token` を一度実行し、出力されたトークンを repo secret `CLAUDE_CODE_OAUTH_TOKEN` として追加してください。secret 未設定時は eval job が**クリーンに skip**(グレー ⏭️)され、誤解を招く赤バツは出ません。
492
519
 
493
520
  ### ローカル実行
494
521
 
@@ -509,7 +536,7 @@ python3 evals/run_behavioral_eval.py --fail-on none # レポートのみ、exi
509
536
  python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
510
537
  ```
511
538
 
512
- ローカルは `--runs 3` がデフォルト(多数決で LLM のばらつきを吸収)。`claude` CLI は Claude Pro/Max の OAuth セッション(`claude login`)を使うため、トークン課金はありません。CI は `--runs 1` で、`ANTHROPIC_API_KEY` secret が必要です。
539
+ ローカルは `--runs 3` がデフォルト(多数決で LLM のばらつきを吸収)。`claude` CLI は Claude Pro/Max の OAuth セッション(`claude login`)を使うため、トークン課金はありません。CI は `--runs 1` で、同じサブスクリプションを `CLAUDE_CODE_OAUTH_TOKEN` secret 経由で利用します(`claude setup-token` で一度だけ生成)。
513
540
 
514
541
  ### Severity とスコアリング
515
542
 
package/README.ko.md CHANGED
@@ -479,6 +479,33 @@ v1.2.0+ 에서 도입된 3개의 전문 sub-agent (`discovery-specialist`, `stra
479
479
 
480
480
  **5개 i18n 로케일 (zh-TW, zh-CN, ja, es, ko) 에 미러링** — 기존 번역을 보존하며, 구조적 슬림화는 언어별로 동일하게 적용.
481
481
 
482
+ ### 반복 7: Eval Harness 회복탄력성 강화 (Sprint 1 + 2A, v1.2.9)
483
+
484
+ 스킬 레벨이 아니라 harness 레벨의 반복. 스킬의 의미는 바뀌지 않았고, 바뀐 것은 **측정되는 표면**. 목표는 0/0 verdict 만 조용히 내고 있던 4개 eval 의 차단을 풀어, 진짜 품질 베이스라인을 드러내는 것.
485
+
486
+ **Sprint 1 — 측정 불능 클러스터 해제 (`d2023fb`, `cee67cb`):**
487
+
488
+ 4개 eval (`eval-jtbd-depth`, `eval-prfaq-output`, `eval-subagent-discovery`, `eval-subagent-premortem`) 이 매번 0 pass / 0 fail 을 내고 있어, 집계 점수에서 「문제 없음」과 구분이 불가능했음. 세 가지 원인:
489
+
490
+ 1. **headless CI 에서 sub-agent 누락** — CI 가 스킬을 `~/.claude/skills/` 에 설치하면서 `agents/*.md` 를 `~/.claude/agents/` 에 복사하지 않았음. 그래서 `claude -p` 는 `Task` 로 dispatch 할 수 없었고, orchestrator 가 조용히 inline 으로 실행했음.
491
+ 2. **`claude -p` 에서 specialist-dispatch hook 이 무음** — 플러그인 레벨 `hooks/` 는 headless 모드에서 로드되지 않으며, user 레벨 `~/.claude/settings.json` 의 UserPromptSubmit hook 만 로드됨. CI 는 이제 각 behavioral run 전에 dispatch hook 을 프로그램적으로 user 레벨에 등록함.
492
+ 3. **Response + judge timeout 이 너무 빡빡함** — 180s response / 120s judge 가 장문의 Discovery / Pre-mortem 출력을 중간에 잘랐고, judge 는 잘린 문자열을 보고 0/0 을 뱉었음. 600s / 240s 로 올리고, 비-JSON 출력 시 1회 재시도.
493
+
494
+ 또한 evals 10/11/12 에서 「orchestrator 가 Task tool 로 dispatch」 같은 절차적 expectation 을 제거 — `claude -p` 에 nested Task 표면이 없어 검증 불가하며, 우리가 최종적으로 신경 쓰는 성질도 아님. 남은 expectation 은 specialist 가 산출했어야 할 **출력 품질**을 대상으로 함.
495
+
496
+ **Sprint 2A — judge 견고성 + CI 상한 (`f973939`):**
497
+
498
+ PR #9 코드 리뷰의 두 가지 후속 수정:
499
+
500
+ 1. **Judge repair 재시도가 원본 context 보존** — `claude -p` 는 stateless 이므로, repair prompt 는 원본 `judge_prompt` (response + expectations) 전체와 이전 malformed output 을 다시 포함함. 새 `_judge_output_complete()` 체크가 정확히 N 개의 indexed expectation 이 없는 payload 를 거부하여, 첫 호출 출력이 복구 불가능할 때 model 이 「형태만 그럴듯한 위조 verdict」를 내는 것을 방지.
501
+ 2. **CI `behavioral-eval` 작업 timeout 90 → 120 분** — 최악 = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 분이므로, 이전 90분 상한은 유효한 run 을 조용히 cancel 할 수 있었음. 120분은 setup + artifact upload 에 ~10분 여유.
502
+
503
+ **새로 가시화된 베이스라인** (로컬 run, 2026-05-28): **0 / 100** `at-risk`, **13 / 33** expectation 통과, **6 critical + 14 warning** 실패. 집계 점수가 퇴보한 것이 아니라, **가시적**인 점수가 퇴보한 것 — 이전에 0/0 을 기여하던 4개 eval 이 이제 실제 signal 을 냄. 이 6개 critical 실패가 Stage 2 의 명시적 backlog: 3계층 JTBD (functional / emotional / social), B2B 조직 수준 Jobs, B2B buyer vs user persona 분리, Discovery scope 가드레일, pre-mortem leading-indicator 규율. expectation 단위 내역은 [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) 참조.
504
+
505
+ **Harness 개선은 `evals/` 와 `.github/workflows/` 에 거주하며, npm 으로 출하되지 않음.** v1.2.9 이상의 버전 bump 불필요 (v1.2.9 가 이미 user-level hook 과 evals 10/11/12 의 scope 조정을 포함).
506
+
507
+ **5개 i18n 로케일 (zh-TW, zh-CN, ja, es, ko) 에 미러링**.
508
+
482
509
  ---
483
510
 
484
511
  ## 🧪 개발 및 평가
@@ -487,7 +514,7 @@ v1.2.0+ 에서 도입된 3개의 전문 sub-agent (`discovery-specialist`, `stra
487
514
 
488
515
  **로컬 (무료, 권장)**: `claude` CLI를 Claude Pro/Max 구독으로 인증해서 (한 번만 `claude login`) 같은 스크립트를 실행합니다. API key 불필요, 추가 비용 없음. eval 시스템은 각 릴리스 전에 로컬에서 실행하도록 설계되었습니다.
489
516
 
490
- **CI (선택, 유료)**: `.github/workflows/eval-gate.yml`은 모든 PR과 `package.json`을 변경하는 `main`으로의 push에서 두 가지 모두 실행하고, 점수를 workflow Job Summary에 기록합니다. **merge도 publish도 차단하지 않습니다** — 회귀에 대응할지는 유지보수자가 판단합니다. CI에는 `ANTHROPIC_API_KEY` secret이 필요합니다 (GitHub Actions는 headless 컨테이너에서 OAuth를 사용할 수 없습니다). secret 미설정 시 eval job은 **깨끗하게 skip**(회색 ⏭️)되며 오해를 일으키는 빨간색 X는 표시되지 않습니다.
517
+ **CI (선택, 추가 과금 없음)**: `.github/workflows/eval-gate.yml`은 모든 PR과 `package.json`을 변경하는 `main`으로의 push에서 두 가지 모두 실행하고, 점수를 workflow Job Summary에 기록합니다. **merge도 publish도 차단하지 않습니다** — 회귀에 대응할지는 유지보수자가 판단합니다. CI도 Claude Pro/Max 구독을 사용합니다 (API key 불필요, 토큰 과금 없음): 로컬에서 `claude setup-token`을 한 번 실행하고, 출력된 토큰을 repo secret `CLAUDE_CODE_OAUTH_TOKEN`으로 추가하세요. secret 미설정 시 eval job은 **깨끗하게 skip**(회색 ⏭️)되며 오해를 일으키는 빨간색 X는 표시되지 않습니다.
491
518
 
492
519
  ### 로컬 실행
493
520
 
@@ -508,7 +535,7 @@ python3 evals/run_behavioral_eval.py --fail-on none # 보고만, exit 1 없음
508
535
  python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
509
536
  ```
510
537
 
511
- 로컬은 `--runs 3`이 기본값(다수결로 LLM 변동성 흡수). `claude` CLI는 Claude Pro/Max OAuth 세션(`claude login`)을 사용하므로 토큰당 비용이 없습니다. CI는 `--runs 1`을 사용하며 `ANTHROPIC_API_KEY` secret이 필요합니다.
538
+ 로컬은 `--runs 3`이 기본값(다수결로 LLM 변동성 흡수). `claude` CLI는 Claude Pro/Max OAuth 세션(`claude login`)을 사용하므로 토큰당 비용이 없습니다. CI는 `--runs 1`을 사용하며 동일한 구독을 `CLAUDE_CODE_OAUTH_TOKEN` secret을 통해 인증합니다 (`claude setup-token`으로 한 번만 생성).
512
539
 
513
540
  ### Severity 및 스코어링
514
541
 
package/README.md CHANGED
@@ -477,6 +477,33 @@ A token-reduction iteration. Same skill content semantics, smaller footprint per
477
477
 
478
478
  **Mirrored to 5 i18n locales** (zh-TW, zh-CN, ja, es, ko) preserving existing translations — structural slim applied identically per language.
479
479
 
480
+ ### Iteration 7: Eval Harness Resilience (Sprint 1 + 2A, v1.2.9)
481
+
482
+ A harness-level iteration, not a skill-level one. No skill semantics changed; the *surface area being measured* did. Goal: surface the real quality baseline by unblocking 4 evals that had been silently producing 0/0 verdicts.
483
+
484
+ **Sprint 1 — unblock unmeasurable clusters (`d2023fb`, `cee67cb`):**
485
+
486
+ Four evals (`eval-jtbd-depth`, `eval-prfaq-output`, `eval-subagent-discovery`, `eval-subagent-premortem`) had been producing 0 passes / 0 fails per run — indistinguishable from "no problems" in the aggregate score. Three causes:
487
+
488
+ 1. **Sub-agents missing in headless CI** — CI installed the skill at `~/.claude/skills/` but never copied `agents/*.md` to `~/.claude/agents/`. `claude -p` therefore couldn't dispatch via `Task`, and the orchestrator silently inline-ran.
489
+ 2. **Specialist-dispatch hook silent under `claude -p`** — plugin-level `hooks/` are not loaded in headless mode; only user-level `~/.claude/settings.json` UserPromptSubmit hooks are. CI now programmatically registers the dispatch hook at the user level before each behavioral run.
490
+ 3. **Response + judge timeouts too aggressive** — 180s response / 120s judge cut off long-form Discovery and Pre-mortem outputs mid-thought; the judge then saw a truncated string and emitted 0/0. Bumped to 600s / 240s with a single retry on non-JSON output.
491
+
492
+ Also dropped procedural "orchestrator delegates via Task tool" expectations from evals 10/11/12 — those are unverifiable in `claude -p` (no nested Task surface) and not the property we ultimately care about. The remaining expectations target the *output quality* the specialist would have produced.
493
+
494
+ **Sprint 2A — judge robustness + CI ceiling (`f973939`):**
495
+
496
+ Two follow-on fixes from PR #9 code review:
497
+
498
+ 1. **Judge repair retry preserves original context** — `claude -p` is stateless, so the repair prompt now re-includes the full original `judge_prompt` (response + expectations) plus the previous malformed output. A new `_judge_output_complete()` check rejects payloads that don't have exactly N indexed expectations, preventing the model from emitting a plausibly-shaped but fabricated verdict when the first call's output is unrecoverable.
499
+ 2. **CI `behavioral-eval` job timeout 90 → 120 min** — worst case = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 min, so the previous 90-min ceiling could silently cancel an otherwise valid run. 120 min leaves ~10 min headroom for setup + artifact upload.
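Review note: the worst-case figure quoted in item 2 can be checked with a few lines of arithmetic. The sketch below is illustrative only. `worst_case_minutes` is a hypothetical helper, not a function from `evals/`, and it assumes the evals are split into equal batches across the workers and that every response, judge, and repair call runs to its full timeout.

```python
import math


def worst_case_minutes(n_evals, workers, response_s, judge_s, repair_s):
    """Upper bound on wall-clock minutes if every call hits its timeout.

    Hypothetical helper for checking the README arithmetic; the batching
    model (ceil(n_evals / workers) sequential batches) is an assumption.
    """
    batches = math.ceil(n_evals / workers)        # evals handled per worker
    per_eval_s = response_s + judge_s + repair_s  # one eval, all timeouts hit
    return batches * per_eval_s / 60


worst = worst_case_minutes(12, 2, 600, 240, 240)
print(worst)        # 108.0 minutes, above the old 90-minute ceiling
print(120 - worst)  # 12.0 minutes left under the new 120-minute ceiling
```

With the numbers from the README this gives 108 minutes, which confirms that a 90-minute job timeout could cancel a valid run. It leaves 12 minutes of slack under 120, in line with the "~10 min headroom" stated above.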
500
+
501
+ **Newly visible baseline** (local run, 2026-05-28): **0 / 100** `at-risk`, **13 / 33** expectations passing, **6 critical + 14 warning** failures. The aggregate score did not regress — what regressed is the *visible* score, because four evals that previously contributed 0/0 now produce real signal. The 6 critical failures are now the explicit Stage 2 backlog: 3-layer JTBD (functional / emotional / social), B2B organization-level Jobs, B2B buyer-vs-user persona separation, Discovery-scope guardrails, and pre-mortem leading-indicator discipline. See [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md) for the per-expectation breakdown.
502
+
503
+ **Harness improvements live in `evals/` and `.github/workflows/` — they do not ship to npm.** No version bump beyond v1.2.9 (which carried the user-level hook + scope edits to evals 10/11/12).
504
+
505
+ **Mirrored to 5 i18n locales** (zh-TW, zh-CN, ja, es, ko).
506
+
480
507
  ---
481
508
 
482
509
  ## 🧪 Development & Evals
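Review note: the Sprint 2A judge fix in the hunk above (stateless repair retry plus a completeness check) can be sketched roughly as follows. This is not the code from `evals/run_behavioral_eval.py`. The payload shape (a JSON object with a `results` list whose items carry a zero-based `index` and a boolean `passed`) and both function names are assumptions made for illustration; only the behaviour "reject anything that does not contain exactly N indexed expectations" comes from the README text.

```python
import json


def judge_output_complete(raw, n_expectations):
    """Return True only if `raw` is JSON with exactly N indexed verdicts.

    Assumed (hypothetical) shape:
    {"results": [{"index": 0, "passed": true}, ...]}
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    results = payload.get("results") if isinstance(payload, dict) else None
    if not isinstance(results, list) or len(results) != n_expectations:
        return False
    indices = []
    for item in results:
        if not isinstance(item, dict) or not isinstance(item.get("passed"), bool):
            return False
        idx = item.get("index")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(idx, int) or isinstance(idx, bool):
            return False
        indices.append(idx)
    # Every expectation must appear exactly once: no gaps, no duplicates.
    return sorted(indices) == list(range(n_expectations))


def build_repair_prompt(judge_prompt, malformed_output):
    """Re-send the full original prompt, since each `claude -p` call is stateless."""
    return (
        judge_prompt
        + "\n\nYour previous reply could not be parsed:\n"
        + malformed_output
        + "\n\nReturn valid JSON with one verdict per expectation."
    )
```

Under these assumptions, a repaired reply that covers only some expectations, or repeats an index, is rejected, so it is never scored as if it were a full verdict.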
@@ -485,7 +512,7 @@ The `evals/` directory ships two complementary test suites and a deterministic s
485
512
 
486
513
  **Local (free, recommended):** run the same scripts with the `claude` CLI authenticated via your Claude Pro/Max subscription (`claude login` once). No API key, no marginal cost. The eval system is designed to be run locally before each release.
487
514
 
488
- **CI (optional, paid):** `.github/workflows/eval-gate.yml` will run both suites on every PR and on every push to `main` that changes `package.json`, then report the score to the workflow Job Summary. It **never blocks merge or publish** — the maintainer decides whether to act on regressions. CI requires an `ANTHROPIC_API_KEY` secret because GitHub Actions cannot use OAuth in a headless container; without the secret, eval jobs **skip cleanly** (gray ⏭️) instead of failing red.
515
+ **CI (optional, no extra billing):** `.github/workflows/eval-gate.yml` will run both suites on every PR and on every push to `main` that changes `package.json`, then report the score to the workflow Job Summary. It **never blocks merge or publish** — the maintainer decides whether to act on regressions. CI runs on your Claude Pro/Max subscription (no API key, no per-token cost): one-time setup is `claude setup-token` locally, then add the printed token as repo secret `CLAUDE_CODE_OAUTH_TOKEN`. Without the secret, eval jobs **skip cleanly** (gray ⏭️) instead of failing red.
489
516
 
490
517
  ### Running locally
491
518
 
@@ -506,7 +533,7 @@ python3 evals/run_behavioral_eval.py --fail-on none # report without exit 1
506
533
  python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
507
534
  ```
508
535
 
509
- Local runs default to `--runs 3` (majority vote handles LLM variance); the `claude` CLI uses your Claude Pro/Max OAuth session (`claude login`), so there's no per-token cost. CI uses `--runs 1` and requires the `ANTHROPIC_API_KEY` secret.
536
+ Local runs default to `--runs 3` (majority vote handles LLM variance); the `claude` CLI uses your Claude Pro/Max OAuth session (`claude login`), so there's no per-token cost. CI uses `--runs 1` and the same subscription via a `CLAUDE_CODE_OAUTH_TOKEN` secret (generated once with `claude setup-token`).
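Review note: the majority vote behind `--runs 3` can be sketched as below. The function name and the tie rule (a tie counts as a fail, which matters only for an even number of runs) are assumptions for illustration. The README states only that a majority vote across runs absorbs LLM variance.

```python
def majority_verdict(run_results):
    """Collapse per-run pass/fail booleans for one expectation into one verdict.

    Hypothetical helper: passes only on a strict majority of passing runs,
    so a tie (possible with an even run count) is treated as a fail.
    """
    if not run_results:
        raise ValueError("need at least one run")
    passes = sum(1 for passed in run_results if passed)
    return passes > len(run_results) / 2


print(majority_verdict([True, False, True]))   # True: 2 of 3 runs passed
print(majority_verdict([False, True, False]))  # False: 1 of 3 runs passed
print(majority_verdict([True]))                # True: the single-run CI case
```

With `--runs 1`, as used in CI, the vote reduces to the single run's result, which is why the CI score varies more than the local one.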
510
537
 
511
538
  ### Severity & scoring
512
539
 
package/README.zh-CN.md CHANGED
@@ -479,6 +479,33 @@ Claude Code 会自动:
479
479
 
480
480
  **已同步至 5 个 i18n 语系**(zh-TW、zh-CN、ja、es、ko),保留既有译文 —— 结构性瘦身按语系一致套用。
481
481
 
482
+ ### Iteration 7:Eval Harness 韧性强化(Sprint 1 + 2A,v1.2.9)
483
+
484
+ Harness 层迭代,不是 skill 层。Skill 语意没变,变的是**被测量的表面**。目标:解除 4 个一直在悄悄产出 0/0 verdict 的 eval,让真实品质基线浮出水面。
485
+
486
+ **Sprint 1 — 解锁无法测量的群集(`d2023fb`、`cee67cb`):**
487
+
488
+ 4 个 eval(`eval-jtbd-depth`、`eval-prfaq-output`、`eval-subagent-discovery`、`eval-subagent-premortem`)每次都产出 0 pass / 0 fail,在汇总分数中与「没问题」无法区分。三个原因:
489
+
490
+ 1. **CI headless 模式缺少 sub-agent** —— CI 把 skill 装到 `~/.claude/skills/`,却没把 `agents/*.md` 复制到 `~/.claude/agents/`。`claude -p` 因此无法透过 `Task` 派发,orchestrator 只能默默 inline 执行。
491
+ 2. **Specialist-dispatch hook 在 `claude -p` 不会载入** —— plugin 层的 `hooks/` 在 headless 模式不会载入,只有 user 层 `~/.claude/settings.json` 的 UserPromptSubmit hook 会。CI 现在会在每次 behavioral run 之前以程序方式把 dispatch hook 注册到 user 层。
492
+ 3. **Response + judge timeout 太紧** —— 180s response / 120s judge 会把长篇 Discovery、Pre-mortem 输出中途切断,judge 看到截断字串就吐出 0/0。提升到 600s / 240s,且非 JSON 输出时重试一次。
493
+
494
+ 同时也从 evals 10/11/12 删掉「orchestrator 必须透过 Task 派发」这类程序性 expectation —— 在 `claude -p` 没有 nested Task 介面,无法验证,也不是我们最终在意的性质。留下的 expectation 都针对 specialist 应产出的**输出质量**。
495
+
496
+ **Sprint 2A — judge 韧性 + CI 上限(`f973939`):**
497
+
498
+ PR #9 review 之后的两个跟进修正:
499
+
500
+ 1. **Judge 修复重试保留原始 context** —— `claude -p` 是无状态的,所以修复 prompt 现在会重新带入完整原始 `judge_prompt`(response + expectations)加上前一次的 malformed output。新的 `_judge_output_complete()` 检查会拒绝「没有完整 N 个 indexed expectation」的回应,避免 model 在第一次输出无法救援时凭空捏造一份看起来合理的 verdict。
501
+ 2. **CI `behavioral-eval` job timeout 90 → 120 分钟** —— 最坏情况 = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 分钟,先前 90 分钟上限可能默默 cancel 整轮 run。120 分钟给 setup + artifact upload 留 ~10 分钟余裕。
502
+
503
+ **新可见的基线**(本机 run,2026-05-28):**0 / 100** `at-risk`、**13 / 33** expectation 通过、**6 critical + 14 warning** 失败。汇总分数并没有退步,退的是**可见**分数 —— 四个原本贡献 0/0 的 eval 现在开始产出真实 signal。这 6 个 critical 失败就是 Stage 2 明确的待修清单:三层 JTBD(functional / emotional / social)、B2B 组织层 Jobs、B2B buyer vs user persona 分离、Discovery scope 守备、pre-mortem leading-indicator 纪律。逐项细节见 [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md)。
504
+
505
+ **Harness 改进住在 `evals/` 与 `.github/workflows/`,不会发到 npm。** 版本不需要再往 v1.2.9 之上 bump(v1.2.9 已经包含 user-level hook 与 evals 10/11/12 的 scope 调整)。
506
+
507
+ **已同步至 5 个 i18n 语系**(zh-TW、zh-CN、ja、es、ko)。
508
+
482
509
  ---
483
510
 
484
511
  ## 🧪 开发与评测
@@ -487,7 +514,7 @@ Claude Code 会自动:
487
514
 
488
515
  **本地(免费,推荐)**:用 `claude` CLI 搭配你的 Claude Pro/Max 订阅(先 `claude login` 一次)跑这些 script。不需要 API key、没有额外成本。整套 eval 系统就是设计来在每次发版前本地跑一遍。
489
516
 
490
- **CI(可选,付费)**:`.github/workflows/eval-gate.yml` 会在每个 PR 与每次 push 到 `main`(含 `package.json` 变动)时跑这两套,把分数写进 workflow 的 Job Summary。**不挡 merge、不挡 publish** — 看到结果后由维护者决定要不要调整。CI 需要 `ANTHROPIC_API_KEY` secret(GitHub Actions headless 容器无法走 OAuth);没设 secret 时 eval job **会干净地 skip**(灰色 ⏭️),不会出现误导的红叉。
517
+ **CI(可选,不额外计费)**:`.github/workflows/eval-gate.yml` 会在每个 PR 与每次 push 到 `main`(含 `package.json` 变动)时跑这两套,把分数写进 workflow 的 Job Summary。**不挡 merge、不挡 publish** — 看到结果后由维护者决定要不要调整。CI 同样走你的 Claude Pro/Max 订阅(不需 API key、没有按 token 计费的成本):一次性设置为本机 `claude setup-token` 生成长期 token,把它加进 repo secret `CLAUDE_CODE_OAUTH_TOKEN`。没设 secret 时 eval job **会干净地 skip**(灰色 ⏭️),不会出现误导的红叉。
491
518
 
492
519
  ### 本地执行
493
520
 
@@ -508,7 +535,7 @@ python3 evals/run_behavioral_eval.py --fail-on none # 只报告,不 exit 1
508
535
  python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
509
536
  ```
510
537
 
511
- 本地默认 `--runs 3`(多数决可吸收 LLM 变异性);`claude` CLI 走你的 Claude Pro/Max OAuth session(`claude login`),没有按 token 计费的成本。CI 用 `--runs 1` 并需要 `ANTHROPIC_API_KEY` secret。
538
+ 本地默认 `--runs 3`(多数决可吸收 LLM 变异性);`claude` CLI 走你的 Claude Pro/Max OAuth session(`claude login`),没有按 token 计费的成本。CI 用 `--runs 1`,靠同一个订阅通过 `CLAUDE_CODE_OAUTH_TOKEN` secret 认证(用 `claude setup-token` 一次性生成)。
512
539
 
513
540
  ### Severity 与计分
514
541
 
package/README.zh-TW.md CHANGED
@@ -478,6 +478,33 @@ Claude Code 會自動:
478
478
 
479
479
  **5 個 i18n 語系同步**(zh-TW、zh-CN、ja、es、ko),保留既有翻譯——結構性瘦身在各語系等比例套用。
480
480
 
481
+ ### Iteration 7:Eval Harness 韌性強化(Sprint 1 + 2A,v1.2.9)
482
+
483
+ Harness 層的迭代,不是 skill 層。Skill 語意沒變,變的是**被測量的表面**。目標:解除 4 個一直在悄悄產出 0/0 verdict 的 eval,讓真實品質基線浮出水面。
484
+
485
+ **Sprint 1 — 解鎖無法測量的群集(`d2023fb`、`cee67cb`):**
486
+
487
+ 4 個 eval(`eval-jtbd-depth`、`eval-prfaq-output`、`eval-subagent-discovery`、`eval-subagent-premortem`)每次都產出 0 pass / 0 fail,在彙總分數中與「沒問題」無法區分。三個原因:
488
+
489
+ 1. **CI headless 模式缺少 sub-agent** — CI 把 skill 裝到 `~/.claude/skills/`,卻沒把 `agents/*.md` 複製到 `~/.claude/agents/`。`claude -p` 因此無法透過 `Task` 派發,orchestrator 只能默默 inline 執行。
490
+ 2. **Specialist-dispatch hook 在 `claude -p` 不會載入** — plugin 層的 `hooks/` 在 headless 模式不會載入,只有 user 層 `~/.claude/settings.json` 的 UserPromptSubmit hook 會。CI 現在會在每次 behavioral run 之前以程式碼方式把 dispatch hook 註冊到 user 層。
491
+ 3. **Response + judge timeout 太緊** — 180s response / 120s judge 會把長篇 Discovery、Pre-mortem 輸出中途切斷,judge 看到截斷字串就吐出 0/0。提升到 600s / 240s,且非 JSON 輸出時重試一次。
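Item 2 above says CI now registers the dispatch hook at the user level before each behavioral run. The sketch below shows one way such a registration could look. It is not the project's CI step: the function name is hypothetical, and the settings layout (`hooks` → `UserPromptSubmit` → groups of `hooks` entries) is assumed to mirror the shape used in `hooks/hooks.json`.

```python
import json
from pathlib import Path


def register_user_prompt_hook(settings_path, command, timeout=5):
    """Add a UserPromptSubmit command hook to a settings file, once.

    Hypothetical helper: the JSON layout is an assumption modelled on
    the plugin's hooks.json, not a confirmed schema.
    """
    path = Path(settings_path)
    settings = json.loads(path.read_text()) if path.exists() else {}
    groups = settings.setdefault("hooks", {}).setdefault("UserPromptSubmit", [])

    # Skip the write when the same command is already registered.
    already = any(
        hook.get("command") == command
        for group in groups
        for hook in group.get("hooks", [])
    )
    if not already:
        groups.append(
            {"hooks": [{"type": "command", "command": command, "timeout": timeout}]}
        )
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(settings, indent=2))
    return settings
```

Calling it twice with the same command leaves a single entry, which matters when a CI job re-runs on a cached home directory.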
492
+
493
+ 同時也從 evals 10/11/12 刪掉「orchestrator 必須透過 Task 派發」這類程序性 expectation——在 `claude -p` 沒有 nested Task 介面,無法驗證,也不是我們最終在意的性質。留下的 expectation 都針對 specialist 應產出的**輸出品質**。
494
+
495
+ **Sprint 2A — judge 韌性 + CI 上限(`f973939`):**
496
+
497
+ PR #9 review 之後的兩個跟進修正:
498
+
499
+ 1. **Judge 修復重試保留原始 context** — `claude -p` 是無狀態的,所以修復 prompt 現在會重新帶入完整原始 `judge_prompt`(response + expectations)加上前一次的 malformed output。新的 `_judge_output_complete()` 檢查會拒絕「沒有完整 N 個 indexed expectation」的回應,避免 model 在第一次輸出無法救援時憑空捏造一份看起來合理的 verdict。
500
+ 2. **CI `behavioral-eval` job timeout 90 → 120 分鐘** — 最壞情況 = 12 evals / 2 workers × (600s response + 240s judge + 240s repair) ≈ 108 分鐘,先前 90 分鐘上限可能默默 cancel 整輪 run。120 分鐘給 setup + artifact upload 留 ~10 分鐘餘裕。
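Item 1 above describes a completeness check that rejects judge replies lacking all N indexed expectations. The source does not show the implementation of `_judge_output_complete()`, so the following is a hypothetical stand-in. It assumes the judge returns JSON with a `results` list whose items carry a 1-based integer `index`; both field names are assumptions.

```python
import json


def judge_output_complete_sketch(raw_output, expected_count):
    """Return True only if raw_output is JSON covering indices 1..expected_count.

    Hypothetical schema: {"results": [{"index": 1, ...}, ...]}.
    """
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return False  # truncated or non-JSON output
    if not isinstance(data, dict):
        return False
    results = data.get("results")
    if not isinstance(results, list) or len(results) != expected_count:
        return False

    seen = set()
    for item in results:
        if not isinstance(item, dict) or not isinstance(item.get("index"), int):
            return False
        seen.add(item["index"])
    # Every expectation must appear exactly once.
    return seen == set(range(1, expected_count + 1))
```

A check of this shape lets the harness distinguish a repaired verdict that covers every expectation from one that silently drops or duplicates entries.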
501
+
502
+ **新可見的基線**(本機 run,2026-05-28):**0 / 100** `at-risk`、**13 / 33** expectation 通過、**6 critical + 14 warning** 失敗。彙總分數並沒有退步,退的是**可見**分數——四個原本貢獻 0/0 的 eval 現在開始產出真實 signal。這 6 個 critical 失敗就是 Stage 2 明確的待修清單:三層 JTBD(functional / emotional / social)、B2B 組織層 Jobs、B2B buyer vs user persona 分離、Discovery scope 守備、pre-mortem leading-indicator 紀律。逐項細節見 [`docs/sprint1-local-eval-2026-05-28.md`](./docs/sprint1-local-eval-2026-05-28.md)。
503
+
504
+ **Harness 改進住在 `evals/` 與 `.github/workflows/`,不會發到 npm。** 版本不需要再往 v1.2.9 之上 bump(v1.2.9 已經包含 user-level hook 與 evals 10/11/12 的 scope 調整)。
505
+
506
+ **5 個 i18n 語系同步**(zh-TW、zh-CN、ja、es、ko)。
507
+
481
508
  ---
482
509
 
483
510
  ## 🧪 開發與評測
@@ -486,7 +513,7 @@ Claude Code 會自動:
486
513
 
487
514
  **本地(免費,推薦)**:用 `claude` CLI 搭配你的 Claude Pro/Max 訂閱(先 `claude login` 一次)跑這些 script。不需要 API key、沒有額外成本。整套 eval 系統就是設計來在每次發版前本地跑一遍。
488
515
 
489
- **CI(選用,付費)**:`.github/workflows/eval-gate.yml` 會在每個 PR 與每次 push 到 `main`(含 `package.json` 變動)時跑這兩套,把分數寫進 workflow 的 Job Summary。**不擋 merge、不擋 publish** — 看到結果後由維護者決定要不要調整。CI 需要 `ANTHROPIC_API_KEY` secret(GitHub Actions headless 容器無法走 OAuth);沒設 secret 時 eval job **會乾淨地 skip**(灰色 ⏭️),不會出現誤導的紅叉。
516
+ **CI(選用,不額外計費)**:`.github/workflows/eval-gate.yml` 會在每個 PR 與每次 push 到 `main`(含 `package.json` 變動)時跑這兩套,把分數寫進 workflow 的 Job Summary。**不擋 merge、不擋 publish** — 看到結果後由維護者決定要不要調整。CI 同樣走你的 Claude Pro/Max 訂閱(不需 API key、沒有按 token 計費的成本):一次性設定為本機 `claude setup-token` 產生長期 token,把它加進 repo secret `CLAUDE_CODE_OAUTH_TOKEN`。沒設 secret 時 eval job **會乾淨地 skip**(灰色 ⏭️),不會出現誤導的紅叉。
490
517
 
491
518
  ### 本地執行
492
519
 
@@ -507,7 +534,7 @@ python3 evals/run_behavioral_eval.py --fail-on none # 只報告,不 exit 1
507
534
  python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
508
535
  ```
509
536
 
510
- 本地預設 `--runs 3`(多數決可吸收 LLM 變異性);`claude` CLI 走你的 Claude Pro/Max OAuth session(`claude login`),沒有按 token 計費的成本。CI 用 `--runs 1` 並需要 `ANTHROPIC_API_KEY` secret。
537
+ 本地預設 `--runs 3`(多數決可吸收 LLM 變異性);`claude` CLI 走你的 Claude Pro/Max OAuth session(`claude login`),沒有按 token 計費的成本。CI 用 `--runs 1`,靠同一個訂閱透過 `CLAUDE_CODE_OAUTH_TOKEN` secret 認證(用 `claude setup-token` 一次性產生)。
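The line above says the local default of `--runs 3` relies on a majority vote to absorb LLM variability. How the harness aggregates runs is not shown in this diff, so the sketch below is an assumption: it treats each run as producing one verdict and keeps the verdict that a strict majority of runs agree on.

```python
from collections import Counter


def majority_verdict(run_results):
    """Return the verdict seen in a strict majority of runs, or None if there is none."""
    if not run_results:
        return None
    verdict, count = Counter(run_results).most_common(1)[0]
    return verdict if count * 2 > len(run_results) else None


print(majority_verdict([True, True, False]))  # True
print(majority_verdict([True, False]))        # None (tie, no majority)
```

With `--runs 1`, as in CI, the single run is trivially the majority, so one noisy run decides the outcome.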
511
538
 
512
539
  ### Severity 與計分
513
540
 
package/SKILL.md CHANGED
@@ -40,9 +40,26 @@ Also switch if the user explicitly requests a language (e.g., "用中文進行")
40
40
 
41
41
  Use **progressive confirmation** — avoid dumping all options. If the user already specified, apply directly.
42
42
 
43
- **Step 1 — Confirm mode** (always ask unless already specified):
43
+ **Step 1 — Confirm mode**
44
44
 
45
- > Select a mode (number or name), or just describe your product and I'll recommend:
45
+ **Step 1a Quick triggers (check FIRST; auto-apply matching mode without showing the menu):**
46
+
47
+ Scan the user's first message for these phrases or close paraphrases. If ANY match, skip the menu entirely and enter the matched mode at S1 immediately.
48
+
49
+ | Trigger phrase (or close paraphrase) | Auto-apply mode |
+ |---|---|
+ | "validate idea quickly", "30 min direction", "quick check" | 🚀 Quick |
+ | "full product plan", "comprehensive planning", "do the full thing" | 📦 Full |
+ | "I already know what to build", "skip discovery", "straight to MVP" | ⚡ Build |
+ | "revamp my product", "optimize existing", "redesign our app" | 🔄 Revision |
+ | **"add a feature", "feature for existing product", "plan this feature", "build [X] feature for our app"** | 🔧 Feature Extension |
+ | "pre-mortem", "what could go wrong", "find failure modes" | route to `pre-mortem-runner` per Specialist Dispatch Protocol |
57
+
58
+ When a Quick trigger fires, your reply opens with: *"Detected '[trigger phrase]' — entering [Mode] at S1."* Do NOT present the 6-mode menu. Proceed to Step 2 product-type confirmation (or directly to the mode's S1 if product type is already implied).
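The quick-trigger table is an instruction to the model, not code, but its literal phrases can be illustrated as a lookup. The sketch below is hypothetical: it only does case-insensitive substring matching, so it cannot recognise "close paraphrases" or the `build [X] feature for our app` template, and it leaves out the pre-mortem row because that row routes to a specialist rather than a mode.

```python
# Literal phrases copied from the trigger table, lowercased for matching.
QUICK_TRIGGERS = {
    "Quick": ["validate idea quickly", "30 min direction", "quick check"],
    "Full": ["full product plan", "comprehensive planning", "do the full thing"],
    "Build": ["i already know what to build", "skip discovery", "straight to mvp"],
    "Revision": ["revamp my product", "optimize existing", "redesign our app"],
    "Feature Extension": [
        "add a feature",
        "feature for existing product",
        "plan this feature",
    ],
}


def match_quick_trigger(message):
    """Return (mode, phrase) for the first literal trigger found, else None."""
    text = message.lower()
    for mode, phrases in QUICK_TRIGGERS.items():
        for phrase in phrases:
            if phrase in text:
                return mode, phrase
    return None


print(match_quick_trigger("Can you help me add a feature to our app?"))
# ('Feature Extension', 'add a feature')
```

A `None` result corresponds to the Step 1b path, where the full menu is shown.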
59
+
60
+ **Step 1b — Menu (only if NO quick trigger matched):**
61
+
62
+ > Select a mode (number or name) — pick the one that matches your situation. If you're unsure, briefly describe your product and I'll narrow to **two candidates** for you to choose between (never one).
46
63
  > 1. 🚀 **Quick Mode** — 3 steps, ~30 min (JTBD → PR-FAQ → North Star)
47
64
  > 2. 📦 **Full Mode** — 9–11 steps, comprehensive planning document
48
65
  > 3. 🔄 **Revision Mode** — 6–8 steps, optimize existing product
@@ -50,12 +67,7 @@ Use **progressive confirmation** — avoid dumping all options. If the user alre
50
67
  > 5. ⚡ **Build Mode** — 7 steps, skip Discovery, go straight to solution
51
68
  > 6. 🔧 **Feature Extension Mode** — 4 steps, add a feature to existing product
52
69
 
53
- Quick triggers (auto-apply matching mode):
54
- - "validate idea quickly" / "30 min direction" → Quick
55
- - "full product plan" → Full
56
- - "I already know what to build" → Build
57
- - "revamp my product" / "optimize" → Revision
58
- - "add a feature" / "feature for existing product" → Feature Extension
70
+ **Neutrality rule (applies to Step 1b only):** when no Quick trigger matched and you DO show the menu, present all 6 modes. You may add a short note like *"based on what you described, options 1 and 2 might fit best"* — but you must **NOT** close the menu by recommending exactly one mode ("I'd recommend Quick Mode"). Mode choice is the user's, not yours.
59
71
 
60
72
  **Step 2 — Confirm product type and audience** (after mode confirmed):
61
73
 
@@ -95,7 +107,7 @@ After confirming the mode, read the corresponding mode rules file for step seque
95
107
  | Product type confirmed | `rules-product-type.md` (B2B/B2C adjustments) |
96
108
  | Mode has Optional steps | `rules-optional-trigger.md` (triggers + Persona-Journey bundle + Phase Decision Point) |
97
109
  | Product context read/write | `rules-context.md` |
98
- | About to dispatch to a specialist sub-agent (discovery / strategy-critic / pre-mortem-runner) — load on first dispatch consideration in any mode | `rules-subagent-dispatch.md` |
110
+ | About to dispatch to a specialist sub-agent (discovery / strategy-critic / pre-mortem-runner) — load on first dispatch consideration in any mode, OR immediately when the user pastes a strategy / persona / JTBD-shaped artifact and asks for critique/review (even outside the canonical step) | `rules-subagent-dispatch.md` |
99
111
  | User asks for framework list / supplementary commands | `rules-commands.md` |
100
112
  | User uploads file | `rules-file-integration.md` |
101
113
  | User says pause/save/continue | `rules-progress.md` |
@@ -154,7 +166,67 @@ Other rules:
154
166
  3. **No skipping steps** — follow the mode's step sequence; do not skip because "the user probably just wants the final result."
155
167
  4. **Dev handoff only after full completion** — "start development" / "generate dev handoff package" requires all steps marked ✅. Mid-process requests get: *"We're at S[X]/S[Y]. Recommend completing remaining steps. Continue, or proceed at current progress?"*
156
168
  5. **Progress indicator is single source of truth** — completion = all steps ✅ in the indicator; don't infer.
157
- 6. **Quality self-checks must surface issues** — after each step, run the inline checklist (in your mode rules file) or load `rules-quality-review.md`. The checklist must NOT have every item ✅; if all pass, proactively identify "the weakest aspect of this output" and explain how to strengthen.
169
+ 6. **Quality self-checks must surface issues** — after each step, you MUST load `references/rules-quality-review.md` and follow its protocol exactly. The "Format" block in that file is authoritative (✅/❌ markers only, no ⚠️/partial/blank substitutes, each ❌ includes downstream impact). Mode rule files do NOT contain a substitute inline checklist — `rules-quality-review.md` is the single source of truth. The checklist must NOT have every item ✅; if all pass, lower the bar and re-review until at least one surfaces on a substantive content gap.
170
+ 7. **Specialist sub-agents must be dispatched, not inline-simulated** — when the trigger conditions in the table below fire, you MUST invoke the specialist via the Task tool with the matching `subagent_type`. Inline-running the critique/discovery yourself fails the contract (specialists exist precisely because separated context = higher-quality output). See `## 🤝 Specialist Dispatch Protocol` below.
171
+
172
+ ---
173
+
174
+ ## 🤝 Specialist Dispatch Protocol (always check before responding)
175
+
176
+ Three specialist sub-agents live in isolated contexts: `strategy-critic`, `discovery-specialist`, `pre-mortem-runner`. Their value comes from focused context — running their job inline in the main agent dilutes it.
177
+
178
+ **Dispatch trigger table** (any row matches → dispatch immediately, even mid-mode, even outside the canonical step):
179
+
180
+ | Trigger | Specialist | Example user message |
+ |---|---|---|
+ | User pastes a strategy artifact ("our mission is…", "our strategy is…", Strategy Blocks, Rumelt kernel, DHM, Empowered Teams charter) AND asks for review/critique/feedback | `strategy-critic` | "Review this strategy: 'Our mission is to delight customers…'" |
+ | Persona / JTBD / OST / Journey Map / Continuous Discovery work | `discovery-specialist` | Full Mode S2-S6, Build Mode S2, any Custom step selecting discovery |
+ | User asks "what could go wrong" / pre-mortem / risk analysis | `pre-mortem-runner` | "Pre-mortem this MVP", or Full Mode S10 / Build Mode S4 |
185
+
186
+ ### Required response shape when a trigger fires
187
+
188
+ When any row matches, your reply MUST be structured as exactly these three parts, in order. No other shape is acceptable — no prose, no mode menu, no progress indicator, and no inline analysis before the Task call.
189
+
190
+ **Part 1 — first line of output, verbatim** (replace `{specialist}` with the matching specialist name):
191
+
192
+ > Dispatching to `{specialist}` subagent via Task tool with `subagent_type={specialist}`.
193
+
194
+ **Part 2 — immediately call the Task tool**:
195
+
196
+ ```
+ Task(
+   subagent_type="{specialist}",
+   description="<short 2-3 word summary>",
+   prompt="<paste the user's original prompt verbatim, then add a final line: 'Reply in [user's working language].'>"
+ )
+ ```
203
+
204
+ **Part 3 — after the specialist returns YAML**, integrate `three_questions_to_ask_the_writer` (strategy-critic) / `open_questions` (discovery) / `priority_three` + `pre_launch_experiments` (pre-mortem) **verbatim** into your reply. Do not soften, do not paraphrase, do not skip.
205
+
206
+ ### Anti-patterns (each is a contract failure)
207
+
208
+ - ❌ Producing a Persona / JTBD / critique / pre-mortem yourself before the Task call — even partially, even "to warm up."
209
+ - ❌ Writing prose, a mode menu, or a progress indicator before the dispatch marker.
210
+ - ❌ Skipping the Task call because you "already know the answer." The specialist's focused context produces materially higher-quality output than you can inline.
211
+ - ❌ Paraphrasing the dispatch marker. The first-line shape is verbatim.
212
+
213
+ **Genuine false-positive exception**: if the prompt has no real connection to a specialist's scope (e.g., the user mentions "JTBD" only to ask what the acronym means), state that in one short sentence and proceed without dispatching. When in doubt, dispatch — the sub-agent's `status: out_of_scope` reply cleanly bounces non-matching requests back to you.
214
+
215
+ ### Reference fallback when Task dispatch is unavailable
216
+
217
+ Some environments cannot dispatch sub-agents (notably `claude -p` headless runs, some MCP harnesses, and certain CI eval contexts). In those environments the `Task` tool is absent or inert, so the dispatch above will silently inline-collapse. To prevent content collapse, **before producing inline output for any matched trigger row, you MUST read the corresponding reference files and treat their Hard Gates as your own**:
218
+
219
+ | Specialist (if dispatch fails / unavailable) | Reference files to read FIRST, then satisfy Hard Gates inline |
+ |---|---|
+ | `discovery-specialist` | `references/02a-persona.md` (Persona structure + B2B Buyer/User Hard Gate + B2B Prioritization vocabulary) AND `references/02b-jtbd.md` (3-layer JTBD + B2B Org-Level Jobs Hard Gates) AND `references/rules-quality-review.md` (✅/❌ marker format + ≥1 ❌ Hard Gate). Add `references/02c-ost-journey.md` if the request includes OST or Journey Map. |
+ | `strategy-critic` | `references/01-strategy.md` (Rumelt diagnosis + three-questions critique format) AND `references/rules-quality-review.md` |
+ | `pre-mortem-runner` | `references/04-develop.md` (Pre-mortem section — 15+ scenarios across 5 categories + leading-indicator format) AND `references/rules-quality-review.md` |
224
+
225
+ **Quality self-review is always required.** Whenever the user prompt asks for a quality self-review, checklist, or step-end critique — or whenever you are about to emit step-end output of any kind — you MUST have read `references/rules-quality-review.md` and follow its exact `✅`/`❌` marker format with at least one `❌` on a substantive content gap. This is non-negotiable regardless of whether dispatch was attempted or whether the fallback path was used.
226
+
227
+ This is **not** a license to skip dispatch when it IS available. The order is: (1) attempt dispatch; (2) if the Task tool is unavailable or the call cannot complete, read the listed references and produce specialist-grade output inline; (3) cite that you used the inline fallback in one short note at the end ("Inline fallback used — Task dispatch unavailable in this environment."). The references above embed the same Hard Gates the specialist would have enforced, so following them faithfully closes the quality gap.
228
+
229
+ Full per-trigger invocation templates: `references/rules-subagent-dispatch.md`. A `UserPromptSubmit` hook (`hooks/user-prompt-detect-specialist-dispatch.py`) also enforces this protocol at the harness layer — its reminder and this section are intentional duplicates so the rule is unmissable.
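The contents of `hooks/user-prompt-detect-specialist-dispatch.py` are not part of this excerpt. As a rough illustration of what a harness-layer reminder could look like, the sketch below assumes the hook receives a JSON object with a `prompt` field on stdin and that whatever it prints is surfaced to the model. The keyword patterns and the reminder wording are hypothetical, loosely derived from the dispatch trigger table.

```python
import json
import re

# Hypothetical keyword patterns; the real hook's matching rules are not shown here.
SPECIALIST_PATTERNS = {
    "pre-mortem-runner": re.compile(r"pre-?mortem|what could go wrong", re.I),
    "strategy-critic": re.compile(
        r"(review|critique|feedback).*(strategy|mission)", re.I | re.S
    ),
    "discovery-specialist": re.compile(r"\b(persona|jtbd|journey map)\b", re.I),
}


def reminder_for(raw_stdin):
    """Return a one-line dispatch reminder for the hook payload, or '' if none applies."""
    try:
        prompt = json.loads(raw_stdin).get("prompt", "")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return ""  # malformed payload: stay silent rather than block the prompt
    if not isinstance(prompt, str):
        return ""
    for specialist, pattern in SPECIALIST_PATTERNS.items():
        if pattern.search(prompt):
            return (
                f"Reminder: dispatch to `{specialist}` via the Task tool "
                "before answering inline."
            )
    return ""
```

In an actual hook script the entry point would be along the lines of `print(reminder_for(sys.stdin.read()))`, kept fast enough to fit the 5-second timeout configured in `hooks.json`.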
158
230
 
159
231
  ---
160
232
 
package/agents/strategy-critic.md CHANGED
@@ -9,7 +9,7 @@ model: inherit
9
9
 
10
10
  You are a hostile-but-fair strategy reviewer trained in the lineage of Richard Rumelt (*Good Strategy / Bad Strategy*), Marty Cagan (empowered teams vs feature teams), Gibson Biddle (DHM), and Shreyas Doshi (strategy as the root of most "execution" problems).
11
11
 
12
- Your only job: **find what is wrong with a strategy artifact** so the team fixes it before they spend a quarter building against bad logic. Do not rewrite. Do not soften. Do not validate work that does not deserve validation.
12
+ Your only job: **find what is wrong with a strategy artifact** so the team fixes it before they spend a quarter building against bad logic. **You return questions, not rewrites.** Do not soften. Do not validate work that does not deserve validation.
13
13
 
14
14
  ## Scope
15
15
 
@@ -43,6 +43,65 @@ But hostile ≠ cruel:
43
43
 
44
44
  Never write "this is bad". Always write *why* it is bad, *which principle* is violated, *what question* fixes it.
45
45
 
46
+ ## Hard rule: critic, not author
47
+
48
+ The following output patterns are **forbidden anywhere in your YAML or surrounding text**. If your draft contains any of them, regenerate before returning:
49
+
50
+ - "Our [mission/vision/strategy] should be..."
+ - "A better [strategy/diagnosis/policy] would be..."
+ - "Here is a revised [strategy/diagnosis/policy]:"
+ - "Try something like: ..."
+ - "Consider rewriting as: ..."
+ - Offers to "help rebuild", "draft a new version", "rewrite this for you"
+ - Any rewritten artifact text — even partial, even as "example", even inside a `critique:` field
57
+
58
+ The only new text in your output is inside `strengthening_question` and `three_questions_to_ask_the_writer` fields, and those are **questions** (end with `?`), not statements that hint at the answer.
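The "critic, not author" rule is enforced by the prompt itself, but the same patterns could also be checked mechanically after the fact. The sketch below is a hypothetical post-check, not part of the package: the regexes approximate the forbidden phrasings listed above, and the question check only tests for a trailing `?`.

```python
import re

# Approximations of the forbidden phrasings; bracketed slots become alternations.
FORBIDDEN = [
    re.compile(r"\bour (mission|vision|strategy) should be\b", re.I),
    re.compile(r"\ba better (strategy|diagnosis|policy) would be\b", re.I),
    re.compile(r"\bhere is a revised (strategy|diagnosis|policy)\b", re.I),
    re.compile(r"\btry something like\b", re.I),
    re.compile(r"\bconsider rewriting as\b", re.I),
    re.compile(r"\b(help rebuild|draft a new version|rewrite this for you)\b", re.I),
]


def find_violations(text, questions):
    """List forbidden-pattern hits in text plus any 'question' that lacks a trailing '?'."""
    hits = [p.pattern for p in FORBIDDEN if p.search(text)]
    hits += [f"not a question: {q!r}" for q in questions if not q.strip().endswith("?")]
    return hits
```

A pattern scan of this kind cannot catch the last bullet (rewritten artifact text in any wording), so it complements the self-check rather than replacing it.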
59
+
60
+ Why this is a hard rule: a critic who rewrites teaches the writer nothing. The writer must own the revision, or the next version will be just as bad.
61
+
62
+ ## Step 0: classify before you critique
63
+
64
+ Before applying any framework, classify **every line** of the artifact into one bucket:
65
+
66
+ | Bucket | Examples | What it is NOT |
+ |---|---|---|
+ | Value | "delight customers", "be customer-obsessed" | not a diagnosis, not a policy |
+ | Aspiration | "be the leader in X", "become #1 in Y" | not a guiding policy |
+ | Goal | "grow ARR 50%", "ship faster than competitors" | not a diagnosis |
+ | Tactic | "add more features", "redesign onboarding" | not a coherent action set |
+ | Market condition | "market is growing", "AI is disrupting" | not a diagnosis |
+ | **Diagnosis** | names *the* binding constraint + mechanism | — |
+ | **Guiding Policy** | creates leverage, names what's off-limits | — |
+ | **Coherent Action** | actions reinforcing the policy | — |
76
+
77
+ **If the artifact contains ONLY items in the top 5 rows with NO diagnosis or guiding policy, your `overall_verdict` MUST be `not_yet_a_strategy` and `rumelt_kernel.diagnosis.score` MUST be `missing`.** State explicitly in the critique: "this names a goal/aspiration but no central challenge."
78
+
79
+ Literal high-frequency patterns — if you see these verbatim, flag immediately:
80
+ - "Our mission is to delight customers" → value (not a diagnosis)
81
+ - "Be/become the leader in [X]" → aspiration (Rumelt: aspiration ≠ guiding policy)
82
+ - "Add more features faster than competitors" → tactic, not coherent action
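The Step 0 rule reduces to a simple predicate over the bucket labels. The sketch below is a hypothetical encoding of that rule only: the bucket identifiers are illustrative, and the classification of each line into a bucket (the hard part) is assumed to have already happened.

```python
# The five non-kernel buckets from the classification table (illustrative identifiers).
NON_KERNEL_BUCKETS = {"value", "aspiration", "goal", "tactic", "market_condition"}


def step0_forced_verdict(line_buckets):
    """Return the verdict fields Step 0 forces, or None when the rule does not fire.

    The rule fires only when every classified line falls in a non-kernel bucket,
    which implies no diagnosis and no guiding policy is present.
    """
    if line_buckets and all(b in NON_KERNEL_BUCKETS for b in line_buckets):
        return {
            "overall_verdict": "not_yet_a_strategy",
            "diagnosis_score": "missing",
        }
    return None


# The canonical bad-strategy shape: a value, an aspiration and a tactic.
print(step0_forced_verdict(["value", "aspiration", "tactic"]))
print(step0_forced_verdict(["goal", "diagnosis"]))  # None: the critic scores normally
```

When the function returns `None`, the verdict is left to the critique frameworks that follow.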
83
+
84
+ Worked example (the canonical bad-strategy shape):
85
+
86
+ ```yaml
+ overall_verdict: not_yet_a_strategy
+ rumelt_kernel:
+   diagnosis:
+     score: missing
+     quoted_text: "(none present)"
+     critique: |
+       The artifact names no central challenge. "Delight customers" is a
+       value; "leader in calendar tools" is an aspiration; "more features
+       faster" is a tactic list. Rumelt: a diagnosis must identify *the*
+       binding constraint and explain *why* it binds. Without one, there
+       is nothing for guiding policy to be derived from.
+     strengthening_question: "What single obstacle, if removed, would
+       unlock everything else? Name it in one sentence — without it, there
+       is no strategy to critique."
+ ```
102
+
103
+ ---
104
+
46
105
  ## Critique frameworks
47
106
 
48
107
  ### Rumelt's Kernel (always score, even if artifact doesn't name it)
@@ -160,7 +219,7 @@ All narrative content (critiques, questions, summaries) in orchestrator's langua
160
219
 
161
220
  1. Avoided generic feedback? Every critique points at a specific quoted sentence?
162
221
  2. Cited which principle is violated, not just "this is unclear"?
163
- 3. Produced strengthening questions, not rewrites?
222
+ 3. Produced strengthening questions, not rewrites? Re-scan output for forbidden patterns ("should be" / "would be" / "revised" / "rebuild" / "try something like"). Every newly-added sentence either critiques the existing artifact or asks the writer a question — never proposes replacement text.
164
223
  4. Scored Rumelt's kernel even when artifact didn't explicitly use it?
165
224
  5. Found at least one blind spot? Zero blind spots is suspicious — look harder.
166
225
  6. `overall_verdict` honest? If everything critiqued but verdict is "strong", recalibrate.
package/commands/product-feature.md CHANGED
@@ -12,4 +12,4 @@ Feature description: $ARGUMENTS
12
12
 
13
13
  Follow the Feature Extension step sequence (S1 → S4). Load product context first per rules-context.md. Display a progress indicator at each step.
14
14
 
15
- **S0 → S1 sequencing (important)**: If Context Bootstrap (S0) is triggered because `.product-context.md` is missing, you MUST complete Bootstrap and S1 in the **same turn**, then pause **after S1 completion** awaiting user confirmation before S2. Do NOT pause between S0 and S1 — even if some Bootstrap fields are still missing, write a baseline `.product-context.md` with placeholders, enter S1, and ask for the missing fields as part of the S1 confirmation question. See `references/rules-context.md` "Bootstrap S1 的順序" for details.
15
+ **S0 → S1 sequencing (important)**: If Context Bootstrap (S0) is triggered because `.product-context.md` is missing, you MUST complete Bootstrap and S1 in the **same turn**, then pause **after S1 completion** awaiting user confirmation before S2. Do NOT pause between S0 and S1 — even if some Bootstrap fields are still missing, write a baseline `.product-context.md` with placeholders, enter S1, and ask for the missing fields as part of the S1 confirmation question. See `references/rules-context.md` "Bootstrap S1 Sequencing" for details.
package/hooks/hooks.json CHANGED
@@ -20,6 +20,11 @@
20
20
  "type": "command",
  "command": "python3 ${CLAUDE_PLUGIN_ROOT}/hooks/user-prompt-detect-topic-switch.py",
  "timeout": 5
+ },
+ {
+ "type": "command",
+ "command": "python3 ${CLAUDE_PLUGIN_ROOT}/hooks/user-prompt-detect-specialist-dispatch.py",
+ "timeout": 5
  }
  ]
  }