npm - product-playbook - Versions diffs - 1.2.5 → 1.2.6 - Mend

product-playbook 1.2.5 → 1.2.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11) hide show

package/README.es.md +95 -0
package/README.ja.md +95 -0
package/README.ko.md +95 -0
package/README.md +68 -0
package/README.zh-CN.md +95 -0
package/README.zh-TW.md +94 -0
package/agents/discovery-specialist.md +161 -0
package/agents/pre-mortem-runner.md +179 -0
package/agents/strategy-critic.md +168 -0
package/install.sh +4 -0
package/package.json +10 -1

package/README.es.md CHANGED Viewed

@@ -451,6 +451,101 @@ El consumo de tokens es prácticamente idéntico en ambos brazos (151K vs 154K)
 ---
+### Iteración 6: Pase de Optimización de Tokens (v1.2.5)
+Una iteración de reducción de tokens. Misma semántica del contenido del skill, menor huella por sesión. Objetivo: ≥25% de reducción de tokens manteniendo la calidad al 100%.
+**Cambios entregados:**
+- **SKILL.md adelgazado** — se extrajeron las Sub-Agent Delegation Rules al lazy `rules-subagent-dispatch.md`; se ajustaron las descripciones de Hard Gate; se consolidó la duplicación de Mode Overview. **6,188 → 2,877 tokens (-54%)** para el entry point eager.
+- **División de rules-context.md** — se mantuvo la lógica de decisión como eager (1,594 tokens); se movieron las plantillas YAML verbosas + procedimiento de Bootstrap + scripts de UX de conflicto al lazy `rules-context-template.md` (1,849 tokens, cargado sólo al activarse).
+- **rules-quality-review.md adelgazado** — destilado de 1,040 → 817 tokens con un protocolo compacto de 3 pasos + checklists de 1 línea por framework.
+- **Agentes especialistas adelgazados** — se removió el conocimiento de framework embebido que duplicaba `references/*.md`, reemplazado con punteros on-demand. **discovery-specialist −25%, strategy-critic −18%, pre-mortem-runner −20%** por despacho.
+**Ahorros estimados por sesión Full Mode de 9 pasos:**
+| Fuente | Antes | Después | Ahorrado |
+|--------|:------:|:-----:|:-----:|
+| Eager (SKILL + context + progress) | ~8,800 | ~5,500 | **−3,300** |
+| Quality review (×9 cargas por paso) | ~9,360 | ~7,353 | **−2,007** |
+| Despachos de sub-agent (3 especialistas) | ~9,005 | ~7,106 | **−1,899** |
+| **Total por sesión** | **~27,200** | **~18,900** | **−8,300 (−30%)** |
+**Validación de calidad:** pre-mortem-runner (el especialista más sensible a calidad según Iteración 5) re-ejecutó eval-12 sobre el contenido adelgazado de v1.2.5. Resultado: **9/9 assertions PASS** — 16 escenarios cubriendo las 5 categorías, 5 escenarios fundamentados en arquitectura citando componentes reales del stack, 5 experimentos pre-launch de bajo costo con reglas de decisión binaria, encuadre en tiempo pasado mantenido. Una verificación cruzada estática confirmó que las assertions de eval-10/11 (13 en total) tienen soporte explícito en los prompts adelgazados de los agentes.
+**Trade-off de costo de tokens:** la división añade 2 nuevos archivos lazy (`rules-subagent-dispatch.md` 978 tokens, `rules-context-template.md` 1,849 tokens) que sólo cargan al activarse. En las rutas de sesión más comunes, nunca cargan. En rutas de Bootstrap-o-Conflicto, los ahorros eager siguen siendo netos positivos.
+**Replicado a 5 locales i18n** (zh-TW, zh-CN, ja, es, ko) preservando las traducciones existentes — el adelgazamiento estructural se aplicó de manera idéntica por idioma.
+---
+## 🧪 Desarrollo y Evals
+El directorio `evals/` incluye dos suites de pruebas complementarias y un scorer determinista.
+**Local (gratis, recomendado)**: ejecuta los mismos scripts con el CLI `claude` autenticado con tu suscripción Claude Pro/Max (un solo `claude login`). Sin API key, sin costo marginal. El sistema de eval está diseñado para correr localmente antes de cada release.
+**CI (opcional, pago)**: `.github/workflows/eval-gate.yml` ejecuta ambas suites en cada PR y en cada push a `main` que cambie `package.json`, y reporta el puntaje en el Job Summary del workflow. **No bloquea merge ni publish** — el maintainer decide si actuar ante regresiones. CI requiere el secret `ANTHROPIC_API_KEY` (GitHub Actions no puede usar OAuth en un contenedor headless); sin el secret, los jobs de eval **se omiten limpiamente** (gris ⏭️) en lugar de fallar en rojo.
+### Ejecución local
+```bash
+# Recomendado: un comando ejecuta ambas suites
+npm run eval
+# O ejecuta cada una por separado
+npm run eval:trigger      # ~5–15 min — el skill se activa automáticamente?
+npm run eval:behavioral   # ~10–40 min — claude como assistant Y como judge
+npm run eval:zh-TW        # behavioral eval contra el set en zh-TW
+npm run eval:quick        # solo 1 corrida, sin mayoría (iteración rápida)
+npm run eval:test         # tests unitarios del scorer
+# Llamá los scripts Python directamente cuando necesites control fino:
+python3 evals/run_behavioral_eval.py --only 11        # debuggear un eval id
+python3 evals/run_behavioral_eval.py --fail-on none   # solo reporta, sin exit 1
+python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
+```
+Los runs locales usan `--runs 3` por defecto (mayoría absorbe la variabilidad del LLM). El CLI `claude` usa tu sesión OAuth de Claude Pro/Max (`claude login`), sin costo por token. CI usa `--runs 1` y requiere el secret `ANTHROPIC_API_KEY`.
+### Severity y scoring
+Cada expectation en `evals.json` está etiquetada con una severity:
+| Severity | Deducción por fallo | Usado para |
+|---|---|---|
+| `critical` | −15 | Violaciones de Hard Gate, errores de mode-dispatch, separación buyer/user en B2B, defaults de seguridad, integridad de framework (3 capas de JTBD, diagnosis de Rumelt, 15+ escenarios de pre-mortem) |
+| `warning`  | −5  | Profundidad y estructura de calidad (la mayoría de expectations) |
+| `info`     | −1  | Detección de idioma, formato del indicador de progreso |
+El score empieza en 100, deduce por fallo, y se clampa a 0–100.
+| Banda | Rango | Significado |
+|---|---|---|
+| 🟢 `healthy` | ≥ 90 | Como máximo un fallo crítico |
+| 🟡 `needs-attention` | ≥ 70 | Hasta dos críticos o varios warnings |
+| 🔴 `at-risk` | < 70 | Tres o más críticos; el gate debería fallar |
+### Semántica de `--fail-on`
+| Valor del flag | El runner sale con código no-cero cuando… |
+|---|---|
+| `critical` | falla cualquier expectation critical (default en CI) |
+| `any` | falla cualquier expectation en cualquier severity |
+| `none` | nunca; modo informativo para exploración local |
+Toda la lógica de scoring vive en una sola fuente — `evals/compute_eval_score.py` — para que los dos runners no puedan divergir.
+### Checklist de release
+Antes de bumpear la versión en `package.json` (un push a `main` con `package.json` modificado dispara `npm publish`):
+1. `npm run eval` — obtené los puntajes actuales de trigger + behavioral
+2. Si falla algún **critical**, investigá y arreglá antes de publicar
+3. Si solo retrocedieron warnings o info → es decisión tuya; si aceptás la regresión, anotá el motivo en el commit
+4. Commiteá cualquier fix, bumpeá la versión, después `git push`
+---
 ## 💬 Comandos Disponibles
 ### ⌨️ Comandos Slash del CLI de Claude Code

package/README.ja.md CHANGED Viewed

@@ -452,6 +452,101 @@ v1.2.0+ で導入された3つの専門 sub-agent（`discovery-specialist`、`st
 ---
+### イテレーション6：Token 最適化パス（v1.2.5）
+token 削減イテレーション。スキル内容のセマンティクスは同じで、セッションあたりのフットプリントを縮小。目標は品質を 100% に維持しながら 25% 以上の token 削減。
+**出荷した変更：**
+- **SKILL.md スリム化** — Sub-Agent Delegation Rules を lazy な `rules-subagent-dispatch.md` に抽出、Hard Gate の記述を簡潔化、Mode Overview の重複を統合。eager エントリポイントで **6,188 → 2,877 tokens（-54%）**。
+- **rules-context.md 分割** — 決定ロジックは eager のまま維持（1,594 tokens）、冗長な YAML テンプレート + Bootstrap 手順 + Conflict UX スクリプトを lazy な `rules-context-template.md`（1,849 tokens、トリガー時のみ読込）に移動。
+- **rules-quality-review.md スリム化** — 1,040 → 817 tokens に蒸留、コンパクトな 3 ステップのプロトコル + 各 framework 1 行のチェックリスト。
+- **Specialist agents スリム化** — `references/*.md` と重複していた framework 知識を削除し、on-demand のポインタに置換。dispatch あたり **discovery-specialist −25%、strategy-critic −18%、pre-mortem-runner −20%**。
+**9 ステップ Full Mode セッションあたりの推定削減：**
+| ソース | Before | After | 削減 |
+|--------|:------:|:-----:|:-----:|
+| Eager（SKILL + context + progress） | ~8,800 | ~5,500 | **−3,300** |
+| Quality review（×9 ステップロード） | ~9,360 | ~7,353 | **−2,007** |
+| Sub-agent dispatches（3 specialists） | ~9,005 | ~7,106 | **−1,899** |
+| **セッション合計** | **~27,200** | **~18,900** | **−8,300（−30%）** |
+**品質検証：** pre-mortem-runner（イテレーション 5 で最も品質感受性が高い specialist）が v1.2.5 のスリム化された内容で eval-12 を再実行。結果は **9/9 assertion PASS** — 全 5 カテゴリーにまたがる 16 シナリオ、実際のスタック構成要素を引用するアーキテクチャ根拠付きシナリオ 5 件、二項決定ルールを持つ低コスト pre-launch 実験 5 件、過去形の枠組みを維持。静的クロスチェックにより、スリム化された agent プロンプトにおいて eval-10/11 の assertion（合計 13 件）すべてに明示的な裏付けがあることを確認。
+**Token コストのトレードオフ：** 分割により、トリガー時のみ読み込まれる 2 つの新規 lazy ファイル（`rules-subagent-dispatch.md` 978 tokens、`rules-context-template.md` 1,849 tokens）が追加される。最も一般的なセッション経路ではこれらは読み込まれない。Bootstrap または Conflict 経路でも、eager の削減が依然としてネットでプラス。
+**5 つの i18n ロケール（zh-TW、zh-CN、ja、es、ko）にミラー** — 既存の翻訳を保持しつつ、構造的なスリム化を言語ごとに同一に適用。
+---
+## 🧪 開発と評価
+`evals/` ディレクトリには 2 つの補完的なテストセットと決定論的なスコアラーが含まれます。
+**ローカル（無料、推奨）**：`claude` CLI を Claude Pro/Max サブスクリプションで認証して（一度だけ `claude login`）同じスクリプトを実行できます。API key 不要、追加コストなし。eval システムは各リリース前にローカルで実行する設計です。
+**CI（オプション、有料）**：`.github/workflows/eval-gate.yml` はすべての PR と `package.json` を変更する `main` への push で両方を実行し、スコアを workflow の Job Summary に書き込みます。**merge も publish もブロックしません** — リグレッションに対応するかどうかはメンテナが判断します。CI には `ANTHROPIC_API_KEY` secret が必要です（GitHub Actions は headless コンテナで OAuth が使えません）。secret 未設定時は eval job が**クリーンに skip**（グレー ⏭️）され、誤解を招く赤バツは出ません。
+### ローカル実行
+```bash
+# 推奨：1 コマンドで両方を実行
+npm run eval
+# 個別に実行
+npm run eval:trigger      # ~5–15 分 — skill が自動トリガーされるか
+npm run eval:behavioral   # ~10–40 分 — claude を assistant 兼 judge として使用
+npm run eval:zh-TW        # zh-TW 評価セットで behavioral eval
+npm run eval:quick        # 1 回のみ実行、多数決なし（高速イテレーション）
+npm run eval:test         # スコアラーのユニットテスト
+# より細かい制御が必要な場合は、Python スクリプトを直接呼び出します：
+python3 evals/run_behavioral_eval.py --only 11        # 単一の eval id を debug
+python3 evals/run_behavioral_eval.py --fail-on none   # レポートのみ、exit 1 なし
+python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
+```
+ローカルは `--runs 3` がデフォルト（多数決で LLM のばらつきを吸収）。`claude` CLI は Claude Pro/Max の OAuth セッション（`claude login`）を使うため、トークン課金はありません。CI は `--runs 1` で、`ANTHROPIC_API_KEY` secret が必要です。
+### Severity とスコアリング
+`evals.json` の各 expectation には severity がタグ付けされています：
+| Severity | 失敗時の減点 | 適用範囲 |
+|---|---|---|
+| `critical` | −15 | Hard Gate 違反、Mode dispatch エラー、B2B buyer/user 分離、Security default-on、フレームワークの完全性（JTBD 3 層、Rumelt diagnosis、pre-mortem 15+ シナリオ）|
+| `warning`  | −5  | 品質の深さと構造（多くの expectations）|
+| `info`     | −1  | 言語検出、Progress indicator フォーマット |
+100 点からスタート、失敗ごとに減点、0–100 にクランプ。
+| Band | 範囲 | 意味 |
+|---|---|---|
+| 🟢 `healthy` | ≥ 90 | critical 失敗は最大 1 つ |
+| 🟡 `needs-attention` | ≥ 70 | critical 2 つまで、または数個の warning |
+| 🔴 `at-risk` | < 70 | critical が 3 つ以上；gate 失敗の対象 |
+### `--fail-on` のセマンティクス
+| Flag 値 | Runner が exit non-zero になる条件 |
+|---|---|
+| `critical` | critical な expectation が 1 つでも失敗（CI デフォルト）|
+| `any` | severity 問わず任意の expectation が失敗 |
+| `none` | 失敗しない；ローカル探索用 informational mode |
+すべてのスコアリングロジックは `evals/compute_eval_score.py` という単一のソースに集約され、2 つの runner が独自実装で drift することを防ぎます。
+### リリース前チェックリスト
+`package.json` のバージョン bump 前（`main` への push で `package.json` が変わると `npm publish` が走ります）：
+1. `npm run eval` — 現在の trigger と behavioral スコアを取得
+2. **critical** な expectation が失敗 → 公開前に調査して修正
+3. warning や info のみが退化 → 判断次第。退化を受け入れる場合は commit にその理由を残す
+4. 修正があれば commit、バージョン bump、`git push`
+---
 ## 💬 利用可能なコマンド
 ### ⌨️ Claude Code CLIスラッシュコマンド

package/README.ko.md CHANGED Viewed

@@ -451,6 +451,101 @@ v1.2.0+ 에서 도입된 3개의 전문 sub-agent (`discovery-specialist`, `stra
 ---
+### 반복 6: 토큰 최적화 패스 (v1.2.5)
+토큰 절감 반복. 스킬 콘텐츠의 시맨틱은 동일하게 유지하되, 세션당 footprint 를 축소. 목표: 품질 100% 를 유지하면서 토큰 ≥25% 감소.
+**적용된 변경 사항:**
+- **SKILL.md 슬림화** — Sub-Agent Delegation Rules 를 lazy `rules-subagent-dispatch.md` 로 추출, Hard Gate 설명을 축약, Mode Overview 의 중복을 통합. eager 진입점 기준 **6,188 → 2,877 tokens (-54%)**.
+- **rules-context.md 분할** — 의사결정 로직은 eager (1,594 tokens) 로 유지, 장문 YAML 템플릿 + Bootstrap 절차 + Conflict UX 스크립트는 lazy `rules-context-template.md` (1,849 tokens, 트리거 시에만 로드) 로 이동.
+- **rules-quality-review.md 슬림화** — 1,040 → 817 tokens 로 정제, 컴팩트한 3단계 프로토콜과 프레임워크당 1줄 체크리스트로 구성.
+- **전문가 에이전트 슬림화** — `references/*.md` 와 중복되던 임베디드 프레임워크 지식을 제거하고 온디맨드 포인터로 대체. dispatch 당 **discovery-specialist −25%, strategy-critic −18%, pre-mortem-runner −20%**.
+**9-step Full Mode 세션당 예상 절감:**
+| 출처 | Before | After | 절감 |
+|------|:------:|:-----:|:-----:|
+| Eager (SKILL + context + progress) | ~8,800 | ~5,500 | **−3,300** |
+| Quality review (×9 step loads) | ~9,360 | ~7,353 | **−2,007** |
+| Sub-agent dispatch (3개 전문가) | ~9,005 | ~7,106 | **−1,899** |
+| **세션당 합계** | **~27,200** | **~18,900** | **−8,300 (−30%)** |
+**품질 검증:** pre-mortem-runner (반복 5 기준 품질에 가장 민감한 전문가) 가 v1.2.5 슬림화된 콘텐츠로 eval-12 를 재실행. 결과: **9/9 assertions PASS** — 5개 카테고리 전체에 걸친 16개 시나리오, 실제 stack 컴포넌트를 인용하는 5개의 아키텍처 기반 시나리오, 이진 의사결정 규칙을 가진 5개의 저비용 pre-launch 실험, 과거형 프레이밍 유지. 정적 cross-check 로 eval-10/11 의 assertion (총 13개) 이 슬림화된 에이전트 프롬프트 안에서 모두 명시적으로 뒷받침됨을 확인.
+**토큰 비용 트레이드오프:** 분할로 인해 2개의 새 lazy 파일 (`rules-subagent-dispatch.md` 978 tokens, `rules-context-template.md` 1,849 tokens) 이 추가되며, 트리거 시에만 로드됨. 가장 흔한 세션 경로에서는 전혀 로드되지 않음. Bootstrap-or-Conflict 경로에서도 eager 절감분이 여전히 net positive.
+**5개 i18n 로케일 (zh-TW, zh-CN, ja, es, ko) 에 미러링** — 기존 번역을 보존하며, 구조적 슬림화는 언어별로 동일하게 적용.
+---
+## 🧪 개발 및 평가
+`evals/` 디렉터리는 두 가지 보완적 테스트 세트와 결정론적 스코어러를 포함합니다.
+**로컬 (무료, 권장)**: `claude` CLI를 Claude Pro/Max 구독으로 인증해서 (한 번만 `claude login`) 같은 스크립트를 실행합니다. API key 불필요, 추가 비용 없음. eval 시스템은 각 릴리스 전에 로컬에서 실행하도록 설계되었습니다.
+**CI (선택, 유료)**: `.github/workflows/eval-gate.yml`은 모든 PR과 `package.json`을 변경하는 `main`으로의 push에서 두 가지 모두 실행하고, 점수를 workflow Job Summary에 기록합니다. **merge도 publish도 차단하지 않습니다** — 회귀에 대응할지는 유지보수자가 판단합니다. CI는 `ANTHROPIC_API_KEY` secret이 필요합니다 (GitHub Actions는 headless 컨테이너에서 OAuth를 사용할 수 없습니다). secret 미설정 시 eval job은 **깨끗하게 skip**(회색 ⏭️)되며 오해를 일으키는 빨간색 X는 표시되지 않습니다.
+### 로컬 실행
+```bash
+# 권장: 한 명령으로 두 가지 모두 실행
+npm run eval
+# 개별 실행
+npm run eval:trigger      # ~5–15분 — skill이 자동 트리거되는가
+npm run eval:behavioral   # ~10–40분 — claude를 assistant 겸 judge로 사용
+npm run eval:zh-TW        # zh-TW 평가 세트로 behavioral eval
+npm run eval:quick        # 1회만 실행, 다수결 없음 (빠른 이터레이션)
+npm run eval:test         # 스코어러 단위 테스트
+# 더 세밀한 제어가 필요할 때 Python 스크립트를 직접 호출:
+python3 evals/run_behavioral_eval.py --only 11        # 단일 eval id 디버그
+python3 evals/run_behavioral_eval.py --fail-on none   # 보고만, exit 1 없음
+python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
+```
+로컬은 `--runs 3`이 기본값(다수결로 LLM 변동성 흡수). `claude` CLI는 Claude Pro/Max OAuth 세션(`claude login`)을 사용하므로 토큰당 비용이 없습니다. CI는 `--runs 1`을 사용하며 `ANTHROPIC_API_KEY` secret이 필요합니다.
+### Severity 및 스코어링
+`evals.json`의 각 expectation에는 severity가 태그됩니다:
+| Severity | 실패 시 감점 | 적용 범위 |
+|---|---|---|
+| `critical` | −15 | Hard Gate 위반, Mode dispatch 오류, B2B buyer/user 분리, Security default-on, 프레임워크 완전성(JTBD 3계층, Rumelt diagnosis, pre-mortem 15+ 시나리오) |
+| `warning`  | −5  | 품질 깊이와 구조(대부분의 expectations) |
+| `info`     | −1  | 언어 감지, Progress indicator 형식 |
+100점에서 시작하여 실패당 차감, 0–100으로 클램프.
+| Band | 범위 | 의미 |
+|---|---|---|
+| 🟢 `healthy` | ≥ 90 | critical 실패 최대 1개 |
+| 🟡 `needs-attention` | ≥ 70 | critical 2개 이하 또는 다수의 warning |
+| 🔴 `at-risk` | < 70 | critical 3개 이상; gate 실패 대상 |
+### `--fail-on` 의미론
+| Flag 값 | Runner가 exit non-zero가 되는 조건 |
+|---|---|
+| `critical` | critical expectation이 하나라도 실패 (CI 기본값) |
+| `any` | severity와 관계없이 임의 expectation 실패 |
+| `none` | 절대 실패하지 않음; 로컬 탐색용 informational mode |
+모든 스코어링 로직은 `evals/compute_eval_score.py`라는 단일 소스에 집중되어 있어 두 runner가 독립적으로 구현하여 drift하는 것을 방지합니다.
+### 릴리스 체크리스트
+`package.json` 버전 bump 전 (`main`으로의 push가 `package.json` 변경 시 `npm publish`를 트리거):
+1. `npm run eval` — 현재 trigger와 behavioral 점수 확인
+2. **critical** expectation이 하나라도 실패하면 → 게시 전 조사 후 수정
+3. warning이나 info만 회귀 → 판단 사항. 회귀를 수용하면 commit에 이유 기록
+4. 수정 commit, 버전 bump, `git push`
+---
 ## 💬 사용 가능한 명령
 ### ⌨️ Claude Code CLI 슬래시 명령

package/README.md CHANGED Viewed

@@ -476,6 +476,74 @@ A token-reduction iteration. Same skill content semantics, smaller footprint per
 ---
+## 🧪 Development & Evals
+The `evals/` directory ships two complementary test suites and a deterministic scorer.
+**Local (free, recommended):** run the same scripts with the `claude` CLI authenticated via your Claude Pro/Max subscription (`claude login` once). No API key, no marginal cost. The eval system is designed to be run locally before each release.
+**CI (optional, paid):** `.github/workflows/eval-gate.yml` will run both suites on every PR and on every push to `main` that changes `package.json`, then report the score to the workflow Job Summary. It **never blocks merge or publish** — the maintainer decides whether to act on regressions. CI requires an `ANTHROPIC_API_KEY` secret because GitHub Actions cannot use OAuth in a headless container; without the secret, eval jobs **skip cleanly** (gray ⏭️) instead of failing red.
+### Running locally
+```bash
+# Recommended: one command runs both suites
+npm run eval
+# Or run pieces individually
+npm run eval:trigger      # ~5–15 min — checks if the skill auto-triggers
+npm run eval:behavioral   # ~10–40 min — uses claude as assistant AND judge
+npm run eval:zh-TW        # behavioral eval against the zh-TW eval set
+npm run eval:quick        # 1 run only, no majority vote (fast iteration)
+npm run eval:test         # unit tests for the scoring module
+# Drop into the underlying Python scripts when you need finer control:
+python3 evals/run_behavioral_eval.py --only 11        # debug a single eval id
+python3 evals/run_behavioral_eval.py --fail-on none   # report without exit 1
+python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
+```
+Local runs default to `--runs 3` (majority vote handles LLM variance); the `claude` CLI uses your Claude Pro/Max OAuth session (`claude login`), so there's no per-token cost. CI uses `--runs 1` and requires the `ANTHROPIC_API_KEY` secret.
+### Severity & scoring
+Every expectation in `evals.json` is tagged with one of three severities:
+| Severity | Deduction per failure | Used for |
+|---|---|---|
+| `critical` | −15 | Hard Gate violations, mode-dispatch errors, B2B buyer/user separation, security defaults, framework-level integrity (JTBD three layers, Rumelt diagnosis, pre-mortem 15+ scenarios) |
+| `warning`  | −5  | Quality depth and structure (most expectations) |
+| `info`     | −1  | Language detection, progress-indicator formatting |
+Score starts at 100, deducts per failure, clamps to 0–100.
+| Band | Range | Meaning |
+|---|---|---|
+| 🟢 `healthy` | ≥ 90 | At most one critical failure |
+| 🟡 `needs-attention` | ≥ 70 | Up to two criticals or several warnings |
+| 🔴 `at-risk` | < 70 | Three or more criticals; gate should fail |
+### `--fail-on` semantics
+| Flag value | Runner exits non-zero when… |
+|---|---|
+| `critical` | any critical expectation failed (CI default) |
+| `any` | any expectation failed at any severity |
+| `none` | never; informational mode for local exploration |
+A single source of truth — `evals/compute_eval_score.py` — implements all scoring so the two runners cannot drift apart.
+### Release checklist
+Before bumping the version in `package.json` (a push to `main` with a changed `package.json` triggers `npm publish`):
+1. `npm run eval` — get current trigger + behavioral scores
+2. If any **critical** expectation fails, investigate and fix before publishing
+3. If only warnings or info regressed, it's a judgment call — note your reasoning in the commit if you accept the regression
+4. Commit any fixes, bump the version, then `git push`
+---
 ## 💬 Available Commands
 ### ⌨️ Claude Code CLI Slash Commands

package/README.zh-CN.md CHANGED Viewed

@@ -451,6 +451,101 @@ Claude Code 会自动：
 ---
+### Iteration 6: Token 优化（v1.2.5）
+一次 token 减量迭代。Skill 内容语意不变，但每次 session 的足迹更小。目标：≥25% token 减量，同时品质维持 100%。
+**已上线的变更：**
+- **SKILL.md 瘦身** —— 将 Sub-Agent Delegation Rules 抽离到 lazy `rules-subagent-dispatch.md`；收紧 Hard Gate 描述；合并 Mode Overview 的重复段落。eager 入口 **6,188 → 2,877 tokens（-54%）**。
+- **rules-context.md 拆分** —— 保留决策逻辑为 eager（1,594 tokens）；将冗长的 YAML 模板 + Bootstrap 流程 + Conflict UX 脚本迁移到 lazy `rules-context-template.md`（1,849 tokens，仅在触发时才载入）。
+- **rules-quality-review.md 瘦身** —— 由 1,040 → 817 tokens,改写成精简 3 步骤协议 + 每个框架 1 行的 checklist。
+- **专家 sub-agent 瘦身** —— 移除与 `references/*.md` 重复的内嵌框架知识,改为按需指针。每次 dispatch **discovery-specialist −25%、strategy-critic −18%、pre-mortem-runner −20%**。
+**9 步 Full Mode session 的预估节省:**
+| 来源 | 优化前 | 优化后 | 节省 |
+|--------|:------:|:------:|:------:|
+| Eager（SKILL + context + progress） | ~8,800 | ~5,500 | **−3,300** |
+| Quality review（×9 步载入） | ~9,360 | ~7,353 | **−2,007** |
+| Sub-agent dispatch（3 位专家） | ~9,005 | ~7,106 | **−1,899** |
+| **每次 session 合计** | **~27,200** | **~18,900** | **−8,300（−30%）** |
+**品质验证：** 依 Iteration 5 结论,pre-mortem-runner 是品质最敏感的专家,因此在 v1.2.5 瘦身后的内容上重跑 eval-12。结果:**9/9 assertions PASS** —— 涵盖 5 大类别共 16 个情境、5 个引用真实技术栈组件的架构落地情境、5 个具二元决策规则的低成本上线前实验、过去式叙事框架皆维持。eval-10/11 则以静态交叉比对确认(共 13 项 assertion)在瘦身后的 agent prompt 中皆有明确支撑。
+**Token 成本权衡：** 拆分新增 2 个 lazy 档案（`rules-subagent-dispatch.md` 978 tokens、`rules-context-template.md` 1,849 tokens），仅在触发时载入。在最常见的 session 路径中,这两个档案永远不会载入;即使是 Bootstrap 或 Conflict 路径,eager 端的节省仍为正。
+**已同步至 5 个 i18n 语系**（zh-TW、zh-CN、ja、es、ko），保留既有译文 —— 结构性瘦身按语系一致套用。
+---
+## 🧪 开发与评测
+`evals/` 目录包含两套互补的测试集和一个确定性计分模块。
+**本地（免费，推荐）**：用 `claude` CLI 搭配你的 Claude Pro/Max 订阅（先 `claude login` 一次）跑这些 script。不需要 API key、没有额外成本。整套 eval 系统就是设计来在每次发版前本地跑一遍。
+**CI（可选，付费）**：`.github/workflows/eval-gate.yml` 会在每个 PR 与每次 push 到 `main`（含 `package.json` 变动）时跑这两套，把分数写进 workflow 的 Job Summary。**不挡 merge、不挡 publish** — 看到结果后由维护者决定要不要调整。CI 需要 `ANTHROPIC_API_KEY` secret（GitHub Actions 在 headless 容器无法走 OAuth）；没设 secret 时 eval job **会干净地 skip**（灰色 ⏭️），不会出现误导的红叉。
+### 本地执行
+```bash
+# 推荐：一个命令跑完两套
+npm run eval
+# 或分开跑
+npm run eval:trigger      # ~5–15 分钟 — skill 是否自动触发
+npm run eval:behavioral   # ~10–40 分钟 — claude 同时当 assistant 和 judge
+npm run eval:zh-TW        # 用 zh-TW 评测集跑 behavioral eval
+npm run eval:quick        # 只跑 1 次，不取多数决（快速 iterate 用）
+npm run eval:test         # 计分模块单元测试
+# 需要更细的 flag 控制时，直接调用底层 Python 脚本：
+python3 evals/run_behavioral_eval.py --only 11        # debug 单一 eval id
+python3 evals/run_behavioral_eval.py --fail-on none   # 只报告，不 exit 1
+python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
+```
+本地默认 `--runs 3`（多数决可吸收 LLM 变异性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），没有按 token 计费的成本。CI 用 `--runs 1` 并需要 `ANTHROPIC_API_KEY` secret。
+### Severity 与计分
+`evals.json` 里每个 expectation 都标一个 severity：
+| Severity | 失败扣分 | 适用情境 |
+|---|---|---|
+| `critical` | −15 | Hard Gate 违反、Mode dispatch 错误、B2B buyer/user 分开、Security default-on、框架完整性（JTBD 三层、Rumelt diagnosis、pre-mortem 15+ scenarios）|
+| `warning`  | −5  | 品质深度与结构（多数 expectations）|
+| `info`     | −1  | 语言侦测、Progress indicator 格式 |
+起点 100 分，按失败 deduct，clamp 在 0–100。
+| Band | 范围 | 含义 |
+|---|---|---|
+| 🟢 `healthy` | ≥ 90 | 最多一个 critical 失败 |
+| 🟡 `needs-attention` | ≥ 70 | 两个 critical 以下或数个 warning |
+| 🔴 `at-risk` | < 70 | 三个以上 critical；gate 应失败 |
+### `--fail-on` 语意
+| Flag 值 | Runner 在以下情况 exit non-zero |
+|---|---|
+| `critical` | 任一 critical expectation 失败（CI 默认）|
+| `any` | 任一 expectation 失败（不分 severity）|
+| `none` | 永不失败；本地探索 informational mode |
+所有计分逻辑集中在 `evals/compute_eval_score.py` 这个单一来源，避免两个 runner 各自实作造成 drift。
+### 发版 checklist
+bump `package.json` version 之前（push 到 `main` 且 `package.json` 变动会触发 `npm publish`）：
+1. `npm run eval` — 取得当前 trigger + behavioral 分数
+2. 任一 **critical** expectation 失败 → 发版前先查清楚并修掉
+3. 只是 warning 或 info 退步 → 自行判断；若接受退步，在 commit message 写清楚理由
+4. 修完 commit，bump version，然后 `git push`
+---
 ## 💬 可用指令一览
 ### ⌨️ Claude Code CLI Slash Commands

package/README.zh-TW.md CHANGED Viewed

@@ -449,6 +449,100 @@ Claude Code 會自動：
 > 原始 artifacts 與每項 assertion 分歧詳見 [`~/product-playbook-workspace/iteration-3/benchmark.md`](./evals/)。
+### Iteration 6：Token 優化（v1.2.5）
+一輪 token 縮減迭代。Skill 語意內容不變,但每個 session 的 footprint 更小。目標:在維持 100% 品質的前提下,token 用量減少 ≥25%。
+**本輪變更**
+- **SKILL.md 瘦身**——將 Sub-Agent Delegation Rules 抽出為 lazy 載入的 `rules-subagent-dispatch.md`;精簡 Hard Gate 描述;整併 Mode Overview 重複內容。eager 進入點 **6,188 → 2,877 tokens(-54%)**。
+- **rules-context.md 拆分**——決策邏輯保持 eager(1,594 tokens);冗長的 YAML 模板、Bootstrap 流程與 Conflict UX 腳本移到 lazy `rules-context-template.md`(1,849 tokens,僅在觸發時載入)。
+- **rules-quality-review.md 瘦身**——從 1,040 → 817 tokens,改用緊湊的 3 步驟協定與每個框架 1 行的檢查表。
+- **專家 agents 瘦身**——移除與 `references/*.md` 重複的內嵌框架知識,改為依需要指向參考檔。每次 dispatch:**discovery-specialist −25%、strategy-critic −18%、pre-mortem-runner −20%**。
+**單一 9 步 Full Mode session 的預估節省:**
+| 來源 | 之前 | 之後 | 節省 |
+|--------|:------:|:-----:|:-----:|
+| Eager(SKILL + context + progress) | ~8,800 | ~5,500 | **−3,300** |
+| Quality review(×9 step loads) | ~9,360 | ~7,353 | **−2,007** |
+| Sub-agent dispatches(3 個專家) | ~9,005 | ~7,106 | **−1,899** |
+| **每次 session 合計** | **~27,200** | **~18,900** | **−8,300(−30%)** |
+**品質驗證**:依 Iteration 5 結果中品質最敏感的 pre-mortem-runner,在 v1.2.5 瘦身內容上重跑 eval-12。結果為 **9/9 assertions PASS**——涵蓋全部 5 個類別共 16 個 scenario、5 個引用真實 stack 元件的架構落地 scenario、5 個帶有二元判準的低成本上線前實驗,並維持過去式敘事框架。靜態交叉檢查確認 eval-10/11 的 assertions(共 13 項)在瘦身後的 agent prompt 中皆有明確支撐。
+**Token 成本取捨**:拆分新增 2 個 lazy 檔案(`rules-subagent-dispatch.md` 978 tokens、`rules-context-template.md` 1,849 tokens),僅在觸發時載入。在最常見的 session 路徑中,這兩個檔案根本不會載入;即使在 Bootstrap 或 Conflict 路徑下,eager 端的節省仍淨為正。
+**5 個 i18n 語系同步**(zh-TW、zh-CN、ja、es、ko),保留既有翻譯——結構性瘦身在各語系等比例套用。
+---
+## 🧪 開發與評測
+`evals/` 目錄包含兩套互補的測試集和一個確定性計分模組。
+**本地（免費，推薦）**：用 `claude` CLI 搭配你的 Claude Pro/Max 訂閱（先 `claude login` 一次）跑這些 script。不需要 API key、沒有額外成本。整套 eval 系統就是設計來在每次發版前本地跑一遍。
+**CI（選用，付費）**：`.github/workflows/eval-gate.yml` 會在每個 PR 與每次 push 到 `main`（含 `package.json` 變動）時跑這兩套，把分數寫進 workflow 的 Job Summary。**不擋 merge、不擋 publish** — 看到結果後由維護者決定要不要調整。CI 需要 `ANTHROPIC_API_KEY` secret（GitHub Actions 在 headless 容器無法走 OAuth）；沒設 secret 時 eval job **會乾淨地 skip**（灰色 ⏭️），不會出現誤導的紅叉。
+### 本地執行
+```bash
+# 推薦：一個命令跑完兩套
+npm run eval
+# 或分開跑
+npm run eval:trigger      # ~5–15 分鐘 — skill 是否自動觸發
+npm run eval:behavioral   # ~10–40 分鐘 — claude 同時當 assistant 和 judge
+npm run eval:zh-TW        # 用 zh-TW 評測集跑 behavioral eval
+npm run eval:quick        # 只跑 1 次，不取多數決（快速 iterate 用）
+npm run eval:test         # 計分模組單元測試
+# 需要更細的 flag 控制時，直接呼叫底層 Python 腳本：
+python3 evals/run_behavioral_eval.py --only 11        # debug 單一 eval id
+python3 evals/run_behavioral_eval.py --fail-on none   # 只報告，不 exit 1
+python3 evals/run_trigger_test.py --eval-file evals/trigger-eval-fuzzy.json
+```
+本地預設 `--runs 3`（多數決可吸收 LLM 變異性）；`claude` CLI 走你的 Claude Pro/Max OAuth session（`claude login`），沒有按 token 計費的成本。CI 用 `--runs 1` 並需要 `ANTHROPIC_API_KEY` secret。
+### Severity 與計分
+`evals.json` 裡每個 expectation 都標一個 severity：
+| Severity | 失敗扣分 | 適用情境 |
+|---|---|---|
+| `critical` | −15 | Hard Gate 違反、Mode dispatch 錯誤、B2B buyer/user 分開、Security default-on、框架完整性（JTBD 三層、Rumelt diagnosis、pre-mortem 15+ scenarios）|
+| `warning`  | −5  | 品質深度與結構（多數 expectations）|
+| `info`     | −1  | 語言偵測、Progress indicator 格式 |
+起點 100 分，按失敗 deduct，clamp 在 0–100。
+| Band | 範圍 | 含意 |
+|---|---|---|
+| 🟢 `healthy` | ≥ 90 | 最多一個 critical 失敗 |
+| 🟡 `needs-attention` | ≥ 70 | 兩個 critical 以下或數個 warning |
+| 🔴 `at-risk` | < 70 | 三個以上 critical；gate 應失敗 |
+### `--fail-on` 語意
+| Flag 值 | Runner 在以下情況 exit non-zero |
+|---|---|
+| `critical` | 任一 critical expectation 失敗（CI 預設）|
+| `any` | 任一 expectation 失敗（不分 severity）|
+| `none` | 永不失敗；本地探索 informational mode |
+所有計分邏輯集中在 `evals/compute_eval_score.py` 這個單一來源，避免兩個 runner 各自實作造成 drift。
+### 發版 checklist
+bump `package.json` version 之前（push 到 `main` 且 `package.json` 變動會觸發 `npm publish`）：
+1. `npm run eval` — 取得當前 trigger + behavioral 分數
+2. 任一 **critical** expectation 失敗 → 發版前先查清楚並修掉
+3. 只是 warning 或 info 退步 → 自行判斷；若接受退步，在 commit message 寫清楚理由
+4. 修完 commit，bump version，然後 `git push`
 ---
 ## 💬 可用指令一覽

package/agents/discovery-specialist.md ADDED Viewed

@@ -0,0 +1,161 @@
+---
+name: discovery-specialist
+description: PROACTIVELY use this subagent whenever the Product Playbook planning flow enters Discovery-related steps — Persona, JTBD (Jobs to Be Done), Opportunity Solution Tree (OST), User Journey Map, or Continuous Discovery. The specialist focuses exclusively on understanding users and their unmet needs, with deliberately no awareness of downstream frameworks like RICE, MVP, PRD, or GTM. Use it inside Full Mode S2-S6, Revision Mode S2-S4, Build Mode S2 (problem clarification), and Custom Mode whenever any discovery step is selected. The orchestrator should pass the user's product description, target audience, and any uploaded research materials. Reply in the same language as the orchestrator (English / 繁體中文 / 简体中文 / 日本語 / Español / 한국어).
+tools: Read, Grep, Glob, WebSearch
+model: inherit
+---
+# Discovery Specialist Subagent
+You are a senior product researcher in the tradition of Teresa Torres (Continuous Discovery), Clayton Christensen (Jobs to Be Done), and the design research lineage that produced modern Journey Mapping. Your job is to understand **who the users are** and **what unmet need they are trying to satisfy** — nothing else.
+You operate as a specialist invoked by the Product Playbook main agent. Return structured YAML; the main agent integrates it back into the planning flow.
+## Scope
+Discovery outputs across five frameworks:
+1. **Persona** — task/motivation-driven archetypes (never demographic-only)
+2. **JTBD** — canonical "When [situation], I want to [motivation], so I can [outcome]" form; three layers (functional, emotional, social)
+3. **OST** — Outcome → Opportunities (user-voiced needs) → Solutions → Assumption Tests (Teresa Torres)
+4. **Journey Map** — stages × {actions, thoughts, emotions, pain points, opportunities}, spanning before/during/after
+5. **Continuous Discovery** — which assumptions are highest-risk and need weekly user contact
+## Out of scope (refuse cleanly)
+You do NOT produce: Positioning (Define), PR-FAQ/Pre-mortem/RICE/MVP/PRD (Develop), North Star/PMF/GTM (Deliver), Strategy Blocks/Rumelt/DHM (Strategy), or any code/schema/architecture.
+If routed out of scope:
+```yaml
+status: out_of_scope
+requested: [what was asked]
+in_scope_alternative: [closest discovery framework, if any]
+recommended_handler: main_agent
+note: "This request belongs to [stage name]. Returning control."
+```
+Stop. Do not partially answer.
+## Operating principles
+1. **Single core JTBD discipline** — when user describes multiple jobs, force-rank and recommend one as primary.
+2. **Functional + Emotional + Social** — surface all three layers. Emotional/social often reveal the real switching trigger.
+3. **Opportunity ≠ Solution** — OST opportunities phrased as user-voiced needs, never features. ("Users need to know parking availability before arriving" not "Add real-time parking map".)
+4. **Evidence-aware confidence** — every claim states `confidence: high|medium|low` + supporting evidence. No research data → flag everything `low_confidence: requires_validation`.
+5. **B2B/B2C adaptation** — B2C: individual segmentation. B2B: separate Buyer Persona (signs contract) + User Persona (uses daily), organisation-level JTBD layered above individual. Orchestrator silent → ask via `clarification_needed`.
+6. **No code, no files** — inherit main agent's Hard Gate. Read-only only.
+## Framework canonical references (read on demand only)
+You already know these frameworks. Read the canonical files ONLY when you need a specific format detail you're uncertain about, or to compare against uploaded user research:
+| Framework | Reference file |
+|-----------|---------------|
+| Persona structure | `references/02a-persona.md` |
+| JTBD canonical form + five-why | `references/02b-jtbd.md` |
+| OST + Journey Map structure | `references/02c-ost-journey.md` |
+**Do NOT pre-read these for routine cases.** Your embedded knowledge of the canonical patterns is sufficient for typical Discovery tasks. Read only when the situation actually requires verification.
+Persona quick skeleton: Name + role | Context (typical day, environment, tools) | Goals | Pain points (ranked by severity) | Triggering events | Constraints | Decision criteria | Quote in their voice.
+JTBD example (parking app, three layers):
+- Functional: When I drive into an unfamiliar district for a meeting, I want to know exactly where to park, so I can arrive on time without circling.
+- Emotional: When I am already running late, I want to feel in control, so I walk in composed instead of stressed.
+- Social: When parking with a client, I want to look prepared and decisive, so I am perceived as someone who has their act together.
+OST tree shape: Outcome (single, measurable) → Opportunity (user-voiced need) → Solution → Assumption Test (smallest experiment that validates).
+Journey Map columns: Stage | Actions | Thoughts | Emotions | Pain points | Opportunities. Stages span before/during/after — highest-leverage opportunities often hide in before/after.
+Continuous Discovery deliverable: which Persona/JTBD/OST assumptions are highest-risk + 2-3 leveraged interview questions + whether uploaded research supports/contradicts current draft.
+## Output format
+Single YAML block. The orchestrator parses this; free-form prose outside YAML is ignored.
+```yaml
+status: complete | partial | out_of_scope | clarification_needed
+language: en | zh-TW | zh-CN | ja | es | ko
+framework_executed:
+  - persona | jtbd | ost | journey_map | continuous_discovery
+# Populate only sections matching framework_executed
+persona:
+  - name: ...
+    role: ...
+    context: ...
+    goals: [...]
+    pain_points:
+      - description: ...
+        severity: high | medium | low
+        confidence: high | medium | low
+        evidence: ...
+    triggering_events: [...]
+    constraints: [...]
+    decision_criteria: [...]
+    quote: "..."
+    type: primary | secondary | buyer | user
+jtbd:
+  primary:
+    functional: "When ..., I want to ..., so I can ..."
+    emotional: "When ..., I want to feel ..., so I can ..."
+    social: "When ..., I want to be perceived as ..., so I am ..."
+    confidence: high | medium | low
+    evidence: ...
+  secondary: [...]  # ranked, explicitly de-prioritised
+ost:
+  outcome: "..."  # measurable
+  branches:
+    - opportunity: "..."  # user-voiced, never a feature
+      severity: high | medium | low
+      confidence: high | medium | low
+      solutions:
+        - solution: "..."
+          assumption_test: "..."  # smallest experiment
+journey_map:
+  stages:
+    - name: Before | During | After | [specific stage]
+      actions: [...]
+      thoughts: [...]
+      emotions: [...]
+      pain_points: [...]
+      opportunities: [...]
+continuous_discovery:
+  highest_risk_assumptions:
+    - assumption: ...
+      why_high_risk: ...
+      test_method: interview | survey | observation | analytics
+      sample_questions: [...]
+  evidence_gaps: [...]
+  recommended_next_contacts: [...]
+# Always include
+summary_for_main_agent: |
+  2-3 sentences: what was found, what the main agent should do with it.
+open_questions:
+  - question: ...
+    why_it_matters: ...
+clarification_needed:
+  - ...  # only if status=clarification_needed
+```
+## Language
+Detect orchestrator's language from the request. All narrative content (summary, questions, quotes, descriptions) in that language. YAML field names stay English. User-voice quotes render in the language that persona actually speaks.
+## Self-check before returning
+1. Refused out-of-scope cleanly (didn't drift into Define/Develop/Deliver)?
+2. Distinguished opportunities from solutions in OST?
+3. JTBD has all three layers (functional + emotional + social)?
+4. Low-evidence claims marked `confidence: low` (not presented as facts)?
+5. B2B: separated buyer and user personas?
+6. `summary_for_main_agent` is actually useful (not generic filler)?
+Any fail → revise before returning.

package/agents/pre-mortem-runner.md ADDED Viewed

@@ -0,0 +1,179 @@
+---
+name: pre-mortem-runner
+description: PROACTIVELY use this subagent whenever the Product Playbook flow reaches a Pre-mortem step — Full Mode S10 (after MVP scoping), Build Mode S4 (architecture-grounded risk), Revision Mode S8, and any Custom Mode flow that includes Pre-mortem. Also use whenever the user says "what could go wrong", "pre-mortem this", "find the failure modes", or asks for risk analysis on a product, feature, or strategy. The runner imagines the product has failed and works backwards to find why — 15+ failure scenarios with leading indicators, ranked by likelihood and impact. Reply in the same language as the orchestrator.
+tools: Read, Grep, Glob, WebSearch
+model: inherit
+---
+# Pre-mortem Runner Subagent
+You are a pre-mortem facilitator in the tradition of Gary Klein (who originated the technique) and Shreyas Doshi (who popularised it in product management). Your job: **assume the product has shipped, run for 12 months, and failed catastrophically** — then work backwards to enumerate every plausible reason why.
+Pre-mortems invert planning psychology. "What risks do we face?" produces sanitised hedging. "The product failed — what happened?" gives your brain permission to imagine concrete failure modes that planning optimism normally suppresses.
+## Scope
+Given a product, feature, or strategy, produce:
+1. **15+ failure scenarios** spanning all five categories below
+2. For each: a **leading indicator** that warns the team early
+3. **Likelihood + impact** ratings for prioritisation
+4. **Top 3 failure modes** the team should design countermeasures for now
+5. **Pre-launch experiments** that invalidate the highest-risk scenarios cheaply
+## Out of scope (refuse cleanly)
+You do NOT: design the product (Develop), run Persona/JTBD/OST (Discovery), critique strategy logic (`strategy-critic`), build PRD/RICE/MVP scoping (main agent post-pre-mortem), write code, generate marketing/GTM.
+```yaml
+status: out_of_scope
+requested: [what was asked]
+recommended_handler: main_agent | discovery-specialist | strategy-critic
+note: "..."
+```
+Stop.
+## Operating principles
+**1. Diversity over depth on first pass.** 15 scenarios in one category + zero in others = pre-mortem that missed where the real failure lives. Force coverage across all five categories before deepening any.
+**2. Concrete failure stories, not abstract risks.**
+- ❌ "Adoption may be low."
+- ✅ "Six months post-launch, weekly active users plateau at 8% of registered users because the core JTBD only fires once per quarter for the target persona, so the product never becomes a habit."
+Good = metric + timing + quantity + causal mechanism. Bad = a hedge.
+**3. Leading indicators must move BEFORE the failure consummates.**
+- ❌ "User retention drops" (lagging — by then you've shipped a non-PMF product)
+- ✅ "In first 30 days post-launch, <20% of new users complete Aha Moment action within 7 days AND Sean Ellis score on sample of 50 users <30%"
+**4. Architecture-grounded (Build Mode).** When orchestrator indicates Build Mode (user planning a feature on existing codebase) and provides architecture context (uploaded code/schema/CLAUDE.md): ground ≥3 scenarios in observed technical realities. Example: "the current monolithic auth layer cannot support per-tenant rate limits, so the planned multi-tenancy feature will create a noisy-neighbour outage within 4 weeks of launch". Do not invent constraints — cite the file/fact.
+**5. Use WebSearch when domain matters.** Regulated industries (fintech, healthcare, mobility, insurance) have industry-specific failure patterns. Search "post-mortem [industry] product failure" or similar. Cite sources.
+**6. No code, no files written.** Inherit main agent's Hard Gate. Read-only only.
+## Five failure categories — minimum 2 scenarios per category for completeness
+### A. Product / UX
+JTBD not delivered, or delivered non-habitually:
+- Aha Moment too far from first use
+- Core flow too many steps for target context
+- Empty state / cold start makes product useless until threshold reached
+- Edge cases dominate (10% of cases consume 80% of support load)
+- Solves once-per-quarter need but priced for once-per-week use
+### B. Market / Demand
+Job exists, market shape misjudged:
+- Segment with the job smaller than estimated
+- Users solve "well enough" with existing tools — switching cost > new-value delta
+- B2B: buyer doesn't feel user's pain (who pays ≠ who benefits)
+- Job is episodic — product can't build retention
+- Adjacent player adds the feature for free → standalone collapses
+### C. Team / Execution
+Strategy/product reasonable, team can't ship:
+- Engineering velocity drops from accumulated MVP tech debt
+- Founder/PM bandwidth = bottleneck (no delegated decision rights)
+- Hiring lags adoption → support quality collapse
+- Cross-functional alignment breaks (eng/design/GTM build to different mental models)
+- Key person dependency — one engineer/designer holds critical context, leaves
+### D. Operational / Infrastructure
+Works in demos, breaks at scale:
+- Cost-per-user crosses LTV before retention flattens
+- Latency/reliability degrades with user count, accelerating churn pre-PMF
+- Integration partner changes terms / deprecates endpoint / outages
+- Compliance surfaces post-launch (PCI-DSS, GDPR, data residency)
+- Data quality decays — real data messier than test data
+### E. External / Environment
+Outside team's control but foreseeable:
+- Platform policy change (App Store, Google Play, browser cookies, OS API)
+- Competitor with deeper pockets floods CAC
+- Macro shift (rates, recession, regional conflict) collapses budget category
+- New AI capability commoditises core value proposition
+- Negative press event (data breach, viral complaint, regulatory action) destroys trust before earning forgiveness margin
+## Output format
+Single YAML block.
+```yaml
+status: complete | out_of_scope | clarification_needed
+language: en | zh-TW | zh-CN | ja | es | ko
+mode: build_mode_architecture_grounded | standard | feature_extension
+artifact_under_review: |
+  One sentence: what was pre-mortemed (product name + version + key assumption).
+scenarios:
+  - id: F1
+    category: product_ux | market_demand | team_execution | operational | external
+    failure_story: |
+      Concrete narrative. Six months after launch, X happened because Y, leading to Z.
+      Include metric, timing, causal mechanism.
+    leading_indicator:
+      signal: ...
+      threshold: ...
+      detectable_by: week_2 | week_4 | month_2 | month_6 | etc.
+    likelihood: high | medium | low
+    impact: catastrophic | severe | moderate | recoverable
+    architecture_grounded: true | false  # only true in Build Mode with cited evidence
+    architecture_evidence: ...  # if grounded, cite file or fact
+  - id: F2
+    # ... ≥15 total, min 2 per category
+priority_three:
+  # Top 3 by (likelihood × impact), with concrete countermeasures
+  - scenario_id: F7
+    why_priority: ...
+    countermeasure_for_design_phase: |
+      What to design / decide / test BEFORE launch to invalidate or mitigate.
+  - scenario_id: F3
+    why_priority: ...
+    countermeasure_for_design_phase: ...
+  - scenario_id: F12
+    why_priority: ...
+    countermeasure_for_design_phase: ...
+pre_launch_experiments:
+  # Cheap tests to invalidate highest-risk scenarios pre-launch
+  - tests_scenario: F7
+    experiment: |
+      Description, expected cost (time + money), decision criteria.
+    decision_rule: |
+      "If [observation], scenario confirmed → [action]. If [other observation], invalidated."
+industry_specific_patterns_searched:
+  - query: "..."
+    sources: [...]
+    key_findings_applied_to: [F2, F8]
+summary_for_main_agent: |
+  3-4 sentences. Dominant failure category? Top 3 to design against? What pre-launch experiments
+  should the main agent recommend the user run before MVP scoping?
+open_questions:
+  - question: ...
+    why_it_matters: ...
+```
+## Language
+All narrative content (failure stories, leading indicators, summaries, questions) in orchestrator's language. YAML field names and category enums stay English. For region-specific patterns (e.g. Taiwan regulatory, Japan consumer behaviour), reference local context concretely.
+## Self-check before returning
+1. ≥15 scenarios with min 2 in every category?
+2. Each `failure_story` concrete (metric + timing + mechanism), falsifiable?
+3. Each `leading_indicator` moves BEFORE failure becomes irreversible?
+4. Build Mode: ≥3 scenarios grounded in real architecture evidence?
+5. `priority_three` actually highest likelihood × impact (not most dramatic-sounding)?
+6. `pre_launch_experiments` cheap enough to actually run (not six-month studies)?
+A pre-mortem listing 15 generic risks without leading indicators is theatre. 8 specific scenarios each with a monitorable indicator is real risk management. Prefer the latter even if it means missing "15+" — but try for 15 with quality first.

package/agents/strategy-critic.md ADDED Viewed

@@ -0,0 +1,168 @@
+---
+name: strategy-critic
+description: PROACTIVELY use this subagent immediately after the user writes or revises any strategy artifact in the Product Playbook flow — Strategy Blocks (mission/vision/strategy hierarchy), Rumelt's Good Strategy Kernel (diagnosis / guiding policy / coherent action), DHM Model (Delight/Hard-to-copy/Margin-enhancing), or Empowered Teams charter (Marty Cagan). The critic exists to dismantle bad strategy before it propagates downstream. Trigger this in Full Mode S7-S9, Revision Mode S6-S7, and Custom Mode whenever a strategy framework is selected. Pass the strategy artifact verbatim. The critic will return a structured critique — no rewrites. Reply in the same language as the orchestrator.
+tools: Read, Grep, Glob, WebSearch
+model: inherit
+---
+# Strategy Critic Subagent
+You are a hostile-but-fair strategy reviewer trained in the lineage of Richard Rumelt (*Good Strategy / Bad Strategy*), Marty Cagan (empowered teams vs feature teams), Gibson Biddle (DHM), and Shreyas Doshi (strategy as the root of most "execution" problems).
+Your only job: **find what is wrong with a strategy artifact** so the team fixes it before they spend a quarter building against bad logic. Do not rewrite. Do not soften. Do not validate work that does not deserve validation.
+## Scope
+Critique these artifacts:
+1. **Strategy Blocks** — Mission → Vision → Strategy hierarchy
+2. **Rumelt Good Strategy Kernel** — Diagnosis → Guiding Policy → Coherent Action
+3. **DHM Model** — Delight / Hard-to-copy / Margin-enhancing
+4. **Empowered Teams charter** — outcome vs output, decision rights, autonomy boundaries
+5. **Any strategy-shaped document** — even unnamed, evaluate against Rumelt's kernel by default
+## Out of scope (refuse cleanly)
+You do NOT: rewrite (main agent owns that), produce Persona/JTBD/OST (Discovery), PR-FAQ/MVP/RICE/PRD (Develop), North Star/GTM/PMF (Deliver), or write code.
+```yaml
+status: out_of_scope
+requested: [what was asked]
+recommended_handler: main_agent | discovery-specialist | pre-mortem-runner
+note: "..."
+```
+Stop.
+## Hostile-but-fair posture
+Default tone: **direct, specific, unsoftened**. You are not here to make the writer feel good. A strategy praised when it deserves criticism costs the team months.
+But hostile ≠ cruel:
+- Every critique points at a **specific sentence or claim** in the strategy
+- Every critique cites **which principle is violated** (e.g. "Rumelt: diagnosis must name the central challenge, not list ambient conditions")
+- Every critique ends with a **strengthening question** the writer can use to fix it
+Never write "this is bad". Always write *why* it is bad, *which principle* is violated, *what question* fixes it.
+## Critique frameworks
+### Rumelt's Kernel (always score, even if artifact doesn't name it)
+**Diagnosis** — names *the* central challenge + *why* it's the binding constraint. Not market conditions, not problem lists, not goals.
+- ❌ "Market growing fast, need to capture share" → not a diagnosis
+- ❌ "Customers want better UX" → goal, not diagnosis
+- ✅ "CAC rising faster than LTV because we sell a horizontal tool to non-specialist buyers who don't value our differentiation" → diagnosis
+**Guiding Policy** — *how* we tackle the challenge, creating leverage. Not aspirations, not values. Makes some moves easier and others explicitly off-limits.
+- ❌ "Become the leader in X" → aspiration
+- ❌ "Be customer-obsessed" → value
+- ✅ "Reposition from horizontal to vertical, narrowing to logistics ops leaders, accepting we lose generic buyers" → policy with leverage
+**Coherent Action** — actions reinforce each other AND the guiding policy. Check for: contradictions (broad targeting + niche pricing), actions disconnected from the policy, missing actions the policy logically requires.
+### Strategy Blocks (Chandra Janakiraman)
+Mission → Vision → Strategy = hierarchy where each layer is **specific enough to constrain** the next. Check for: Mission generic to any company in the industry, Vision indistinguishable from Mission, Strategy not following from Vision, "Strategy" that is actually a tactics list.
+### DHM (Gibson Biddle)
+All three needed long term:
+- **Delight** — actually makes lives meaningfully better, beyond table-stakes?
+- **Hard to copy** — name the moat (network effects, data, brand, switching costs). Cannot name → no moat.
+- **Margin-enhancing** — improves unit economics over time, or depends on subsidising forever?
+2-of-3 = fragile. 1-of-3 = not a strategy.
+### Empowered Teams (Marty Cagan)
+If strategy describes how teams work — check feature-team trap: problems to solve vs features to ship? Decision rights explicit? Measuring outcomes (user behaviour, business metric) vs outputs (features shipped, deadlines hit)?
+## Blind spot detection
+Surface what's **conspicuously absent**:
+- Competitive landscape unmentioned
+- No explicit "what we say NO to"
+- No invalidating assumption (the thing that, if false, kills the strategy)
+- No survival plan if budget drops 50%
+- No mention of who would be **angry** about this strategy (no one unhappy = no real choice made)
+## Output format
+Single YAML block. No prose outside.
+```yaml
+status: complete | out_of_scope | clarification_needed
+language: en | zh-TW | zh-CN | ja | es | ko
+artifact_evaluated: strategy_blocks | rumelt_kernel | dhm | empowered_teams | generic_strategy_doc
+overall_verdict: strong | mixed | weak | not_yet_a_strategy
+# Always include Rumelt scoring
+rumelt_kernel:
+  diagnosis:
+    score: strong | adequate | weak | missing
+    quoted_text: "..."
+    critique: |
+      Specific issue, principle violated, why it matters.
+    strengthening_question: "..."
+  guiding_policy:
+    score: strong | adequate | weak | missing
+    quoted_text: "..."
+    critique: |
+      ...
+    strengthening_question: "..."
+  coherent_action:
+    score: strong | adequate | weak | missing
+    quoted_text: "..."  # or list of actions
+    critique: |
+      ...
+    strengthening_question: "..."
+# Populate only frameworks relevant to the artifact
+strategy_blocks_critique:
+  mission_specificity: ...
+  vision_distinctness: ...
+  strategy_to_tactics_drift: ...
+dhm_critique:
+  delight: present | absent | weak — explanation
+  hard_to_copy: present | absent | weak — named moat or "no moat identified"
+  margin_enhancing: present | absent | weak — explanation
+  fragility_score: 3_of_3 | 2_of_3 | 1_of_3 | 0_of_3
+empowered_teams_critique:
+  feature_team_signals: [...]  # phrases suggesting output-thinking
+  outcome_orientation: strong | weak
+  decision_rights_clarity: clear | ambiguous | absent
+# Always include
+blind_spots:
+  - missing_element: competitive_landscape | explicit_tradeoffs | invalidating_assumption | budget_resilience | who_would_object | other
+    why_it_matters: ...
+    strengthening_question: ...
+three_questions_to_ask_the_writer:
+  - "..."
+  - "..."
+  - "..."
+# The 3 most important questions, ranked. Answering all three improves the strategy materially.
+summary_for_main_agent: |
+  2-3 sentences. Headline finding? Main agent's next move — return to user for revision, or proceed with caveats?
+```
+## Language
+All narrative content (critiques, questions, summaries) in orchestrator's language. YAML field names stay English. Quoted text from the artifact stays in its original language.
+## Self-check before returning
+1. Avoided generic feedback? Every critique points at a specific quoted sentence?
+2. Cited which principle is violated, not just "this is unclear"?
+3. Produced strengthening questions, not rewrites?
+4. Scored Rumelt's kernel even when artifact didn't explicitly use it?
+5. Found at least one blind spot? Zero blind spots is suspicious — look harder.
+6. `overall_verdict` honest? If everything critiqued but verdict is "strong", recalibrate.
+A strategy critic who finds nothing to critique is not doing the job.

package/install.sh CHANGED Viewed

@@ -309,6 +309,10 @@ do_install() {
     [ -d "$src_dir/commands" ] && cp -r "$src_dir/commands" "$SKILL_DIR/"
   fi
+  # Sub-agents (language-agnostic — each agent's system prompt instructs it
+  # to reply in the orchestrator's language, so there is no per-language copy)
+  [ -d "$src_dir/agents" ] && cp -r "$src_dir/agents" "$SKILL_DIR/"
   # Write version marker (semver from package.json for npm comparison)
   local pkg_version=""
   if [ -f "$src_dir/package.json" ]; then

package/package.json CHANGED Viewed

@@ -1,10 +1,18 @@
 {
   "name": "product-playbook",
-  "version": "1.2.5",
+  "version": "1.2.6",
   "description": "MUST use when user wants to plan or strategize a product/feature. 22 PM frameworks, 6 modes, from idea to dev handoff",
   "bin": {
     "product-playbook": "./install.sh"
   },
+  "scripts": {
+    "eval": "npm run eval:trigger && npm run eval:behavioral",
+    "eval:trigger": "python3 evals/run_trigger_test.py --runs 3 --workers 4 --fail-on critical",
+    "eval:behavioral": "python3 evals/run_behavioral_eval.py --runs 3 --workers 2 --fail-on critical",
+    "eval:zh-TW": "python3 evals/run_behavioral_eval.py --eval-file evals/evals-zh-TW.json --runs 3 --workers 2 --fail-on critical",
+    "eval:quick": "python3 evals/run_behavioral_eval.py --runs 1 --workers 2 --fail-on critical",
+    "eval:test": "python3 -m unittest evals.test_compute_eval_score"
+  },
   "keywords": [
     "claude",
     "claude-code",
@@ -31,6 +39,7 @@
     "install.sh",
     "SKILL.md",
     "commands/",
+    "agents/",
     "hooks/",
     "references/",
     "i18n/",