@brunosps00/dev-workflow 0.13.0 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. package/README.md +9 -3
  2. package/package.json +1 -1
  3. package/scaffold/en/commands/dw-bugfix.md +2 -1
  4. package/scaffold/en/commands/dw-code-review.md +1 -0
  5. package/scaffold/en/commands/dw-create-tasks.md +6 -0
  6. package/scaffold/en/commands/dw-deps-audit.md +1 -1
  7. package/scaffold/en/commands/dw-fix-qa.md +1 -1
  8. package/scaffold/en/commands/dw-functional-doc.md +1 -1
  9. package/scaffold/en/commands/dw-help.md +1 -1
  10. package/scaffold/en/commands/dw-redesign-ui.md +1 -1
  11. package/scaffold/en/commands/dw-run-qa.md +2 -1
  12. package/scaffold/en/commands/dw-run-task.md +1 -1
  13. package/scaffold/pt-br/commands/dw-bugfix.md +2 -1
  14. package/scaffold/pt-br/commands/dw-code-review.md +1 -0
  15. package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
  16. package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
  17. package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
  18. package/scaffold/pt-br/commands/dw-functional-doc.md +1 -1
  19. package/scaffold/pt-br/commands/dw-help.md +1 -1
  20. package/scaffold/pt-br/commands/dw-redesign-ui.md +1 -1
  21. package/scaffold/pt-br/commands/dw-run-qa.md +2 -1
  22. package/scaffold/pt-br/commands/dw-run-task.md +1 -1
  23. package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
  24. package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
  25. package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
  26. package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
  27. package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
  28. package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
  29. package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
  30. package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
  31. package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
  32. package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
  33. package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
  34. package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
  35. package/scaffold/skills/dw-testing-discipline/SKILL.md +99 -76
  36. package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
  37. package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +6 -6
  38. package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
  39. package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +2 -2
  40. package/scaffold/skills/dw-ui-discipline/SKILL.md +101 -79
  41. package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +93 -73
  42. package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
  43. package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +0 -170
  44. package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +0 -128
  45. package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +0 -162
  46. package/scaffold/skills/dw-testing-discipline/references/{positive-patterns.md → patterns.md} +0 -0
package/README.md CHANGED
@@ -269,8 +269,10 @@ These are not slash commands — they are primitives other commands call to enfo
 
  | Skill | Description | Source | License |
  |-------|-------------|--------|---------|
- | **dw-ui-discipline** | UI doctrine: 4-checkpoint hard-gate (brand authorities or curated defaults, surface job sentence, state matrix, scene sentence), 14 anti-slop patterns + 17 anti-defaults, WCAG 2.2 AA floor with verification recipes, 10 curated palette/font defaults for bootstrap | [pedronauck/skills](https://github.com/pedronauck/skills) `ui-craft` | MIT |
- | **dw-testing-discipline** | Testing doctrine: Six Iron Laws, 12 positive patterns, 25 anti-patterns across 5 families (Brittleness/Flakiness/Mock-misuse/Process/AI-specific), 7 mandatory AI agent gates, flaky discipline + SLOs, Playwright recipes, browser security-boundary patterns | [pedronauck/skills](https://github.com/pedronauck/skills) `testing-boss` + [`addyosmani/agent-skills`](https://github.com/addyosmani/agent-skills) | MIT |
+ | **dw-ui-discipline** | UI doctrine: 4 grounding questions (design source, surface job, state matrix, who/where/light/mood), 14 visual-slop patterns + 17 anti-defaults, WCAG 2.2 AA floor with verification recipes, 10 curated palette/font defaults for bootstrap | dev-workflow (original work) | MIT |
+ | **dw-testing-discipline** | Testing doctrine: six core rules, 12 positive patterns, 25 anti-patterns across 4 families (fragile/non-deterministic/mock-driven/suite-hygiene), 6 mandatory agent guardrails, flaky discipline + SLOs, Playwright recipes, browser security-boundary patterns | dev-workflow (original work) + browser-DevTools patterns from [`addyosmani/agent-skills`](https://github.com/addyosmani/agent-skills) | MIT |
+ | **dw-incident-response** | Five-phase incident workflow (triage → investigation → resolution → communication → postmortem) with checkpoints and structured outputs to `.dw/incidents/`. Severity classification (SEV-1..4), runbook templates, on-call handoff, blameless postmortem template, action-item quality bar. | [wilsto/claude-code-starter-kit](https://github.com/wilsto/claude-code-starter-kit) (MIT, credits `wshobson/agents` v1.3.0) | MIT |
+ | **dw-llm-eval** | LLM/AI evaluation doctrine: five-rung oracle ladder (exact → schema → outcome → LLM-as-judge → human), judge calibration (Spearman ≥0.80 against humans), reference-dataset principle (20 from real failures > 200 synthetic), RAG metrics (precision@k + faithfulness + utilization), agent eval (outcome-vs-trajectory + 4 trajectory match modes) | Trajectory match modes from [`langchain-ai/agentevals`](https://github.com/langchain-ai/agentevals) (MIT); other patterns distilled from open evaluations literature | MIT |
  | **vercel-react-best-practices** | 67 React/Next.js performance optimization rules across 8 priority categories. Wraps the rules with `references/perf-discipline.md` (measure → identify → fix → verify → guard) so perf work is data-driven, not vibes-based | [Vercel Labs](https://github.com/vercel-labs/agent-skills) + [`addyosmani/agent-skills`](https://github.com/addyosmani/agent-skills) | MIT |
  | **security-review** | Systematic vulnerability review based on OWASP with confidence-based reporting | [OWASP Cheat Sheet Series](https://cheatsheetseries.owasp.org/) | CC BY-SA 4.0 |
  | **humanizer** | Detects and removes 24 AI writing patterns based on Wikipedia's "Signs of AI Writing" guide | [Wikipedia AI Writing Guide](https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing) | -- |
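The `dw-llm-eval` row above requires judge calibration at Spearman ≥0.80 against human labels before an LLM-as-judge rung is trusted. As a minimal sketch (the scores below are hypothetical, and this is not the skill's actual implementation), the calibration bar looks like:

```python
# Illustrative judge-calibration check: an LLM-as-judge (oracle rung 4)
# is only trusted once its scores correlate with human labels at
# Spearman rho >= 0.80. All data here is made up for the example.

def _ranks(xs):
    # Assign 1-based ranks, averaging ranks within tie groups.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    # Spearman rho = Pearson correlation of the rank vectors.
    ra, rb = _ranks(a), _ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

human = [5, 3, 4, 1, 2, 4, 5, 2]   # hypothetical human scores
judge = [4, 3, 5, 1, 2, 3, 5, 1]   # hypothetical LLM-judge scores
rho = spearman(human, judge)
print(f"spearman={rho:.2f}", "CALIBRATED" if rho >= 0.80 else "RECALIBRATE")
```

In practice a library routine (e.g. `scipy.stats.spearmanr`) would replace the hand-rolled ranking; the point is only that rung 4 carries explicit numeric evidence, not vibes.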
@@ -318,7 +320,11 @@ Source-driven development, code simplification, debugging discipline, and git wo
 
  Spec-Driven Development patterns — declarative constitution (`.dw/constitution.md`), cross-artifact consistency check (PRD ↔ TechSpec ↔ Tasks), and template override layer (`.dw/templates/overrides/`) — adapted from [`github/spec-kit`](https://github.com/github/spec-kit) by GitHub (MIT). dev-workflow specifics: embedded into existing commands instead of new slash commands, severity-graded enforcement (`info`/`high`/`critical`) with ADR-justified deviation as the escape hatch, absence of a constitution never blocks (auto-installs defaults and continues), and integration with the analytical `.dw/rules/` already produced by `/dw-analyze-project`.
 
- UI discipline (hard-gate, 14 anti-slop patterns, 17 anti-defaults, WCAG 2.2 AA floor) and testing doctrine (Six Iron Laws, 25 anti-patterns across 5 families, 7 AI agent gates, flaky discipline) adapted from [`pedronauck/skills`](https://github.com/pedronauck/skills) `ui-craft` and `testing-boss` (MIT) into the bundled `dw-ui-discipline` and `dw-testing-discipline` skills. dev-workflow specifics: 10 curated palette/font defaults bootstrap discipline when no design authority exists; Playwright recipes from earlier `webapp-testing` migrate into `dw-testing-discipline/references/playwright-recipes.md`; both skills wire into 11 commands across the pipeline.
+ UI discipline (`dw-ui-discipline`) and testing doctrine (`dw-testing-discipline`) are original works in this repository. Earlier dev-workflow versions (≤0.13.x) drew on `pedronauck/skills` `ui-craft` and `testing-boss` as inspiration, but in v0.14.0 those skills were rewritten clean-room after a license audit confirmed that the upstream repo has no explicit LICENSE file at root — the README's MIT claim is unverified. The underlying ideas (grounding before design; behavior over mocks; mutation over coverage) are widely documented general software engineering principles available in many sources (Beck, Fowler, Meszaros, Feathers, Google SRE Book, WCAG specifications). Browser-DevTools patterns from [`addyosmani/agent-skills`](https://github.com/addyosmani/agent-skills) (MIT) live inside `dw-testing-discipline/references/` (`security-boundary.md`, `three-workflow-patterns.md`).
+
+ Incident response (`dw-incident-response`) adapted from [`wilsto/claude-code-starter-kit/incident-response`](https://github.com/wilsto/claude-code-starter-kit) (MIT). The 5-phase workflow structure and runbook templates come from there. wilsto credits the upstream `wshobson/agents` plugin `incident-response` (v1.3.0); attribution chain preserved. Additional reading cited in the skill: Google SRE Book, Etsy Debriefing Facilitation Guide, PagerDuty Incident Response Documentation.
+
+ LLM evaluation (`dw-llm-eval`) trajectory-match modes (strict / unordered / subset / superset) and tool-argument matching strategies adapted from [`langchain-ai/agentevals`](https://github.com/langchain-ai/agentevals) (MIT). The broader oracle-ladder framing, judge-calibration discipline, and reference-dataset principle are distilled from the open evaluations literature (OpenAI evals cookbook, Anthropic evals guidance, the academic eval-of-LLM body of work) and rewritten in our voice.
 
  ## Migration from v0.12.x
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@brunosps00/dev-workflow",
- "version": "0.13.0",
+ "version": "0.15.0",
  "description": "AI-driven development workflow commands for any project. Scaffolds a complete PRD-to-PR pipeline with multi-platform AI assistant support.",
  "bin": {
  "dev-workflow": "./bin/dev-workflow.js"
package/scaffold/en/commands/dw-bugfix.md CHANGED
@@ -18,7 +18,8 @@
  - `dw-debug-protocol`: **ALWAYS** — runs the bug through the six-step triage (Reproduce → Localize → Reduce → Fix Root Cause → Guard → Verify End-to-End). Stop-the-line discipline; root-cause over symptom; regression test committed in the same atomic commit. Non-reproducible bugs follow the instrument-first sub-protocol — no guess fixes without explicit acknowledgement.
  - `dw-verify`: **ALWAYS** — in Direct mode, invoked before committing the fix. The VERIFICATION REPORT must show the original bug symptom no longer reproduces (not just that tests pass).
  - `vercel-react-best-practices`: use when the bug affects React/Next.js and there is suspicion of render, hydration, fetching, waterfall, bundle, or re-render issues
- - `dw-testing-discipline`: use when the fix requires a reproducible E2E/retest flow in a web app — `references/playwright-recipes.md` for recipes, Iron Laws + 7 AI Gates for any test the fix adds, flaky-discipline if the bug surfaces intermittently.
+ - `dw-testing-discipline`: use when the fix requires a reproducible E2E/retest flow in a web app — `references/playwright-recipes.md` for recipes, core rules + 6 agent guardrails for any test the fix adds, flaky-discipline if the bug surfaces intermittently.
+ - `dw-incident-response`: use when the bug has severity `critical` AND affects production AND was detected by alert/user-report (i.e., the bug IS an incident, not a backlog item). Triggers the 5-phase workflow (triage → investigation → resolution → communication → postmortem) with structured output in `.dw/incidents/`. Fixes ride on `/dw-bugfix` per the incident's resolution phase.
  - `security-review`: use when the root cause touches auth, authorization, external input, upload, secrets, SQL, XSS, SSRF, or other sensitive surfaces
 
  ## Input Variables
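The `dw-incident-response` trigger above is a three-way conjunction: severity, environment, and detection channel must all match before the bug leaves the normal bugfix path. A minimal sketch of that routing rule (the field names are illustrative, not part of the package's schema):

```python
# Sketch of the routing rule in the bullet above: a bug becomes an
# incident (5-phase workflow, output under .dw/incidents/) only when
# ALL three conditions hold; otherwise it stays on the /dw-bugfix path.

def route_bug(severity: str, environment: str, detected_by: str) -> str:
    is_incident = (
        severity == "critical"
        and environment == "production"
        and detected_by in {"alert", "user-report"}
    )
    return "dw-incident-response" if is_incident else "dw-bugfix"

print(route_bug("critical", "production", "alert"))    # dw-incident-response
print(route_bug("critical", "staging", "alert"))       # dw-bugfix
print(route_bug("high", "production", "user-report"))  # dw-bugfix
```

The AND makes the gate conservative by design: a critical bug found during development, or a production annoyance found by routine QA, stays a backlog item.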
package/scaffold/en/commands/dw-code-review.md CHANGED
@@ -27,6 +27,7 @@ When available in the project under `./.agents/skills/`, use these skills as ana
  - `dw-simplification`: use when the diff touches dense or twisty code — applies Chesterton's Fence (understand WHY before flagging removal), behavior-preserving refactor protocol (test gate before/after), and complexity metrics (cyclomatic, cognitive, depth, fan-out) so that "simplify this" findings are concrete, not vibes-based.
  - `security-review`: use when auth, authorization, external input, upload, SQL, external integration, secrets, SSRF, XSS, or sensitive surfaces are present
  - `vercel-react-best-practices`: use when the diff touches React/Next.js to review rendering, fetching, bundle, hydration, and performance patterns
+ - `dw-llm-eval`: **REQUIRED when the diff touches AI/LLM feature code paths** (chat handlers, RAG, classifiers, agents, prompt templates). The PR must include: (1) reference dataset path under `.dw/eval/datasets/<feature>/`, (2) at least 2 oracle rungs covering the feature, lower rungs FIRST (rung 1-3 before rung 4), (3) judge-calibration evidence if rung 4 is used (Spearman ≥0.80 against humans), (4) eval run results on the touched code path. Missing any of these → **REJECTED**.
 
  ## Codebase Intelligence
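The four-gate PR rule in the `dw-llm-eval` bullet above can be sketched as a checklist evaluator. This is a hypothetical illustration (the `PRInfo` fields and the sample feature name are invented; the command only defines the rules in prose):

```python
# Sketch of the four-gate check for PRs touching AI/LLM code paths.
# An empty failure list means PASS; any entry means REJECTED.
from dataclasses import dataclass, field

@dataclass
class PRInfo:
    dataset_path: "str | None"            # (1) dataset under .dw/eval/datasets/<feature>/
    oracle_rungs: list = field(default_factory=list)  # (2) rungs 1..5 covering the feature
    judge_calibrated: bool = False        # (3) Spearman >= 0.80 evidence (rung 4 only)
    eval_results_attached: bool = False   # (4) eval run on the touched code path

def review_gate(pr: PRInfo) -> list:
    failures = []
    if not pr.dataset_path:
        failures.append("missing reference dataset path")
    if len(pr.oracle_rungs) < 2:
        failures.append("fewer than 2 oracle rungs")
    if pr.oracle_rungs and min(pr.oracle_rungs) >= 4:
        failures.append("no lower rung (1-3) before rung 4")
    if 4 in pr.oracle_rungs and not pr.judge_calibrated:
        failures.append("rung 4 used without judge-calibration evidence")
    if not pr.eval_results_attached:
        failures.append("no eval run results")
    return failures

ok = PRInfo(".dw/eval/datasets/chat-summary/", [2, 4], True, True)
print(review_gate(ok))  # []
```

Note how "lower rungs first" is read here: a PR that only ships a rung-4 judge fails the gate until a cheaper deterministic rung (1-3) backs it up.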
 
package/scaffold/en/commands/dw-create-tasks.md CHANGED
@@ -8,6 +8,12 @@
  ## Pipeline Position
  **Predecessor:** `/dw-create-techspec` | **Successor:** `/dw-run-task` or `/dw-run-plan`
 
+ ## Complementary Skills
+
+ When available in the project under `./.agents/skills/`, use these skills as planning support:
+
+ - `dw-llm-eval`: **REQUIRED when the PRD describes an AI / LLM feature** (chat, RAG, summarization, classifier, agent, tool-use, structured extraction). Add a mandatory "Eval Plan" subtask to one of the generated tasks — the subtask defines (a) the reference dataset path, (b) which oracle rungs (1-5) apply, (c) judge-calibration evidence if rung 4 is used, (d) target metrics per rung. Failing to add an eval-plan subtask for an AI feature is caught by the final consistency check.
+
  ## Prerequisites
 
  The feature you will work on is identified by this slug:
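The "Eval Plan" subtask fields (a)-(d) above translate naturally into a small validated structure. A sketch under stated assumptions (the function, keys, and the `rag-search` feature name are all hypothetical; only the four fields come from the command text):

```python
# Sketch of the mandatory "Eval Plan" subtask: fields (a)-(d) from the
# dw-llm-eval bullet, with the rung-4 calibration requirement enforced.

def make_eval_plan(feature, rungs, target_metrics, judge_calibration=None):
    assert rungs and all(1 <= r <= 5 for r in rungs), "oracle rungs are 1-5"
    if 4 in rungs:
        # Rung 4 (LLM-as-judge) needs calibration evidence per the command.
        assert judge_calibration is not None, "rung 4 requires calibration evidence"
    return {
        "dataset_path": f".dw/eval/datasets/{feature}/",  # (a)
        "oracle_rungs": sorted(rungs),                    # (b) lower rungs first
        "judge_calibration": judge_calibration,           # (c) e.g. {"spearman": 0.84}
        "target_metrics": target_metrics,                 # (d) per-rung targets
    }

plan = make_eval_plan("rag-search", [4, 1, 3],
                      {"precision@5": 0.8, "faithfulness": 0.9},
                      judge_calibration={"spearman": 0.84})
print(plan["oracle_rungs"])  # [1, 3, 4]
```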
package/scaffold/en/commands/dw-deps-audit.md CHANGED
@@ -31,7 +31,7 @@ This command is **distinct** from `/dw-security-check`:
  | `security-review` (`references/supply-chain.md`) | **ALWAYS** when classifying findings — gives OWASP A06 (Vulnerable & Outdated Components) framing for the brainstorm trade-offs |
  | `dw-source-grounding` | **ALWAYS** in the brainstorm phase — each per-package update option (Conservative/Balanced/Bold) cites the official changelog/release notes for the target version: `[source: <url>, version: X.Y, retrieved: YYYY-MM-DD]`. Catches "agent recommends v5 because it sounds modern, but v5 dropped Node 18 support" errors. |
  | `dw-council` | Auto opt-in when ≥3 packages land in tier COMPROMISED — multi-advisor stress-test on remediation order and scope |
- | `dw-testing-discipline` | Optional — when the scoped test phase needs Playwright recipes for frontend projects. Iron Laws + anti-patterns apply to any test added during the audit. |
+ | `dw-testing-discipline` | Optional — when the scoped test phase needs Playwright recipes for frontend projects. Core rules + anti-patterns apply to any test added during the audit. |
 
  ## Input Variables
 
package/scaffold/en/commands/dw-fix-qa.md CHANGED
@@ -20,7 +20,7 @@ When available in the project under `./.agents/skills/`, use these skills as ope
 
  - `dw-debug-protocol`: **ALWAYS** — every bug-shaped finding (failing scenario, not missing feature) flows through the six-step triage. The retest evidence is the step-6 verification artifact; the regression test added in step 5 is what allows `Fixed` status to stick.
  - `dw-verify`: **ALWAYS** — invoked before marking any bug as `Fixed` or `Closed` in `QA/bugs.md`. Without a VERIFICATION REPORT PASS (test + lint + build) **and** retest evidence (screenshot in UI mode OR JSONL log line in API mode), status stays `Reopened` or `Under review`.
- - `dw-testing-discipline`: (UI mode) consult `references/playwright-recipes.md` for retest structures, captures, scripts. Apply Iron Laws + flaky discipline when retesting bug fixes — quarantine and SLOs from the doctrine apply.
+ - `dw-testing-discipline`: (UI mode) consult `references/playwright-recipes.md` for retest structures, captures, scripts. Apply core rules + flaky discipline when retesting bug fixes — quarantine and SLOs from the doctrine apply.
  - `vercel-react-best-practices`: (UI mode) use only if the fix affects React/Next.js frontend and there is risk of rendering, hydration, fetching, or performance regression
  - `api-testing-recipes`: **(API mode — ALWAYS)** source of the recipe used at QA time. Re-execute the original `.http`/pytest/supertest/etc. file for the bug's RF; append the retest result to a fresh JSONL log under `QA/logs/api/BUG-NN-retest.log`
 
package/scaffold/en/commands/dw-functional-doc.md CHANGED
@@ -55,7 +55,7 @@ Works best with project analyzed by `/dw-analyze-project`
 
  When available in the project under `./.agents/skills/`, use these skills as operational support without replacing this command as source of truth:
 
- - `dw-testing-discipline`: support for structuring E2E flows (`references/playwright-recipes.md`), evidence collection patterns, and applying Iron Laws + selector hierarchy to any test the doc references
+ - `dw-testing-discipline`: support for structuring E2E flows (`references/playwright-recipes.md`), evidence collection patterns, and applying core rules + selector hierarchy to any test the doc references
  - `remotion-best-practices`: mandatory support when there is a final human video, captions, composition, transitions, FFmpeg, or Remotion
  - `humanizer`: mandatory support for reviewing and naturalizing all captions, `.srt` files, descriptive texts, and any human-facing writing before final delivery
  - `dw-ui-discipline`: use when documenting visual patterns — the state matrix and scene sentence become part of each screen's overview section
package/scaffold/en/commands/dw-help.md CHANGED
@@ -185,7 +185,7 @@ Skills in `.agents/skills/` that commands above invoke transparently. You don't
 
  | Skill | Invoked by | Role |
  |-------|------------|------|
- | `dw-verify` | run-task, run-plan, fix-qa, bugfix, code-review, generate-pr, quick | Iron Law: no success claim without a PASS VERIFICATION REPORT |
+ | `dw-verify` | run-task, run-plan, fix-qa, bugfix, code-review, generate-pr, quick | Core rule: no success claim without a PASS VERIFICATION REPORT |
  | `dw-memory` | run-task, run-plan, autopilot, resume, revert-task | Two-tier workflow memory (shared + task-local) with promotion test |
  | `dw-review-rigor` | code-review, review-implementation, refactoring-analysis | De-duplication, severity ordering, verify-intent-before-flag, signal-over-volume |
  | `dw-council` | brainstorm `--council`, create-techspec `--council` | Multi-advisor debate (3-5 archetypes) with steel-manning, concession tracking, and dissent-preserving synthesis. Opt-in. |
package/scaffold/en/commands/dw-redesign-ui.md CHANGED
@@ -42,7 +42,7 @@ When available in the project under `./.agents/skills/`, use these to guide the
 
  - `dw-ui-discipline`: **REQUIRED** — runs the 4-checkpoint hard-gate (brand authorities OR curated defaults; surface job sentence; complete state matrix; scene sentence) BEFORE any design proposal. The 14 anti-slop patterns are checked against each proposed direction. The WCAG 2.2 AA floor is non-negotiable at the validate step.
  - `vercel-react-best-practices`: use when the project is React/Next.js for performance and implementation patterns
- - `dw-testing-discipline`: consult `references/playwright-recipes.md` for before/after screenshot capture and visual validation. Iron Laws + selector hierarchy apply to any tests generated alongside the redesign.
+ - `dw-testing-discipline`: consult `references/playwright-recipes.md` for before/after screenshot capture and visual validation. Core rules + selector hierarchy apply to any tests generated alongside the redesign.
  - `security-review`: use if the redesign touches authentication flows or sensitive forms
 
  ## Analysis Tools
package/scaffold/en/commands/dw-run-qa.md CHANGED
@@ -20,10 +20,11 @@ You are an AI assistant specialized in Quality Assurance. Your task is to valida
 
  When available in the project under `./.agents/skills/`, use these skills as operational support without replacing this command:
 
- - `dw-testing-discipline`: (UI mode) **ALWAYS** — Iron Laws and 25 anti-patterns apply to every QA test authored. `references/playwright-recipes.md` for tactical patterns. `references/three-workflow-patterns.md` to pick the right verification mode (UI / network / perf). `references/security-boundary.md` for any flow that crosses an auth boundary.
+ - `dw-testing-discipline`: (UI mode) **ALWAYS** — core rules and 25 anti-patterns apply to every QA test authored. `references/playwright-recipes.md` for tactical patterns. `references/three-workflow-patterns.md` to pick the right verification mode (UI / network / perf). `references/security-boundary.md` for any flow that crosses an auth boundary.
  - `vercel-react-best-practices`: (UI mode) use only if the frontend under test is React/Next.js and there is indication of regression related to rendering, fetching, hydration, or perceived performance
  - `dw-ui-discipline`: (UI mode) use when validating design consistency — the anti-slop catalog and WCAG accessibility floor are checked as part of QA evidence
  - `api-testing-recipes`: **(API mode — ALWAYS)** validated snippets for `.http`, pytest+httpx, supertest, WebApplicationFactory, reqwest. Composes per-RF test files in `QA/scripts/api/` and JSONL logs in `QA/logs/api/` per its references
+ - `dw-llm-eval`: **(AI mode — when invoked with `--ai`)** runs the reference dataset at `.dw/eval/datasets/<feature>/` against the current implementation. Computes precision@k / faithfulness / outcome accuracy per the feature type. Logs results as JSONL to `QA/logs/ai/<feature>-<date>.jsonl`. Compares against the prior run to detect regression; alerts when any metric drops >10% from baseline.
 
  ## Analysis Tools
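The AI-mode regression rule above (alert when any metric drops more than 10% from the prior run) can be sketched as a baseline comparison. The metric names below are examples drawn from the bullet; the function and its signature are illustrative, not the package's API:

```python
# Sketch of the >10%-drop regression check: compare the latest eval
# metrics against the prior baseline run and flag offending metrics
# with their relative drop. Improvements and new metrics are ignored.

def regressions(baseline: dict, current: dict, tolerance: float = 0.10) -> dict:
    flagged = {}
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is None or base == 0:
            continue  # no comparable value; skip rather than guess
        drop = (base - cur) / base  # relative drop vs. baseline
        if drop > tolerance:
            flagged[metric] = round(drop, 3)
    return flagged

baseline = {"precision@5": 0.82, "faithfulness": 0.91, "outcome_accuracy": 0.75}
current  = {"precision@5": 0.80, "faithfulness": 0.78, "outcome_accuracy": 0.76}
print(regressions(baseline, current))  # {'faithfulness': 0.143}
```

In the workflow described above, each run's metrics would be appended to `QA/logs/ai/` as JSONL and the previous line would serve as `baseline`.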
 
package/scaffold/en/commands/dw-run-task.md CHANGED
@@ -21,7 +21,7 @@ When available in the project at `./.agents/skills/`, use these skills as specia
  | `dw-verify` | **ALWAYS** — invoked before the commit to produce a Verification Report with fresh evidence |
  | `dw-memory` | **ALWAYS** — reads workflow memory at task start and updates it at task end (promotion test) |
  | `vercel-react-best-practices` | Task touches React rendering, hydration, data fetching, bundle, cache, or performance |
- | `dw-testing-discipline` | Task needs tests (any layer) — applies Iron Laws, 7 AI Gates, anti-patterns catalog. Use `references/playwright-recipes.md` when the task has interactive frontend needing E2E validation. |
+ | `dw-testing-discipline` | Task needs tests (any layer) — applies core rules, 6 agent guardrails, anti-patterns catalog. Use `references/playwright-recipes.md` when the task has interactive frontend needing E2E validation. |
 
  ## Codebase Intelligence
 
package/scaffold/pt-br/commands/dw-bugfix.md CHANGED
@@ -18,7 +18,8 @@
  - `dw-debug-protocol`: **SEMPRE** — conduz o bug pelo six-step triage (Reproduzir → Localizar → Reduzir → Fix Root Cause → Guardar → Verificar End-to-End). Stop-the-line discipline; root-cause sobre symptom; regression test commitado no mesmo commit atômico. Bugs não-reprodutíveis seguem o sub-protocolo instrument-first — sem fix por palpite a não ser com acknowledgement explícito.
  - `dw-verify`: **SEMPRE** — em modo Direto, invocada antes do commit da correção. O VERIFICATION REPORT deve mostrar que o sintoma original do bug não se reproduz mais (não apenas que os testes passam).
  - `vercel-react-best-practices`: use quando o bug afeta React/Next.js e há suspeita de problemas de render, hidratação, fetching, waterfall, bundle ou re-render
- - `dw-testing-discipline`: use quando a correção requer fluxo E2E/reteste reproduzível em web app — `references/playwright-recipes.md` pra recipes, Iron Laws + 7 AI Gates pra qualquer teste que o fix adicione, flaky-discipline se o bug aparece de forma intermitente.
+ - `dw-testing-discipline`: use quando a correção requer fluxo E2E/reteste reproduzível em web app — `references/playwright-recipes.md` pra recipes, core rules + 6 agent guardrails pra qualquer teste que o fix adicione, flaky-discipline se o bug aparece de forma intermitente.
+ - `dw-incident-response`: use quando o bug tem severidade `critical` E afeta produção E foi detectado por alerta/user-report (ou seja, o bug É um incident, não item de backlog). Dispara o workflow de 5 fases (triage → investigation → resolution → communication → postmortem) com saída estruturada em `.dw/incidents/`. As correções rodam via `/dw-bugfix` durante a fase de resolution.
  - `security-review`: use quando a causa raiz toca auth, autorização, input externo, upload, secrets, SQL, XSS, SSRF ou outras superfícies sensíveis
 
  ## Variáveis de Entrada
package/scaffold/pt-br/commands/dw-code-review.md CHANGED
@@ -27,6 +27,7 @@ Quando disponíveis no projeto em `./.agents/skills/`, use estas skills como apo
  - `dw-simplification`: use quando o diff toca código denso ou tortuoso — aplica Chesterton's Fence (entender POR QUÊ antes de propor remoção), protocolo de refactor preservando comportamento (test gate antes/depois) e métricas de complexidade (ciclomática, cognitiva, depth, fan-out) para que findings de "simplifique isso" sejam concretos, não opinativos.
  - `security-review`: use quando auth, autorização, input externo, upload, SQL, integração externa, secrets, SSRF, XSS ou superfícies sensíveis estiverem presentes
  - `vercel-react-best-practices`: use quando o diff tocar React/Next.js para revisar padrões de renderização, fetching, bundle, hidratação e performance
+ - `dw-llm-eval`: **OBRIGATÓRIO quando o diff toca código de feature AI/LLM** (handlers de chat, RAG, classifiers, agentes, templates de prompt). O PR deve incluir: (1) caminho do reference dataset em `.dw/eval/datasets/<feature>/`, (2) no mínimo 2 oracle rungs cobrindo a feature, rungs mais baixos PRIMEIRO (rung 1-3 antes de rung 4), (3) evidência de calibração do juiz se rung 4 for usado (Spearman ≥0.80 vs humanos), (4) resultados de eval run no path tocado. Faltando algum → **REPROVADO**.
 
  ## Inteligência do Codebase
 
package/scaffold/pt-br/commands/dw-create-tasks.md CHANGED
@@ -8,6 +8,12 @@
  ## Posição no Pipeline
  **Antecessor:** `/dw-create-techspec` | **Sucessor:** `/dw-run-task` ou `/dw-run-plan`
 
+ ## Skills Complementares
+
+ Quando disponíveis no projeto em `./.agents/skills/`, use estas skills como apoio de planejamento:
+
+ - `dw-llm-eval`: **OBRIGATÓRIO quando o PRD descreve uma feature AI / LLM** (chat, RAG, summarização, classifier, agente, tool-use, extração estruturada). Adicione uma subtask "Plano de Avaliação" obrigatória em uma das tasks geradas — a subtask define (a) caminho do reference dataset, (b) quais oracle rungs (1-5) se aplicam, (c) evidência de calibração do juiz se rung 4 for usado, (d) métricas-alvo por rung. Não adicionar subtask de eval-plan pra feature AI é pego pelo final consistency check.
+
  ## Pré-requisitos
 
  A funcionalidade em que você trabalhará é identificada por este slug:
package/scaffold/pt-br/commands/dw-deps-audit.md CHANGED
@@ -31,7 +31,7 @@ Este comando é **distinto** do `/dw-security-check`:
  | `security-review` (`references/supply-chain.md`) | **SEMPRE** ao classificar findings — dá o framing OWASP A06 (Vulnerable & Outdated Components) para os trade-offs do brainstorm |
  | `dw-source-grounding` | **SEMPRE** na fase de brainstorm — cada opção de update por pacote (Conservadora/Balanceada/Ousada) cita o changelog/release notes oficial da versão alvo: `[source: <url>, version: X.Y, retrieved: YYYY-MM-DD]`. Previne "agent recomenda v5 porque parece moderno, mas v5 dropou Node 18". |
  | `dw-council` | Opt-in automático quando >=3 pacotes caem em tier COMPROMISED — stress-test multi-conselheiro sobre ordem e escopo de remediação |
- | `dw-testing-discipline` | Opcional — quando a fase de testes escopados precisa de recipes Playwright pra projetos frontend. Iron Laws + anti-patterns valem pra qualquer teste adicionado durante o audit. |
+ | `dw-testing-discipline` | Opcional — quando a fase de testes escopados precisa de recipes Playwright pra projetos frontend. Core rules + anti-patterns valem pra qualquer teste adicionado durante o audit. |
 
  ## Variáveis de Entrada
 
package/scaffold/pt-br/commands/dw-fix-qa.md CHANGED
@@ -20,7 +20,7 @@ Quando disponíveis no projeto em `./.agents/skills/`, use estas skills como sup
 
  - `dw-debug-protocol`: **SEMPRE** — todo finding bug-shaped (cenário falhando, não feature ausente) passa pelo six-step triage. A evidência de reteste é o artefato da etapa 6 (verify); o regression test da etapa 5 é o que sustenta o status `Corrigido`.
  - `dw-verify`: **SEMPRE** — invocada antes de marcar qualquer bug como `Corrigido` ou `Fechado` no `QA/bugs.md`. Sem VERIFICATION REPORT PASS (test + lint + build) + evidência de reteste (screenshot em modo UI OU linha JSONL em modo API), o status permanece `Reaberto` ou `Em análise`.
- - `dw-testing-discipline`: (modo UI) consulte `references/playwright-recipes.md` para estruturas de reteste, capturas, scripts. Aplique Iron Laws + flaky discipline ao retestar fixes — quarantine e SLOs da doutrina valem aqui.
+ - `dw-testing-discipline`: (modo UI) consulte `references/playwright-recipes.md` para estruturas de reteste, capturas, scripts. Aplique core rules + flaky discipline ao retestar fixes — quarantine e SLOs da doutrina valem aqui.
  - `vercel-react-best-practices`: (modo UI) use apenas se a correção afetar frontend React/Next.js e houver risco de regressão de renderização, hidratação, fetching ou performance
  - `api-testing-recipes`: **(modo API — SEMPRE)** fonte da recipe usada no QA. Re-execute o arquivo `.http`/pytest/supertest/etc. original do RF do bug; anexe o resultado do reteste a um log JSONL fresco em `QA/logs/api/BUG-NN-retest.log`
 
package/scaffold/pt-br/commands/dw-functional-doc.md CHANGED
@@ -55,7 +55,7 @@ Funciona melhor com projeto analisado por `/dw-analyze-project`
 
  Quando disponíveis no projeto em `./.agents/skills/`, use estas skills como apoio operacional, sem substituir este comando como fonte de verdade:
 
- - `dw-testing-discipline`: apoio para estruturar fluxos E2E (`references/playwright-recipes.md`), padrões de coleta de evidência, e aplicar Iron Laws + hierarquia de seletores em qualquer teste referenciado pelo doc
+ - `dw-testing-discipline`: apoio para estruturar fluxos E2E (`references/playwright-recipes.md`), padrões de coleta de evidência, e aplicar core rules + hierarquia de seletores em qualquer teste referenciado pelo doc
  - `remotion-best-practices`: apoio obrigatório quando houver vídeo humano final, legendas, composição, transições, FFmpeg ou Remotion
  - `humanizer`: apoio obrigatório para revisar e naturalizar todas as legendas, captions `.srt`, textos descritivos e qualquer redação voltada a leitura humana antes da entrega final
  - `dw-ui-discipline`: use ao documentar padrões visuais — state matrix e scene sentence viram parte da seção de overview de cada tela
@@ -168,7 +168,7 @@ Skills em `.agents/skills/` que os commands acima invocam transparentemente. Voc
 
  | Skill | Invocada por | Papel |
  |-------|--------------|-------|
- | `dw-verify` | run-task, run-plan, fix-qa, bugfix, code-review, generate-pr, quick | Iron Law: nenhuma claim de sucesso sem VERIFICATION REPORT PASS |
+ | `dw-verify` | run-task, run-plan, fix-qa, bugfix, code-review, generate-pr, quick | core rule: nenhuma claim de sucesso sem VERIFICATION REPORT PASS |
  | `dw-memory` | run-task, run-plan, autopilot, resume, revert-task | Memory de workflow em dois níveis (shared + task-local) com promotion test |
  | `dw-review-rigor` | code-review, review-implementation, refactoring-analysis | De-duplication, severity ordering, verify-intent-before-flag, signal-over-volume |
  | `dw-council` | brainstorm `--council`, create-techspec `--council` | Debate multi-advisor (3-5 archetypes) com steel-manning, concession tracking e synthesis que preserva dissent. Opt-in. |
@@ -42,7 +42,7 @@ Quando disponíveis no projeto em `./.agents/skills/`, use para guiar o redesign
 
  - `dw-ui-discipline`: **OBRIGATÓRIO** — roda o hard-gate de 4 checkpoints (brand authorities OU curated defaults; surface job sentence; state matrix completa; scene sentence) ANTES de qualquer proposta. Os 14 anti-slop patterns são checados contra cada direção. O WCAG 2.2 AA floor é não-negociável no step de validate.
  - `vercel-react-best-practices`: use quando o projeto for React/Next.js para padrões de performance e implementação
- - `dw-testing-discipline`: consulte `references/playwright-recipes.md` para captura de screenshots antes/depois e validação visual. Iron Laws + hierarquia de seletores valem pra qualquer teste gerado junto com o redesign.
+ - `dw-testing-discipline`: consulte `references/playwright-recipes.md` para captura de screenshots antes/depois e validação visual. core rules + hierarquia de seletores valem pra qualquer teste gerado junto com o redesign.
  - `security-review`: use se o redesign tocar flows de autenticação ou formulários sensíveis
 
  ## Ferramentas de Análise
@@ -20,10 +20,11 @@ Você é um assistente IA especializado em Quality Assurance. Sua tarefa é vali
 
  Quando disponíveis no projeto em `./.agents/skills/`, use estas skills como apoio operacional sem substituir este comando:
 
- - `dw-testing-discipline`: (modo UI) **SEMPRE** — Iron Laws e 25 anti-patterns valem pra todo teste de QA autorado. `references/playwright-recipes.md` pra patterns táticos. `references/three-workflow-patterns.md` pra escolher o modo certo (UI / network / perf). `references/security-boundary.md` pra qualquer fluxo que cruza boundary de auth.
+ - `dw-testing-discipline`: (modo UI) **SEMPRE** — core rules e 25 anti-patterns valem pra todo teste de QA autorado. `references/playwright-recipes.md` pra patterns táticos. `references/three-workflow-patterns.md` pra escolher o modo certo (UI / network / perf). `references/security-boundary.md` pra qualquer fluxo que cruza boundary de auth.
  - `vercel-react-best-practices`: (modo UI) use apenas se o frontend sob teste for React/Next.js e houver indicação de regressão relacionada a renderização, fetching, hidratação ou performance percebida
  - `dw-ui-discipline`: (modo UI) use ao validar consistência de design — o catálogo anti-slop e o floor de acessibilidade WCAG são checados como parte da evidência de QA
  - `api-testing-recipes`: **(modo API — SEMPRE)** snippets validados para `.http`, pytest+httpx, supertest, WebApplicationFactory, reqwest. Compõe um arquivo de teste por RF em `QA/scripts/api/` e logs JSONL em `QA/logs/api/` segundo seus references
+ - `dw-llm-eval`: **(modo AI — quando invocado com `--ai`)** roda o reference dataset em `.dw/eval/datasets/<feature>/` contra a implementação atual. Computa precision@k / faithfulness / outcome accuracy conforme tipo da feature. Loga resultados como JSONL em `QA/logs/ai/<feature>-<date>.jsonl`. Compara contra a run anterior pra detectar regressão; alerta quando qualquer métrica cai >10% do baseline.
 
  ## Ferramentas de Análise
 
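The new `dw-llm-eval` entry above alerts when any metric drops more than 10% from the previous run's baseline. A minimal sketch of that comparison, assuming a simple JSONL record shape — the `metric`/`value` field names are illustrative assumptions, not the skill's actual schema:

```python
import json

def metric_regressions(current_path, baseline_path, tolerance=0.10):
    """Flag eval metrics that fell more than `tolerance` below baseline.

    Assumed JSONL shape: one {"metric": ..., "value": ...} object per line.
    """
    def load(path):
        with open(path) as f:
            return {rec["metric"]: rec["value"] for rec in map(json.loads, f)}

    current, baseline = load(current_path), load(baseline_path)
    # A metric regresses when the new value is below (1 - tolerance) x baseline.
    return sorted(m for m, base in baseline.items()
                  if m in current and current[m] < base * (1 - tolerance))
```

Metrics absent from the current run are ignored here; a stricter version might flag them too.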
@@ -21,7 +21,7 @@ Quando disponíveis no projeto em `./.agents/skills/`, use estas skills como sup
  | `dw-verify` | **SEMPRE** — invocada antes do commit para produzir Verification Report com evidence fresca |
  | `dw-memory` | **SEMPRE** — lê memory da workflow no início e atualiza ao final da task (promotion test) |
  | `vercel-react-best-practices` | Task envolve renderização React, hidratação, data fetching, bundle, cache ou performance |
- | `dw-testing-discipline` | Task precisa de testes (qualquer layer) — aplica Iron Laws, 7 AI Gates, catálogo de anti-patterns. Use `references/playwright-recipes.md` quando a task tem frontend interativo precisando de validação E2E. |
+ | `dw-testing-discipline` | Task precisa de testes (qualquer layer) — aplica core rules, 6 agent guardrails, catálogo de anti-patterns. Use `references/playwright-recipes.md` quando a task tem frontend interativo precisando de validação E2E. |
 
  ## Inteligência do Codebase
 
@@ -0,0 +1,164 @@
+ ---
+ name: dw-incident-response
+ description: Use when a production incident is reported, when writing a postmortem, or when an on-call handoff is needed. Five-phase guided workflow (triage → investigation → resolution → communication → postmortem) with checkpoints between phases and structured output files persisted to .dw/incidents/.
+ ---
+
+ # Incident Response
+
+ > **Inspired by** [`wilsto/claude-code-starter-kit/incident-response`](https://github.com/wilsto/claude-code-starter-kit) (MIT). Five-phase workflow structure and runbook templates adapted from that skill; specifics rewritten for dev-workflow's `.dw/` namespace and command surface.
+
+ > wilsto credits the original `wshobson/agents` plugin `incident-response` (v1.3.0). Attribution chain: wshobson → wilsto → dev-workflow.
+
+ ## When to use
+
+ - A production incident is declared (SEV-1 through SEV-3).
+ - You need to write a postmortem after an incident.
+ - You're generating or updating a runbook for a service.
+ - You need an on-call handoff template.
+ - `/dw-bugfix` detects severity `critical` + production marker — auto-escalates here.
+
+ ## Key concepts
+
+ ### Severity classification
+
+ | Severity | Criteria | Response time | Example |
+ |----------|----------|---------------|---------|
+ | **SEV-1 (Critical)** | Service down, data loss, security breach | Immediate (page) | Payment system offline |
+ | **SEV-2 (Major)** | Significant degradation, partial outage | < 30 min | API latency 10× normal |
+ | **SEV-3 (Minor)** | Limited impact, workaround exists | < 4 hours | One endpoint returning 500s |
+ | **SEV-4 (Low)** | Cosmetic, non-urgent | Next business day | Dashboard chart broken |
+
+ See `references/severity-and-triage.md` for full criteria and triage commands per stack.
+
+ ### Behavioral rules
+
+ 1. **Execute phases in order** — never skip a phase.
+ 2. **Write output files after each phase** — they are the record of truth for the next phase.
+ 3. **STOP at checkpoints** — wait for user confirmation before proceeding.
+ 4. **Halt on failure** — if a step fails, do not continue to the next phase.
+ 5. **File-based context** — read previous phase outputs rather than relying on conversation memory.
+
+ ## Entry questions
+
+ Before starting any phase:
+
+ 1. **What's happening?** Describe the incident in 1–2 sentences. What's broken and what's the user impact?
+ 2. **Severity?** SEV-1 / SEV-2 / SEV-3 / SEV-4 per the table above.
+ 3. **Mode?**
+    - **Full workflow** — all 5 phases (triage → postmortem).
+    - **Postmortem only** — incident already resolved; skip to Phase 5.
+    - **Runbook generation** — produce a runbook template for a service (no live incident).
+
+ ## The five phases
+
+ Each phase writes to `.dw/incidents/<YYYY-MM-DD>-<slug>/`. Slug is auto-generated from the incident title (kebab-case, ≤30 chars).
+
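The naming convention above can be sketched as a small helper — hypothetical code, since the skill specifies only the rule (date prefix, kebab-case slug, ≤30 chars), not an implementation:

```python
import re
from datetime import date

def incident_dir(title: str, day: date) -> str:
    """Build the .dw/incidents/<YYYY-MM-DD>-<slug>/ path for an incident.

    Hypothetical helper: kebab-case the title, cap the slug at 30 chars.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    slug = slug[:30].rstrip("-")  # enforce the 30-char limit without a trailing dash
    return f".dw/incidents/{day.isoformat()}-{slug}/"

print(incident_dir("Checkout payment outage", date(2026, 5, 12)))
# -> .dw/incidents/2026-05-12-checkout-payment-outage/
```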
+ ### Phase 1 — Detection & Triage
+
+ **Output:** `.dw/incidents/<date>-<slug>/01-triage.md`
+
+ Steps:
+ 1. Classify severity using the table above.
+ 2. Assess blast radius: which services, how many users affected, revenue impact if known.
+ 3. Identify immediate mitigation: rollback, feature flag toggle, traffic redirect.
+
+ See `references/severity-and-triage.md` for diagnostic commands per stack (Kubernetes, Docker, generic HTTP).
+
+ **Checkpoint:** present triage summary. Wait for user confirmation before moving to investigation.
+
+ ### Phase 2 — Investigation & Root Cause
+
+ **Output:** `.dw/incidents/<date>-<slug>/02-investigation.md`
+
+ Steps:
+ 1. Build timeline: when did it start? What changed?
+ 2. Correlate signals: metrics spike + deploy + error logs.
+ 3. Hypothesis testing: one theory at a time; verify each before moving on.
+ 4. Identify root cause: not the first symptom, but the underlying assumption that broke.
+
+ Common forensic tools:
+ - `git bisect` for regressions.
+ - Recent deploy log: `git log --oneline --since="24 hours ago"`.
+ - For live monitoring during investigation, `/dw-debug-protocol` flaky-investigation patterns apply.
+
+ **Checkpoint:** present root-cause hypothesis. Wait for user confirmation before applying fix.
+
+ ### Phase 3 — Resolution & Recovery
+
+ **Output:** `.dw/incidents/<date>-<slug>/03-resolution.md`
+
+ Steps:
+ 1. Apply fix: hotfix branch → fast PR (via `/dw-generate-pr`) → deploy.
+ 2. Verify: health checks green, error rate back to baseline.
+ 3. Monitor 30 minutes post-fix for SEV-1/2 to confirm stability.
+
+ **Checkpoint:** confirm full recovery before drafting communications.
+
+ ### Phase 4 — Communication
+
+ **Output:** `.dw/incidents/<date>-<slug>/04-communication.md`
+
+ Two communications are generated using the templates in `references/communication-templates.md`:
+ - **Initial notification** (sent during the incident; updated every 30 min for SEV-1/2).
+ - **Resolution notification** (sent when Phase 3 confirms recovery).
+
+ ### Phase 5 — Postmortem
+
+ **Output:** `.dw/incidents/<date>-<slug>/05-postmortem.md`
+
+ Generate a **blameless** postmortem using `references/postmortem-template.md`. Sections:
+ - Summary (2–3 sentences).
+ - Timeline (per-minute events from alert to all-clear).
+ - Root cause (technical, no blame).
+ - Impact (users affected, revenue, error budget consumed).
+ - What went well / What went wrong.
+ - Action items (owner + due date + priority — see `references/blameless-discipline.md` for the quality bar).
+
+ **Quality bar for action items:** see `references/blameless-discipline.md`. "Improve monitoring" does NOT count. "Add Datadog SLO alert at p99 > 800ms with on-call routing by 2026-06-01, owner: @bruno" counts.
+
+ ## Required reading by context
+
+ | Doing what | Read |
+ |------------|------|
+ | Live incident — triage | `references/severity-and-triage.md` |
+ | Writing the postmortem | `references/postmortem-template.md` + `references/blameless-discipline.md` |
+ | Drafting incident communications | `references/communication-templates.md` |
+ | Generating a runbook (no live incident) | `references/runbook-templates.md` |
+ | On-call handoff document | `references/runbook-templates.md` (handoff section) |
+
+ ## Common pitfalls
+
+ Detailed in `references/blameless-discipline.md`:
+
+ 1. **Skipping triage** — jumping straight to debugging without assessing severity and blast radius spends the critical first hours on the wrong problem.
+ 2. **Blame culture** — postmortems focused on "who did it" hide mistakes, and incidents recur.
+ 3. **No action items** — the postmortem is filed and forgotten; the same incident returns in 3 months.
+ 4. **Communicating too late** — users discover the outage before the team acknowledges it; trust erodes.
+
+ ## Integration with dev-workflow commands
+
+ - `/dw-bugfix` with severity `critical` + production marker → offers to escalate here.
+ - `/dw-autopilot --incident "X"` → runs this workflow end-to-end for declared incidents.
+ - `/dw-analyze-project` reads `.dw/incidents/` on its next execution to surface recurring failure patterns. 3+ incidents touching the same area → flag as "structural problem; needs design review" and propose constitution principles based on observed patterns.
+ - `/dw-generate-pr` is the fix-deployment path during Phase 3.
+ - `/dw-adr` is the right tool when the postmortem leads to a deliberate architectural change.
+
+ ## Output directory layout
+
+ ```
+ .dw/incidents/
+ ├── 2026-05-12-checkout-payment-outage/
+ │   ├── 01-triage.md
+ │   ├── 02-investigation.md
+ │   ├── 03-resolution.md
+ │   ├── 04-communication.md
+ │   └── 05-postmortem.md
+ └── 2026-05-08-search-index-stale/
+     └── 05-postmortem.md   # postmortem-only mode
+ ```
+
+ Files are committed to the repo alongside code — incidents are part of the project history, not ephemeral chat.
+
+ ## Why this skill exists
+
+ dev-workflow's existing surface is "build feature → ship." Nothing covered "production broke, what now?" Teams improvised postmortems, action items got lost, and the same bugs recurred. This skill closes that loop: structured response in the moment, blameless reflection afterward, and cross-incident learning that feeds back into the project's constitution.
@@ -0,0 +1,126 @@
+ # Blameless discipline — the principles behind postmortems
+
+ The postmortem template imposes structure. This reference imposes the discipline that makes the structure useful.
+
+ ## Why blameless
+
+ Postmortems with blame produce three failure modes:
+ 1. **Hidden mistakes** — people learn to omit information that might reflect badly on them.
+ 2. **Compliance theater** — "lessons learned" becomes a ritual, not a tool.
+ 3. **Recurring incidents** — the same kind of failure happens 6 months later because the underlying system never changed.
+
+ Blameless framing forces the conversation toward what's actually fixable: systems, processes, assumptions, tooling. Individual mistakes are the entry point; the question is always "why did the system make this mistake easy or likely?"
+
+ ## The 5-whys protocol
+
+ Stack five "why" questions until you reach an assumption or design choice that's actually changeable:
+
+ > **Symptom:** payment endpoint returned 500 for 47 minutes.
+ > **Why?** The new deploy introduced a regression in the order serializer.
+ > **Why?** The serializer started reading a field that the upstream API stopped sending.
+ > **Why?** The upstream API changed its response format two weeks ago and we didn't know.
+ > **Why?** We don't have contract tests against the upstream API.
+ > **Why?** Contract tests were considered too expensive to set up; the trade-off was never revisited.
+
+ **Root cause:** absence of contract tests for upstream APIs (a deliberate-but-stale design decision), not "the developer didn't check the response format."
+
+ If you reach "operator error" at any why, you stopped too early. Operator error happens because a system permits it.
+
+ ## Quality bar for action items
+
+ The single most common postmortem failure mode: action items written vaguely enough that they can be marked "done" without changing anything.
+
+ ### Bad action items
+
+ - "Improve monitoring."
+ - "Add more tests."
+ - "Document the runbook."
+ - "Make the system more resilient."
+ - "Better communication during incidents."
+
+ These are wishes, not actions. They can't be tracked. They will not happen.
+
+ ### Good action items
+
+ Every action item has THREE components: owner, due date, measurable outcome.
+
+ | Action | Owner | Due | Measurable outcome |
+ |--------|-------|-----|-------------------|
+ | Add Datadog SLO alert at p99 > 800ms with PagerDuty routing to the payments on-call schedule | @bruno | 2026-06-01 | Alert exists in Datadog UI and fires successfully in test |
+ | Add idempotency-key header to `POST /api/orders` with 24h deduplication window | @maria | 2026-05-25 | Two identical requests with the same key return the same order ID; verified by integration test |
+ | Open ADR documenting the decision to add a circuit breaker on Stripe API calls | @bruno | 2026-05-19 | ADR exists in `.dw/spec/<prd>/adrs/`, status: Accepted |
+
+ **Test:** can a third party verify the action item is done without asking the owner? If yes, it's good. If no, rewrite.
+
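That third-party-verifiability test can be partially mechanized. A heuristic sketch — the `@owner` handle and `YYYY-MM-DD` due-date conventions are assumptions about how items are written, not a dev-workflow rule:

```python
import re

def has_three_components(item: str) -> bool:
    """Heuristic check for the owner / due-date / measurable-outcome rule.

    Owner is assumed to appear as an @handle and the due date as
    YYYY-MM-DD; whatever substantial text remains is treated as the
    outcome. A real tracker would be stricter.
    """
    owner = re.search(r"@\w+", item) is not None
    due = re.search(r"\b\d{4}-\d{2}-\d{2}\b", item) is not None
    remainder = re.sub(r"@\w+|\b\d{4}-\d{2}-\d{2}\b", "", item).strip()
    return owner and due and len(remainder) >= 20

print(has_three_components("Improve monitoring."))  # False: a wish, not an action
```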
+ ## Cognitive analysis — beyond the trigger
+
+ Two questions to push past "the code was wrong":
+
+ ### 1. Why didn't we catch this earlier?
+
+ This is about the blind spot. The bug existed; we didn't see it. What's the gap in our:
+ - Tests (unit, integration, contract, E2E)?
+ - Monitoring (metric, alert, dashboard)?
+ - Process (code review, deploy checks, canary)?
+ - Documentation (knew but forgot, never knew)?
+
+ Fix the blind spot, not just the bug.
+
+ ### 2. What ELSE could be hiding behind this gap?
+
+ If contract tests against the upstream API are missing, the payment incident is one instance. What ELSE depends on that API in ways the missing tests would catch?
+
+ The action item is to add the contract tests, not "fix the payment serializer."
+
+ ## Cross-incident learning
+
+ `/dw-analyze-project` reads `.dw/incidents/` on subsequent runs. Patterns to watch:
+
+ - **3+ incidents in the same module** (e.g., billing): structural problem. Open a design-review issue, not another action item.
+ - **3+ incidents with the same root-cause class** (e.g., contract drift, missing idempotency): a constitution principle is needed (`/dw-adr` + add to `.dw/constitution.md`).
+ - **Time clustering** (multiple incidents during the same week): possible stress on the team's review/deploy capacity — a process issue, not a technical one.
+
+ These patterns are MORE valuable than individual postmortems. Single incidents tell you what broke; patterns tell you what's fragile.
+
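The module-clustering signal above is cheap to compute mechanically. A sketch, assuming each postmortem records a `Module:` line — an invented convention for illustration; the actual postmortem template may track the affected area differently:

```python
from collections import Counter
from pathlib import Path

def recurring_modules(incidents_root=".dw/incidents", threshold=3):
    """Return modules named in `threshold` or more postmortems.

    Assumes a 'Module: <name>' line in each 05-postmortem.md; that
    field is illustrative, not part of the actual template.
    """
    counts = Counter()
    for pm in Path(incidents_root).glob("*/05-postmortem.md"):
        for line in pm.read_text().splitlines():
            if line.lower().startswith("module:"):
                counts[line.split(":", 1)[1].strip()] += 1
    return sorted(m for m, n in counts.items() if n >= threshold)
```

Modules returned here are design-review candidates, per the rule above.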
+ ## Common pitfalls (the four big ones)
+
+ ### Pitfall 1: Skipping triage
+
+ **Symptom:** jumping straight to debugging without assessing severity and blast radius.
+ **Consequence:** wrong priority — you might fix a low-impact bug while a high-impact issue festers.
+ **Fix:** always classify severity first. Two minutes of triage saves hours of misguided investigation.
+
+ ### Pitfall 2: Blame culture
+
+ **Symptom:** the postmortem focuses on "who did it" instead of "why did the system allow it?"
+ **Consequence:** people hide mistakes; incidents recur.
+ **Fix:** blameless framing. Focus on systemic fixes — better monitoring, safer deploys, guardrails, removed footguns.
+
+ ### Pitfall 3: No action items
+
+ **Symptom:** postmortem written, filed, forgotten.
+ **Consequence:** the same incident in 3 months.
+ **Fix:** every postmortem has concrete action items with owners, due dates, and measurable outcomes (see "Quality bar" above). Track them in the same system as feature work.
+
+ ### Pitfall 4: Communicating too late
+
+ **Symptom:** users discover the outage before the team acknowledges it.
+ **Consequence:** trust erosion plus a support-ticket flood.
+ **Fix:** first communication within 15 min for SEV-1/2, even if it's only "We're investigating." Status page updates every 30 min until resolved.
+
+ ## When the discipline bends
+
+ - **Internal-only tools:** the communication discipline can be lighter (no public status page).
+ - **Compliance-driven postmortems** (SOC 2, HIPAA): may require additional fields or sign-off chains beyond the template. Add them as a project-specific extension.
+ - **Trivial near-misses** (caught in staging before production): consider a "lite" postmortem — one page with timeline + lesson + action items, not the full structure.
+
+ In all bend cases, document the deviation in the postmortem itself. "Skipped public communication because internal-only tool" is fine; just say it.
+
+ ## Reference reading
+
+ The discipline above is distilled from:
+ - Google SRE Book — [Incident Response](https://sre.google/sre-book/managing-incidents/) and [Postmortem Culture](https://sre.google/sre-book/postmortem-culture/).
+ - Etsy — [Debriefing Facilitation Guide](https://github.com/etsy/DebriefingFacilitationGuide) (the original blameless postmortem playbook).
+ - PagerDuty — [Incident Response Documentation](https://response.pagerduty.com/).
+
+ The dev-workflow version adapts these to a single-team or small-org scale; large enterprises may need more layers.