npm - @luanpdd/kit-mcp - Versions diffs - 1.8.1 → 1.10.0 - Mend

@luanpdd/kit-mcp 1.8.1 → 1.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (61)

package/CHANGELOG.md +86 -0
package/README.md +97 -1
package/gates/golden-signals-coverage.md +133 -0
package/gates/obs-agents-mcp-supabase.md +86 -0
package/gates/obs-skills-frontmatter.md +76 -0
package/gates/omm-no-regression.md +83 -0
package/gates/postmortem-template-required.md +127 -0
package/gates/prr-checklist-coverage.md +128 -0
package/gates/skill-must-include.md +21 -19
package/kit/agents/burn-rate-forecaster.md +160 -0
package/kit/agents/golden-signals-instrumenter.md +241 -0
package/kit/agents/incident-investigator.md +245 -0
package/kit/agents/observability-instrumenter.md +200 -0
package/kit/agents/omm-auditor.md +251 -0
package/kit/agents/postmortem-writer.md +282 -0
package/kit/agents/prr-conductor.md +288 -0
package/kit/agents/slo-engineer.md +224 -0
package/kit/agents/supabase-architect.md +62 -0
package/kit/agents/supabase-auth-bootstrapper.md +17 -0
package/kit/agents/supabase-edge-fn-writer.md +124 -0
package/kit/agents/supabase-migration-writer.md +98 -0
package/kit/agents/supabase-realtime-implementer.md +23 -0
package/kit/agents/supabase-rls-writer.md +17 -0
package/kit/agents/supabase-storage-implementer.md +174 -0
package/kit/agents/toil-auditor.md +277 -0
package/kit/commands/auditar-marco.md +102 -1
package/kit/commands/auditar-observabilidade.md +103 -0
package/kit/commands/auditar-toil.md +129 -0
package/kit/commands/burn-rate-status.md +140 -0
package/kit/commands/concluir-marco.md +73 -1
package/kit/commands/definir-slo.md +108 -0
package/kit/commands/discutir-fase.md +26 -0
package/kit/commands/forense.md +83 -1
package/kit/commands/golden-signals.md +142 -0
package/kit/commands/instrumentar-fase.md +200 -0
package/kit/commands/investigar-producao.md +162 -0
package/kit/commands/observabilidade.md +116 -0
package/kit/commands/planejar-fase.md +20 -0
package/kit/commands/postmortem.md +179 -0
package/kit/commands/prr.md +205 -0
package/kit/commands/risk-budget.md +220 -0
package/kit/commands/sre.md +227 -0
package/kit/commands/verificar-trabalho.md +26 -0
package/kit/skills/_shared-observability/glossary.md +396 -0
package/kit/skills/_shared-sre/glossary.md +573 -0
package/kit/skills/blameless-postmortems/SKILL.md +340 -0
package/kit/skills/burn-rate-alerting/SKILL.md +258 -0
package/kit/skills/core-analysis-loop/SKILL.md +352 -0
package/kit/skills/distributed-tracing/SKILL.md +362 -0
package/kit/skills/eliminating-toil/SKILL.md +243 -0
package/kit/skills/event-based-slos/SKILL.md +296 -0
package/kit/skills/four-golden-signals/SKILL.md +297 -0
package/kit/skills/observability-driven-development/SKILL.md +315 -0
package/kit/skills/observability-maturity-model/SKILL.md +222 -0
package/kit/skills/opentelemetry-standard/SKILL.md +351 -0
package/kit/skills/production-readiness-review/SKILL.md +305 -0
package/kit/skills/sre-risk-management/SKILL.md +221 -0
package/kit/skills/structured-events/SKILL.md +265 -0
package/kit/skills/telemetry-pipelines/SKILL.md +259 -0
package/kit/skills/telemetry-sampling/SKILL.md +256 -0
package/package.json +1 -1

package/CHANGELOG.md CHANGED

@@ -6,6 +6,92 @@ Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) · Versioning:
 ## [Unreleased]
+## [1.10.0] - 2026-05-07
+Milestone v1.10, SRE Engagement Suite: brings techniques from the book *Site Reliability Engineering: How Google Runs Production Systems* (Beyer, Jones, Petoff, Murphy; Google/O'Reilly, 2016) into kit-mcp. 32 REQs across 6 phases (Phases 36-41), delivered in 3 waves: SRE core (Phases 36-38), integration with existing suites (Phases 39-40), and QA gates + docs (Phase 41). It complements the Observability Suite v1.9.0 (published 2026-05-06) and the Supabase Suite v1.8.0; together they form the kit's production engineering stack.
+### Added: 6 foundational SRE skills (Phase 36)
+Each skill is self-contained (no `references/`), with frontmatter `description ≤ 200 chars`, a canonical 5-section template (Quando usar / Regras absolutas / Patterns canônicos / Anti-patterns / Ver também, that is: When to use / Absolute rules / Canonical patterns / Anti-patterns / See also), and cross-references via relative Markdown links.
+- `_shared-sre/glossary.md`: canonical bilingual vocabulary (PT-BR↔EN): SLI, SLO, SLA, error budget, burn rate, toil, postmortem, blameless, PRR, golden signals (latency/traffic/errors/saturation), risk continuum, MTTR, MTBF. Lists explicit anti-patterns (alert fatigue, hero culture, SLO 99.99%+, fixed-window error budget, blame culture, mean-only latency, monitoring causes instead of symptoms).
+- `sre-risk-management`: risk continuum (ch. 3 of the Google SRE book), 99.99% wisdom (a user on a smartphone that is 99% reliable cannot tell 99.99% from 99.999%), error budget as an explicit risk × innovation balance, "as reliable as needs to be, no more".
+- `four-golden-signals`: Latency + Traffic + Errors + Saturation (ch. 6), black-box vs white-box monitoring, separating success latency from error latency, percentiles vs mean (long tail), histograms with exponential bucketing.
+- `eliminating-toil`: canonical definition of toil (manual, repetitive, automatable, tactical, no enduring value, scales linearly), the ≤ 50% rule (ch. 5), automation patterns, and the distinction between toil, overhead, and grungy work.
+- `blameless-postmortems`: canonical 9-section template (Summary, Impact, Root Causes, Trigger, Resolution, Detection, Action Items, Lessons Learned, Timeline UTC), blameless culture (ch. 15), "no postmortem left unreviewed", Wheel of Misfortune for training.
+- `production-readiness-review`: PRR checklist (ch. 32) with 6 axes (System architecture, Instrumentation/Metrics/Monitoring, Emergency response, Capacity planning, Change management, Performance) and 3 engagement models: Simple PRR, Early Engagement, Frameworks/SRE Platform.
+### Added: 4 core SRE agents (Phase 37)
+Each agent includes a `## Compatibilidade` table per IDE (Full / Partial / Offline-only), MCP preflight detection in Step 0 where applicable, and a frontmatter `tools:` field with canonical names.
+- `golden-signals-instrumenter`: specialization of `observability-instrumenter` (v1.9). Takes service/Edge Function code and returns OTel patches with Latency=exponentially bucketed histogram, Traffic=counter per endpoint × method, Errors=counter per `error.type` as a closed enum of 5-15 values (NEVER `error.message`), Saturation=resource-specific gauge with the resource named explicitly.
+- `toil-auditor`: analyzes the repo + git log ≤ 90d + shell scripts + manual commands documented in README/runbooks. Returns `.planning/TOIL-AUDIT.md` listing automation candidates with P0/P1/P2 prioritization and ROI = frequency × time / effort.
+- `postmortem-writer`: accepts `--from-investigation <id>` (continuation of `incident-investigator` v1.9; reads `.planning/investigations/<id>.md`) or `--incident "<description>"` (standalone). Generates a blameless postmortem following the canonical 9-section template in `.planning/postmortems/<id>.md`.
+- `prr-conductor`: conducts a Production Readiness Review for a service/feature. Reads the schema (Supabase MCP), Edge Functions code, defined SLOs (`.planning/slos/`), and audit logs. Produces a `PRR-REPORT.md` scored on 6 axes with gaps and prioritized action items (P0 blocker / P1 scheduled).
+### Added: 6 SRE commands (Phase 38)
+- `/sre <subcommand>`: single orchestrator (analogous to `/supabase` v1.8 and `/observabilidade` v1.9); dispatches via `Task(subagent_type=...)` with PT/EN synonyms to the 5 commands below.
+- `/golden-signals`: invokes `golden-signals-instrumenter` for a service/Edge Function/phase; generates a `GOLDEN-SIGNALS.md` per target with ready-to-use OTel instrumentation.
+- `/auditar-toil`: invokes `toil-auditor`; generates `.planning/TOIL-AUDIT.md`.
+- `/postmortem`: invokes `postmortem-writer`; supports the flag `--from-investigation <id>` (continue from a v1.9 investigation) or `--incident "<description>"` (standalone postmortem).
+- `/prr`: invokes `prr-conductor` for a service/feature; takes the flag `--service <name>` or `--feature <description>`; generates `PRR-REPORT.md`.
+- `/risk-budget`: shows the current state of the error budget against the risk continuum, citing SLOs defined in v1.9 (reads `.planning/slos/`); applies the `sre-risk-management` skill.
+### Added: 3 new audit gates (Phase 41)
+Markdown specs in `gates/` with a `## Check` section written in bash 3.2-portable shell (the macOS default):
+- `gates/golden-signals-coverage.md` (blocking, pre-verify): verifies that service/Edge Function code touched in a phase has the 4 golden signals present (regex over `histogram | counter | gauge | saturation`). Skips gracefully on content-only projects (no `supabase/functions/` / `src/` / `lib/`).
+- `gates/postmortem-template-required.md` (blocking, pre-conclude): in `/concluir-marco`, blocks if there was an incident in `.planning/investigations/` without a corresponding entry in `.planning/postmortems/`. `Status: INCONCLUSIVE` is recognized as an exception (no root cause = no learning to document). Canonical principle: "no postmortem left unreviewed" (ch. 15).
+- `gates/prr-checklist-coverage.md` (blocking, pre-verify): verifies that each `PRR-REPORT.md` in `.planning/prr/**/*.md` covers the 6 PRR axes (System architecture, Instrumentation, Emergency response, Capacity planning, Change management, Performance); skipping an axis = invalid approval (an absolute rule of the `production-readiness-review` skill).
+### Added: integration with Observability Suite v1.9 (Phase 39)
+- **Skill `event-based-slos` (v1.9)** gains a "Risk continuum" block cross-referencing `sre-risk-management`; it explains that the SLO target is an explicit choice on the risk × innovation continuum, not an arbitrary goal.
+- **Agent `omm-auditor` (v1.9)** consults `toil-auditor` for Capability 3 (Complexity/Tech Debt). The OMM-3 score takes into account the % of team time spent on toil. The 5-row Capability 3 table (`< 15%` → 5 / `15-30%` → 4 / `30-50%` → 3 / `50-60%` → 2 / `> 60%` → 1) is replicated as a distributed single source of truth.
+### Added: integration with Supabase Suite v1.8 (Phase 39)
+- **`supabase-edge-fn-writer`** gains a "Four Golden Signals" section: the canonical Edge Function template includes a latency histogram, a traffic counter, an error counter keyed by the `error.type` enum, and a saturation gauge (resource named explicitly: pg_pool / concurrency_limit / pgmq.queue_length / egress_bandwidth, depending on the function type).
+- **`supabase-architect`** gains a PRR mention: the architectural plan suggests a PRR before production, with a cross-reference to `production-readiness-review`. The 6-axis table is adapted to the Supabase context (single project = SPOF mitigated by Pro branches; Spend Cap; git-versioned RLS; declarative schema; load test with a p99 baseline).
+- **`supabase-migration-writer`** gains a toil warning: repetitive SQL scripts (manual index rebuilds, recurring vacuums) are candidates for automation via pg_cron, with a cross-reference to `eliminating-toil`.
+- **`supabase-storage-implementer`** gains a saturation signal: uploads emit a bucket-size gauge + a quota near-exhaustion counter (thresholds 80% yellow / 95% red per plan: Free 1 GB / Pro 100 GB / Team 1 TB / Enterprise custom), with a cross-reference to `four-golden-signals`.
+### Changed: lifecycle hooks in the framework flow (Phase 40)
+Purely additive editorial patches to 3 framework-flow commands: frontmatter (`description`, `allowed-tools`) is preserved byte-for-byte (anti-pitfall A2), and workflows in `.claude/framework/workflows/*.md` keep working as before.
+- **`/forense`** gains an `<sre_integration>` block that automatically suggests chaining `/postmortem` after the Core Analysis Loop closes with a `VALIDATED` root cause. Fundamental distinction: forensics diagnoses (read-only, evidence-based, scientific; output in `.planning/forensics/`), while the postmortem documents blamelessly for organizational learning (ch. 15; output in `.planning/postmortems/`). 3 suggested trigger conditions + 3 explicit non-trigger exceptions (INT-FW-V2-01).
+- **`/concluir-marco`** gains an optional PRR gate: when `workflow.complete_milestone_prr_gate=true` (default `false`, opt-in until SRE maturity), it requires a `PRR-REPORT.md` with status `passed` for production-bound features before archiving. 3-row status table (`passed`: 6/6 axes ≥ 3/5 = archivable / `passed-with-warnings`: P1 pending = archivable with warnings / `failed`: P0 failed = BLOCKS). It coexists orthogonally with the v1.9 OMM regression gate: OMM measures observability maturity, PRR measures production readiness (INT-FW-V2-02).
+- **`/auditar-marco`** invokes `/auditar-toil` automatically when `workflow.audit_milestone_toil=true` (default `true`); the resulting `.planning/TOIL-AUDIT.md` feeds OMM Capability 3 scoring via `omm-auditor`. Canonical closed loop: `/auditar-marco` → `/auditar-toil` → `/auditar-observabilidade` → `omm-auditor` consults `TOIL-AUDIT.md` → `OMM-REPORT.md` includes Capability 3 → `MILESTONE-AUDIT.md` (INT-FW-V2-03).
+### Changed: README gains an "SRE Engagement suite (v1.10)" section
+`README.md` adds a new section between "Observability suite (v1.9)" and the `---` separator listing 6 skills + 4 agents + 6 commands + 3 audit gates + lifecycle integration + an end-to-end quick start example (PRR before production → golden signals instrumentation → after an incident, postmortem chain). The canonical citation of the 2016 Google SRE book mirrors the citation of *Observability Engineering* in the v1.9 section (QA-SRE-04).
+### No runtime API changes
+v1.10 is **content-only by design**: zero changes to `src/core/`, `registry.js`, `sync.js`, or the MCP server. Stable API v1.0+ is fully preserved. CI passes with no change to `.github/workflows/`. Deps budget held at 6/6 (zero new deps; all content is Markdown).
+### Tests
+Existing tests (115 unit + 67 integration, accumulated from v1.7) stay green. The new gates have no dedicated tests (they are bash in markdown, executed via `runGate` in the gates framework already tested in `test/unit/gates.test.js`). Smoke validation per gate: PASS on the current codebase (kit-mcp content-only) + FAIL on a synthetic fixture with gaps + PASS on a synthetic fixture with full coverage; all 3 new gates validated.
+### Architectural decisions
+- **Content-only milestone**: zero changes to `src/core/`. All integration with the framework flow happens through editorial patches to the commands `kit/commands/{forense,concluir-marco,auditar-marco}.md` (matching the v1.9 pattern, which added an `<observability_integration>` block to the same commands).
+- **Specialization over overlap**: `golden-signals-instrumenter` is a specialization of `observability-instrumenter` (v1.9), not a replacement: `observability-instrumenter` handles canonical spans/attributes, while `golden-signals-instrumenter` handles the metrics for the 4 signals; both can coexist in the same PR (canonical chain: `observability-instrumenter` first → `golden-signals-instrumenter` second).
+- **Chain v1.9 → v1.10**: `incident-investigator` (v1.9) closes the Core Analysis Loop with a `VALIDATED` root cause in `.planning/investigations/<id>.md`; `postmortem-writer` (v1.10) consumes it via `--from-investigation <id>` to generate `.planning/postmortems/<id>.md`. The handoff is state-based through the filesystem (not an API).
+- **Blocking pre-verify gates**: `golden-signals-coverage` and `prr-checklist-coverage` are blocking (minimum coverage is an absolute rule). `postmortem-template-required` is blocking pre-conclude (the ch. 15 rule "no postmortem left unreviewed" does not allow warn-only after adoption).
+- **PRR gate in `/concluir-marco` is opt-in**: unlike the v1.9 OMM regression gate (default `true`, established), the v1.10 PRR gate defaults to `false` until the team matures its SRE culture. Toggle via `workflow.complete_milestone_prr_gate=true`. Criterion for turning the gate on: ≥ 2 of the 4 indicators (paid feature, defined SLO, on-call rotation, postmortem culture).
+- **Vendor-neutral**: the `golden-signals-coverage` gate accepts any pattern containing `histogram` / `counter` / `gauge` (OTel, Prometheus, StatsD, Borgmon-like). The Google SRE book describes Borgmon, but it is proprietary; the gate is generic.
+### Details
+`.planning/milestones/v1.10.0/` (after `/concluir-marco`).
 ## [1.8.1] - 2026-05-06
 Integration patch for the Supabase Suite v1.8.0: closes 7 gaps where the new content was not wired into the framework's existing entry points.
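
Reviewer note: the 1.10.0 entry quotes a 5-row table mapping the team's toil percentage to an OMM Capability 3 score. The following is a minimal sketch of that mapping, not code from the package. The function name `omm_cap3_score` is hypothetical, and because the quoted ranges share their endpoints (15, 30, 50, 60), the sketch assumes each upper bound is inclusive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an integer toil percentage (0-100) to an
# OMM Capability 3 score, following the table quoted in the changelog:
#   < 15% -> 5 / 15-30% -> 4 / 30-50% -> 3 / 50-60% -> 2 / > 60% -> 1
# Assumption: shared endpoints (30, 50, 60) belong to the higher score.
omm_cap3_score() {
  local pct="$1"
  if [ "$pct" -lt 15 ]; then
    echo 5
  elif [ "$pct" -le 30 ]; then
    echo 4
  elif [ "$pct" -le 50 ]; then
    echo 3
  elif [ "$pct" -le 60 ]; then
    echo 2
  else
    echo 1
  fi
}

# Example: a team spending 42% of its time on toil.
omm_cap3_score 42   # prints 3
```

How the shipped `omm-auditor` resolves the boundary values is not visible in this diff excerpt, so the inclusive upper bounds should be checked against `package/kit/agents/omm-auditor.md`.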

package/README.md CHANGED

@@ -59,10 +59,106 @@ kit-mcp/
 ### About the bundled workflow
-The bundled `kit/` is an opinionated **brownfield planning workflow** in Portuguese — milestones, phases, requirements, planning, execution with atomic commits and checkpoints, retrospective auditing. Installing `@luanpdd/kit-mcp` and syncing into your IDE gives you all 60 slash-commands, 19 agents, plus the framework templates that they delegate into.
+The bundled `kit/` is an opinionated **brownfield planning workflow** in Portuguese — milestones, phases, requirements, planning, execution with atomic commits and checkpoints, retrospective auditing. Installing `@luanpdd/kit-mcp` and syncing into your IDE gives you all 60+ slash-commands, 24+ agents, plus the framework templates that they delegate into.
 If that's not what you want, point `--kit-root` at your own folder and ignore everything under `kit/` — the infrastructure (registry, sync, gates, forensics, MCP server) works the same regardless of what kit you load.
+### Observability suite (v1.9)
+A complete observability layer derived from *Observability Engineering* (Charity Majors, Liz Fong-Jones, George Miranda — O'Reilly, 2022) ships in the kit. It integrates deeply with the Supabase suite (v1.8) — every Supabase agent now consults observability skills, and the new `incident-investigator` agent uses `mcp__supabase__get_logs` / `execute_sql` / `get_advisors` to apply the **Core Analysis Loop** on real incidents.
+**11 skills** in `kit/skills/`:
+- `_shared-observability/glossary.md` — canonical bilingual vocabulary (PT-BR↔EN)
+- `structured-events`, `distributed-tracing`, `opentelemetry-standard`, `core-analysis-loop` — foundationals
+- `observability-driven-development`: the 4 pre-PR questions ("Does it do what I expected? How does it compare to the previous version? Are users actually using it? Do any anomalies emerge?")
+- `event-based-slos`, `burn-rate-alerting` — SLO definition + predictive burn alerts
+- `telemetry-sampling`, `telemetry-pipelines`, `observability-maturity-model` — scale + culture
+**5 agents** in `kit/agents/`:
+- `observability-instrumenter` — generates OTel + canonical attribute patches
+- `incident-investigator` — Core Analysis Loop with persistent state in `.planning/investigations/`
+- `slo-engineer` — generates `SLO.md` + SQL migrations to materialize SLI events
+- `burn-rate-forecaster` — calculates burn rate, ETA exhaustion, alert config
+- `omm-auditor` — scores 5 OMM capabilities (resilience, code quality, complexity, release cadence, user behavior)
+**6 commands**:
+- `/observabilidade <subcommand>` — single orchestrator (analog to `/supabase`) — dispatches to the 5 agents above with PT/EN synonyms
+- `/instrumentar-fase` — generates `INSTRUMENTATION.md` per plan after `/planejar-fase`
+- `/investigar-producao` — guided Core Analysis Loop with persistent state
+- `/definir-slo` — creates SLO definition + SQL materialized view
+- `/burn-rate-status` — table `[SLO | budget burned | ETA | action]`, also runnable in `/loop`
+- `/auditar-observabilidade` — generates OMM-REPORT.md scored
+**Quick start example:**
+```bash
+# Define an SLO for a critical journey
+/observabilidade slo "checkout"
+# Investigate a production incident with Core Analysis Loop
+/observabilidade investigar "checkout SLO burn rate = 8 às 14:32"
+# Score project against Observability Maturity Model
+/observabilidade omm
+```
+### SRE Engagement suite (v1.10)
+A production engineering layer derived from *Site Reliability Engineering: How Google Runs Production Systems* (Beyer, Jones, Petoff, Murphy — Google/O'Reilly, 2016) ships in the kit. It composes with the Supabase suite (v1.8) and the Observability suite (v1.9) into a coherent production engineering stack — Supabase agents now suggest PRR before launch, every Edge Function template includes the **4 golden signals**, and `incident-investigator` outputs feed directly into blameless postmortems via `/postmortem --from-investigation <id>`.
+**6 skills** in `kit/skills/`:
+- `_shared-sre/glossary.md` — canonical bilingual vocabulary (PT-BR↔EN) — SLI/SLO/SLA, error budget, burn rate, toil, postmortem, blameless, PRR, golden signals, risk continuum, MTTR/MTBF
+- `sre-risk-management`: risk continuum (ch. 3), 99.99% wisdom ("as reliable as needs to be, no more"), error budget as explicit risk × innovation balance
+- `four-golden-signals`: Latency + Traffic + Errors + Saturation (ch. 6), histograms with exponential bucketing, success vs error latency separated, percentiles vs mean (long tail)
+- `eliminating-toil`: canonical toil definition (manual + repetitive + automatable + tactical + no enduring value + scales linearly), ≤ 50% rule (ch. 5), automation patterns
+- `blameless-postmortems`: canonical 9-section template (ch. 15), "no postmortem left unreviewed", blame culture as anti-pattern, Wheel of Misfortune
+- `production-readiness-review`: PRR checklist (ch. 32) with 6 axes (System architecture, Instrumentation, Emergency response, Capacity planning, Change management, Performance), 3 engagement models
+**4 agents** in `kit/agents/`:
+- `golden-signals-instrumenter` — specialization of `observability-instrumenter` (v1.9); generates OTel patches with the 4 golden signals (Latency=histogram, Traffic=counter, Errors=counter by `error.type`, Saturation=gauge)
+- `toil-auditor` — analyzes git log + shell scripts + manual commands in README/runbooks; produces `TOIL-AUDIT.md` with P0/P1/P2 priority + estimated effort
+- `postmortem-writer` — natural continuation of `incident-investigator` (v1.9); reads `.planning/investigations/<id>.md` and produces blameless postmortem (Summary, Impact, Root Causes, Trigger, Resolution, Detection, Action Items, Lessons Learned, Timeline UTC)
+- `prr-conductor` — conducts Production Readiness Review for service/feature; reads schema (Supabase MCP), Edge Functions, `.planning/slos/`, audit logs; produces `PRR-REPORT.md` scored across the 6 axes
+**6 commands**:
+- `/sre <subcommand>` — single orchestrator (analog to `/supabase` v1.8 and `/observabilidade` v1.9) — dispatches to the 4 agents with PT/EN synonyms
+- `/golden-signals` — invokes `golden-signals-instrumenter` for service/Edge Function/phase; generates `GOLDEN-SIGNALS.md` with OTel-ready instrumentation
+- `/auditar-toil` — invokes `toil-auditor`; generates `.planning/TOIL-AUDIT.md`
+- `/postmortem` — invokes `postmortem-writer`; supports `--from-investigation <id>` (continue from v1.9 investigation) or `--incident "<description>"` (standalone)
+- `/prr` — invokes `prr-conductor`; supports `--service <name>` or `--feature <description>`; generates `PRR-REPORT.md`
+- `/risk-budget` — displays current error budget vs risk continuum, citing SLOs from v1.9 (`.planning/slos/`); applies `sre-risk-management` skill
+**3 audit gates** in `gates/`:
+- `golden-signals-coverage` (blocking, pre-verify) — verifies code in `supabase/functions/**`, `src/**`, `lib/**` covers the 4 golden signals (skips gracefully on content-only phases)
+- `postmortem-template-required` (blocking, pre-conclude) — blocks `/concluir-marco` if any `.planning/investigations/<id>.md` lacks a corresponding `.planning/postmortems/<id>.md` (`Status: INCONCLUSIVE` is the only exception)
+- `prr-checklist-coverage` (blocking, pre-verify): verifies every `PRR-REPORT.md` in `.planning/prr/**/*.md` covers the 6 canonical axes; "skipping an axis = invalid approval"
+**Lifecycle integration:**
+- `/forense` — after Core Analysis Loop closes with VALIDATED root cause, suggests chain `/postmortem --from-investigation <id>` (Phase 40 / INT-FW-V2-01)
+- `/concluir-marco` — opt-in gate `workflow.complete_milestone_prr_gate=true` requires `PRR-REPORT.md` with status `passed` for production-bound features before archive (Phase 40 / INT-FW-V2-02)
+- `/auditar-marco`: auto-invokes `/auditar-toil` when `workflow.audit_milestone_toil=true` (default); result feeds OMM Capability 3 scoring via `omm-auditor` (Phase 40 / INT-FW-V2-03)
+**Quick start example — end-to-end SRE workflow:**
+```bash
+# Before launching a new feature in production — PRR
+/sre prr --feature "checkout v2"
+# While instrumenting service — apply 4 golden signals
+/sre golden-signals supabase/functions/orders/index.ts
+# Audit team toil quarterly
+/sre toil
+# When SLO burn alert fires — investigate (v1.9 deep loop), then postmortem (v1.10)
+/investigar-producao "checkout SLO burn rate = 8 às 14:32"
+/sre postmortem --from-investigation checkout-2026-05-07
+# Or for framework-level failures:
+/forense "framework workflow X falhou em produção"
+/sre postmortem --incident "framework workflow X failed (see .planning/forensics/report-*)"
+# Risk dashboard against SLO budgets
+/sre risk-budget
+```
 ---
 ## Prerequisites
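
Reviewer note: the README describes `postmortem-template-required` as blocking `/concluir-marco` when any `.planning/investigations/<id>.md` lacks a matching `.planning/postmortems/<id>.md`, with `Status: INCONCLUSIVE` as the only exception. The shipped gate (`package/gates/postmortem-template-required.md`) is not shown in this excerpt, so the following is only a sketch of that pairing rule under stated assumptions: the function name `missing_postmortems` is hypothetical, and the status marker is assumed to appear literally as `Status: INCONCLUSIVE` somewhere in the investigation file.

```shell
#!/usr/bin/env bash
# Sketch only (not the shipped gate): print the id of every investigation
# that has no matching postmortem, skipping INCONCLUSIVE investigations.
missing_postmortems() {
  local root="${1:-.planning}"
  local inv id
  for inv in "$root"/investigations/*.md; do
    [ -f "$inv" ] || continue                 # glob matched nothing
    if grep -q "Status: INCONCLUSIVE" "$inv"; then
      continue                                # documented exception
    fi
    id=$(basename "$inv" .md)
    if [ ! -f "$root/postmortems/$id.md" ]; then
      echo "$id"
    fi
  done
}

# Example: report the verdict for the current project.
missing=$(missing_postmortems .planning)
if [ -n "$missing" ]; then
  echo "FAIL: investigations without postmortem: $missing"
else
  echo "PASS: every investigation has a postmortem (or is INCONCLUSIVE)"
fi
```

Because the handoff between `incident-investigator` and `postmortem-writer` is file-based, a check of this kind needs nothing beyond the two directories.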

package/gates/golden-signals-coverage.md ADDED

@@ -0,0 +1,133 @@
+---
+id: golden-signals-coverage
+stage: pre-verify
+blocking: true
+description: Validates that service/Edge Function code touched in a phase contains the 4 golden signals (Latency=histogram, Traffic=counter, Errors=counter, Saturation=gauge). Skipped if the phase only touches markdown.
+---
+# Golden signals coverage gate
+**When to run:** pre-verify (blocking: the phase does not verify until coverage is complete).
+## Check
+```bash
+#!/usr/bin/env bash
+# PT-BR: validar que código de serviço/Edge Function tocado em fase tem 4 golden signals.
+# Estratégia: descobrir arquivos tocados (supabase/functions/** ou STATE.md current_phase code paths),
+# rodar grep por histogram/counter/gauge/saturation, contar matches por sinal.
+# Bash 3.2-portable (macOS default).
+set -e
+# PT-BR: identificar fase atual via STATE.md
+STATE_FILE=".planning/STATE.md"
+CURRENT_PHASE=""
+if [ -f "$STATE_FILE" ]; then
+  CURRENT_PHASE=$(grep -E "^Fase:" "$STATE_FILE" 2>/dev/null | head -1 | sed -E 's/^Fase: *([0-9]+).*/\1/')
+fi
+# PT-BR: candidatos a arquivos de código tocados — escopo principal Supabase Edge + qualquer .ts/.js/.py
+# em paths declarados pela fase (heurística: supabase/functions/** SEMPRE inspecionado).
+CODE_FILES=""
+if [ -d "supabase/functions" ]; then
+  CODE_FILES=$(find supabase/functions -type f \( -name "*.ts" -o -name "*.js" -o -name "*.mjs" \) 2>/dev/null)
+fi
+# PT-BR: também inspecionar lib/ e src/ se existirem (apps Node/Deno fora de Supabase)
+if [ -d "src" ]; then
+  ADDITIONAL=$(find src -type f \( -name "*.ts" -o -name "*.js" -o -name "*.mjs" -o -name "*.py" \) 2>/dev/null)
+  CODE_FILES="$CODE_FILES
+$ADDITIONAL"
+fi
+if [ -d "lib" ]; then
+  ADDITIONAL=$(find lib -type f \( -name "*.ts" -o -name "*.js" -o -name "*.mjs" -o -name "*.py" \) 2>/dev/null)
+  CODE_FILES="$CODE_FILES
+$ADDITIONAL"
+fi
+# PT-BR: filtrar linhas vazias
+CODE_FILES=$(echo "$CODE_FILES" | grep -v "^$" || true)
+# PT-BR: se fase não toca código (só markdown/docs), pular gate
+if [ -z "$CODE_FILES" ]; then
+  echo "INFO: nenhum arquivo de código (.ts/.js/.py) encontrado em supabase/functions/** | src/** | lib/** — fase parece content-only. Gate skipped."
+  exit 0
+fi
+# PT-BR: contar matches por signal
+LATENCY_HITS=0
+TRAFFIC_HITS=0
+ERRORS_HITS=0
+SATURATION_HITS=0
+# PT-BR: process file list line-by-line para portabilidade bash 3.2
+OLDIFS="$IFS"
+IFS='
+'
+for f in $CODE_FILES; do
+  [ -z "$f" ] && continue
+  [ ! -f "$f" ] && continue
+  # PT-BR: Latency = histogram (createHistogram, recordHistogram, histogram.record)
+  if grep -qE "histogram|Histogram" "$f" 2>/dev/null; then
+    LATENCY_HITS=$((LATENCY_HITS + 1))
+  fi
+  # PT-BR: Traffic + Errors = counter (Errors counter dimensionado por error.type)
+  if grep -qE "counter|Counter|createCounter" "$f" 2>/dev/null; then
+    TRAFFIC_HITS=$((TRAFFIC_HITS + 1))
+    ERRORS_HITS=$((ERRORS_HITS + 1))
+  fi
+  # PT-BR: Saturation = gauge (createObservableGauge, gauge.record) ou string saturation
+  if grep -qE "gauge|Gauge|saturation|Saturation" "$f" 2>/dev/null; then
+    SATURATION_HITS=$((SATURATION_HITS + 1))
+  fi
+done
+IFS="$OLDIFS"
+# PT-BR: gate passa se TODOS os 4 signals têm pelo menos 1 hit em algum arquivo de código
+MISSING=""
+[ "$LATENCY_HITS" -eq 0 ] && MISSING="$MISSING Latency(histogram)"
+[ "$TRAFFIC_HITS" -eq 0 ] && MISSING="$MISSING Traffic(counter)"
+[ "$ERRORS_HITS" -eq 0 ] && MISSING="$MISSING Errors(counter)"
+[ "$SATURATION_HITS" -eq 0 ] && MISSING="$MISSING Saturation(gauge)"
+if [ -z "$MISSING" ]; then
+  echo "PASS: 4 golden signals cobertos em código (Latency=$LATENCY_HITS files / Traffic=$TRAFFIC_HITS / Errors=$ERRORS_HITS / Saturation=$SATURATION_HITS)"
+  exit 0
+else
+  echo "FAIL: golden signals ausentes em código tocado:$MISSING"
+  echo "Sugestão: rodar /sre golden-signals <service> ou /golden-signals para gerar instrumentação OTel canônica."
+  echo "Cross-ref: kit/skills/four-golden-signals/SKILL.md + kit/agents/golden-signals-instrumenter.md"
+  exit 1
+fi
+```
+## Verdict
+- **passed**: all 4 signals (Latency / Traffic / Errors / Saturation) are present in at least 1 code file in the project
+- **passed (skip)**: the project has no code (markdown / docs only); gate not applicable
+- **block**: at least 1 signal is missing from the code touched by the phase
+## Why
+The Google SRE book (ch. 6, *Monitoring Distributed Systems*) defines the **4 golden signals** as the universal minimum coverage of operational health for user-facing services: Latency (histogram with percentiles, success vs error separated), Traffic (counter per endpoint × method), Errors (counter per `error.type` enum of 5-15 values, NEVER `error.message`), Saturation (gauge of the scarcest resource, named explicitly).
+Without this gate, phases ship Edge Functions / services with no basic instrumentation and dashboards grow ad hoc (CPU, memory, threads: *causes*, not *symptoms*). The gate enforces the canonical standard: every code PR must cover the 4 signals, or explain their absence via the skip (the phase only changes markdown).
+Canonical agent cross-ref: [`golden-signals-instrumenter`](../kit/agents/golden-signals-instrumenter.md) (Phase 37 / AGCORE-SRE-01). Skill: [`four-golden-signals`](../kit/skills/four-golden-signals/SKILL.md) (Phase 36 / SKFD-SRE-02).
+## REQ
+QA-SRE-01.
+## Configuration
+The gate is **blocking** by default. To make it warn-only (during initial adoption on legacy code):
+```bash
+node ./.claude/framework/bin/tools.cjs config-set workflow.golden_signals_coverage_warn true
+```
+(Note: implementation of the warn-only toggle is deferred; the current gate only checks for the presence/absence of the regexes and does not consult config.)
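
Reviewer note: to see how the three regexes in the `## Check` script classify a single file, the following sketch applies the same patterns to one path. It is not part of the package; the function name `signals_in_file` and the fixture are hypothetical. One thing it makes visible is that, as in the gate, a single counter match is credited to both Traffic and Errors.

```shell
#!/usr/bin/env bash
# Sketch: list which golden signals the gate's regexes would detect in one
# file. Patterns are copied from the Check script; everything else is
# illustrative.
signals_in_file() {
  local f="$1" found=""
  if grep -qE "histogram|Histogram" "$f" 2>/dev/null; then
    found="$found Latency"
  fi
  if grep -qE "counter|Counter|createCounter" "$f" 2>/dev/null; then
    found="$found Traffic Errors"   # one counter satisfies both signals
  fi
  if grep -qE "gauge|Gauge|saturation|Saturation" "$f" 2>/dev/null; then
    found="$found Saturation"
  fi
  echo "${found# }"                 # drop the leading space
}

# Example: a synthetic fixture with a histogram and a counter but no gauge.
fixture=$(mktemp)
cat > "$fixture" <<'EOF'
const latency = meter.createHistogram("http.server.duration");
const requests = meter.createCounter("http.server.requests");
EOF
signals_in_file "$fixture"   # prints: Latency Traffic Errors
rm -f "$fixture"
```

A fixture like this one corresponds to the "synthetic fixture with gaps" case mentioned in the changelog: Saturation is absent, so the gate would be expected to fail on it.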

package/gates/obs-agents-mcp-supabase.md ADDED

@@ -0,0 +1,86 @@
+---
+id: obs-agents-mcp-supabase
+stage: pre-verify
+blocking: true
+description: Validates that observability agents needing the Supabase MCP declare mcp__supabase__* tools in their frontmatter (incident-investigator, slo-engineer, burn-rate-forecaster, omm-auditor).
+---
+# Observability agents MCP Supabase declaration gate
+**When to run:** pre-verify.
+## Check
+```bash
+#!/usr/bin/env bash
+# PT-BR: agents que usam MCP Supabase devem declarar tools mcp__supabase__* no frontmatter.
+# Anti-pitfall: declaração ausente faz Claude Code não autorizar tool, agent falha em runtime.
+set -e
+VIOLATIONS=0
+# PT-BR: agents que DEVEM declarar mcp__supabase__*
+declare_required() {
+  local agent="$1"
+  local required_tools="$2"   # tools separados por |
+  local file="kit/agents/$agent.md"
+  if [ ! -f "$file" ]; then
+    echo "FAIL: $file — agent ausente"
+    VIOLATIONS=$((VIOLATIONS + 1))
+    return
+  fi
+  # Collect every frontmatter line (between the first two '---' markers); the tool check below matches against this whole block, not only the tools field.
+  local in_frontmatter=0
+  local tools_block=""
+  while IFS= read -r line; do
+    if [ "$line" = "---" ]; then
+      if [ "$in_frontmatter" -eq 0 ]; then
+        in_frontmatter=1
+      else
+        break
+      fi
+    elif [ "$in_frontmatter" -eq 1 ]; then
+      tools_block="$tools_block $line"
+    fi
+  done < "$file"
+  local IFS='|'
+  for tool in $required_tools; do
+    if ! echo "$tools_block" | grep -qF "$tool"; then
+      echo "FAIL: $file: '$tool' is not declared in the frontmatter tools"
+      VIOLATIONS=$((VIOLATIONS + 1))
+    fi
+  done
+}
+# incident-investigator uses get_logs/execute_sql/get_advisors
+declare_required "incident-investigator" "mcp__supabase__get_logs|mcp__supabase__execute_sql|mcp__supabase__get_advisors"
+# slo-engineer uses execute_sql + apply_migration
+declare_required "slo-engineer" "mcp__supabase__execute_sql|mcp__supabase__apply_migration"
+# burn-rate-forecaster uses execute_sql
+declare_required "burn-rate-forecaster" "mcp__supabase__execute_sql"
+# omm-auditor uses execute_sql (SLI queries)
+declare_required "omm-auditor" "mcp__supabase__execute_sql"
+if [ "$VIOLATIONS" -eq 0 ]; then
+  echo "PASS: 4 observability agents declare mcp__supabase__* correctly"
+  exit 0
+else
+  echo "FAIL: $VIOLATIONS violation(s)"
+  exit 1
+fi
+```
+## Why
+Observability agents that run the Core Analysis Loop or SLI queries depend on `mcp__supabase__*`. Without a declaration in the frontmatter `tools` field, Claude Code does not authorize the tool at runtime and the agent fails (precedent: anti-pitfall identified in v1.8 with the supabase-* agents).
+## REQ
+QA-02.
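
The check above collects every line between the two `---` delimiters and searches that text for each required tool, so a mention in the agent body does not count as a declaration. A minimal sketch of the same idea on a throwaway file, assuming a single-line `tools:` field (the real agent files may format it differently). `frontmatter_of` and the sample content are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print only the lines between the first two '---' lines.
frontmatter_of() {
  awk '/^---$/ { n++; next } n == 1 { print } n >= 2 { exit }' "$1"
}

# Sample agent file (illustrative content only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
---
name: burn-rate-forecaster
tools: Read, mcp__supabase__execute_sql
---
The body mentions mcp__supabase__get_logs, which must not count as declared.
EOF

# Declared in the frontmatter: should be found.
if frontmatter_of "$sample" | grep -qF "mcp__supabase__execute_sql"; then
  declared="yes"
else
  declared="no"
fi

# Mentioned only in the body: should not be found.
if frontmatter_of "$sample" | grep -qF "mcp__supabase__get_logs"; then
  body_leak="yes"
else
  body_leak="no"
fi

echo "declared=$declared body_leak=$body_leak"
rm -f "$sample"
```

Restricting the search to the frontmatter is what lets the gate catch an agent that documents a tool in prose but never requests it.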

package/gates/obs-skills-frontmatter.md ADDED

@@ -0,0 +1,76 @@
+---
+id: obs-skills-frontmatter
+stage: pre-verify
+blocking: true
+description: Validates that observability skills have complete frontmatter (name + description ≤ 200 chars) and the mandatory template sections.
+---
+# Observability skills frontmatter gate
+**When to run:** pre-verify.
+## Check
+```bash
+#!/usr/bin/env bash
+# Validate that each skill in kit/skills/{structured-events,distributed-tracing,opentelemetry-standard,core-analysis-loop,observability-driven-development,event-based-slos,burn-rate-alerting,telemetry-sampling,telemetry-pipelines,observability-maturity-model}/SKILL.md
+# has complete frontmatter + the mandatory sections.
+# Portable bash 3.2+ (macOS default).
+set -e
+VIOLATIONS=0
+SKILLS="structured-events distributed-tracing opentelemetry-standard core-analysis-loop observability-driven-development event-based-slos burn-rate-alerting telemetry-sampling telemetry-pipelines observability-maturity-model"
+for skill in $SKILLS; do
+  file="kit/skills/$skill/SKILL.md"
+  if [ ! -f "$file" ]; then
+    echo "FAIL: $file: skill missing"
+    VIOLATIONS=$((VIOLATIONS + 1))
+    continue
+  fi
+  # Frontmatter name present
+  if ! grep -qE '^name:' "$file"; then
+    echo "FAIL: $file: frontmatter 'name:' missing"
+    VIOLATIONS=$((VIOLATIONS + 1))
+  fi
+  # Frontmatter description present
+  if ! grep -qE '^description:' "$file"; then
+    echo "FAIL: $file: frontmatter 'description:' missing"
+    VIOLATIONS=$((VIOLATIONS + 1))
+  else
+    desc=$(grep -E '^description:' "$file" | head -1 | sed 's/description: //')
+    len=${#desc}
+    if [ "$len" -gt 200 ]; then
+      echo "FAIL: $file: description=$len chars (limit 200, anti-pitfall A2)"
+      VIOLATIONS=$((VIOLATIONS + 1))
+    fi
+  fi
+  # 4+ H2 sections (Quando usar, Regras absolutas, Patterns canônicos, Anti-patterns OR Verificação)
+  # grep -c exits 1 when the count is 0; "|| true" keeps set -e from aborting before the FAIL message
+  h2_count=$(grep -cE '^## ' "$file" || true)
+  if [ "$h2_count" -lt 4 ]; then
+    echo "FAIL: $file: only $h2_count H2 sections (minimum 4: Quando usar, Regras absolutas, Patterns canônicos, Anti-patterns/Verificação)"
+    VIOLATIONS=$((VIOLATIONS + 1))
+  fi
+done
+if [ "$VIOLATIONS" -eq 0 ]; then
+  echo "PASS: 10 observability skills with complete frontmatter + 4+ H2 sections"
+  exit 0
+else
+  echo "FAIL: $VIOLATIONS violation(s)"
+  exit 1
+fi
+```
+## Why
+- Skills without a `description` do not appear in `listKit` (the LLM cannot find the trigger).
+- A `description` over 200 chars bloats CLAUDE.md unnecessarily (anti-pitfall A2).
+- Skills without a fixed template produce inconsistent outputs, so the gate enforces a standard.
+## REQ
+QA-01.
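
The 200-character limit above is measured on the text that follows `description: ` on the first matching line. A minimal sketch of that measurement on a throwaway file. `description_length` and the sample skill are hypothetical, and `${#var}` counts characters according to the current locale:

```bash
#!/usr/bin/env bash
# Hypothetical helper: length of the first 'description:' value in a file.
description_length() {
  local desc
  desc=$(grep -E '^description:' "$1" | head -1 | sed 's/^description: *//')
  echo "${#desc}"
}

# Sample skill file (illustrative content only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
---
name: telemetry-sampling
description: Sampling strategies for telemetry.
---
EOF

len=$(description_length "$sample")
if [ "$len" -gt 200 ]; then
  echo "FAIL: description=$len chars"
else
  echo "OK: description=$len chars"
fi
rm -f "$sample"
```

Only the first `description:` line is measured, so a multi-line YAML description would be undercounted by this approach.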

package/gates/omm-no-regression.md ADDED

@@ -0,0 +1,83 @@
+---
+id: omm-no-regression
+stage: pre-conclude
+blocking: false
+description: Validates that none of the 5 OMM capabilities regressed vs the previous milestone. Runnable in /concluir-marco. Non-blocking (warn) by default; configurable via workflow.omm_no_regression.
+---
+# OMM no-regression gate
+**When to run:** pre-conclude (before `/concluir-marco` archives the milestone).
+## Check
+```bash
+#!/usr/bin/env bash
+# Validate that the current OMM-REPORT.md has no capability regressed vs the previous milestone.
+# Strategy: compare the scores in the current OMM-REPORT.md against the last archived one.
+set -e
+CURRENT=".planning/OMM-REPORT.md"
+if [ ! -f "$CURRENT" ]; then
+  echo "WARN: $CURRENT missing; run /auditar-observabilidade first. Skipping gate."
+  exit 0
+fi
+# Find the previous OMM-REPORT.md in archived milestones
+# (sort -r picks the lexicographically last path, so this assumes milestone paths sort chronologically)
+PREVIOUS=$(find .planning/milestones -name "OMM-REPORT.md" -type f 2>/dev/null | sort -r | head -1)
+if [ -z "$PREVIOUS" ] || [ ! -f "$PREVIOUS" ]; then
+  echo "INFO: no previous archived OMM-REPORT (first milestone with OMM). Skipping regression check."
+  exit 0
+fi
+# Extract scores from the current and previous OMM-REPORT.md
+# Expected format: "| 1 | Resiliência | 3 | ... |"
+REGRESSIONS=0
+for cap in 1 2 3 4 5; do
+  current_score=$(grep -E "^\| $cap \| " "$CURRENT" 2>/dev/null | awk -F'|' '{print $4}' | tr -d ' ' | head -1)
+  previous_score=$(grep -E "^\| $cap \| " "$PREVIOUS" 2>/dev/null | awk -F'|' '{print $4}' | tr -d ' ' | head -1)
+  if [ -z "$current_score" ] || [ -z "$previous_score" ]; then
+    continue
+  fi
+  # Skip rows whose score is not a plain integer, so the numeric test below cannot error out
+  case "$current_score$previous_score" in
+    *[!0-9]*) continue ;;
+  esac
+  if [ "$current_score" -lt "$previous_score" ]; then
+    cap_name=$(grep -E "^\| $cap \| " "$CURRENT" | awk -F'|' '{print $3}' | xargs)
+    echo "REGRESSION: capability $cap ($cap_name) regressed from $previous_score to $current_score"
+    REGRESSIONS=$((REGRESSIONS + 1))
+  fi
+done
+if [ "$REGRESSIONS" -eq 0 ]; then
+  echo "PASS: none of the 5 OMM capabilities regressed vs $PREVIOUS"
+  exit 0
+else
+  echo "WARN: $REGRESSIONS capability(ies) regressed"
+  # blocking=false by default. To make it blocking:
+  #   workflow.omm_no_regression=true
+  if [ "$(node ./.claude/framework/bin/tools.cjs config-get workflow.omm_no_regression 2>/dev/null || echo false)" = "true" ]; then
+    exit 1
+  fi
+  exit 0
+fi
+```
+## Why
+An OMM regression alerts the team that something deteriorated despite the milestone's effort. Without this gate, regressions go unnoticed and accumulate as invisible tech debt.
+The gate is non-blocking by default to avoid initial noise; the `workflow.omm_no_regression=true` flag is opt-in once the team is confident.
+## REQ
+QA-03 + INT-FW-04 + INT-FW-05.
+## Configuration
+```bash
+# Make the gate blocking (recommended after 2-3 consecutive milestones without regression)
+node ./.claude/framework/bin/tools.cjs config-set workflow.omm_no_regression true
+```
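
The comparison above reads each score from the fourth `|`-separated field of a capability row. A minimal sketch of that extraction on two throwaway reports, assuming rows follow the format named in the script comment. `cap_score` and the sample rows are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the score column for capability $2 in report $1.
cap_score() {
  grep -E "^\| $2 \| " "$1" 2>/dev/null | awk -F'|' '{print $4}' | tr -d ' ' | head -1
}

# Sample reports (illustrative rows only).
previous=$(mktemp)
current=$(mktemp)
printf '| 1 | Resilience | 3 | sample note |\n' > "$previous"
printf '| 1 | Resilience | 2 | sample note |\n' > "$current"

prev_score=$(cap_score "$previous" 1)
curr_score=$(cap_score "$current" 1)

# A lower score in the current report counts as a regression.
if [ "$curr_score" -lt "$prev_score" ]; then
  result="regression"
else
  result="ok"
fi
echo "capability 1: $prev_score -> $curr_score ($result)"
rm -f "$previous" "$current"
```

Because the row starts with `|`, awk sees an empty first field, which is why the score sits in `$4` and not `$3`.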