npm - mishkan-harness - Versions diffs - 0.1.0 - Mend

mishkan-harness 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (186) hide show

package/LICENSE +21 -0
package/README.md +205 -0
package/bin/mishkan.js +221 -0
package/docs/design/MISHKAN_agent_aliases.md +140 -0
package/docs/design/MISHKAN_decisions.md +172 -0
package/docs/design/MISHKAN_harness_design.md +820 -0
package/docs/design/MISHKAN_ontology.md +87 -0
package/docs/design/MISHKAN_token_optimisation.md +181 -0
package/docs/engineer/README.md +37 -0
package/docs/engineer/profile.example.md +79 -0
package/docs/usage/01-installation.md +178 -0
package/docs/usage/02-project-init.md +151 -0
package/docs/usage/03-orchestration.md +218 -0
package/docs/usage/04-memory-layer.md +201 -0
package/docs/usage/05-selective-ingest.md +177 -0
package/docs/usage/06-llm-providers.md +195 -0
package/docs/usage/07-troubleshooting.md +316 -0
package/docs/usage/08-glossary.md +154 -0
package/docs/usage/09-workflows.md +123 -0
package/docs/usage/README.md +77 -0
package/package.json +43 -0
package/payload/install/settings.hooks.json +47 -0
package/payload/mishkan/AGENT_SPEC.md +154 -0
package/payload/mishkan/agents/ahikam.md +58 -0
package/payload/mishkan/agents/aholiab.md +68 -0
package/payload/mishkan/agents/asaph.md +73 -0
package/payload/mishkan/agents/baruch.md +88 -0
package/payload/mishkan/agents/benaiah.md +76 -0
package/payload/mishkan/agents/bezalel.md +83 -0
package/payload/mishkan/agents/caleb.md +74 -0
package/payload/mishkan/agents/deborah.md +63 -0
package/payload/mishkan/agents/elasah.md +58 -0
package/payload/mishkan/agents/eliashib.md +68 -0
package/payload/mishkan/agents/ezra.md +69 -0
package/payload/mishkan/agents/hanun.md +64 -0
package/payload/mishkan/agents/hiram.md +68 -0
package/payload/mishkan/agents/hizkiah.md +76 -0
package/payload/mishkan/agents/huldah.md +59 -0
package/payload/mishkan/agents/huram.md +66 -0
package/payload/mishkan/agents/hushai.md +59 -0
package/payload/mishkan/agents/igal.md +58 -0
package/payload/mishkan/agents/ira.md +86 -0
package/payload/mishkan/agents/jahaziel.md +71 -0
package/payload/mishkan/agents/jakin.md +66 -0
package/payload/mishkan/agents/jehonathan.md +62 -0
package/payload/mishkan/agents/jehoshaphat.md +68 -0
package/payload/mishkan/agents/joab.md +71 -0
package/payload/mishkan/agents/joah.md +62 -0
package/payload/mishkan/agents/maaseiah.md +61 -0
package/payload/mishkan/agents/meremoth.md +65 -0
package/payload/mishkan/agents/meshullam.md +67 -0
package/payload/mishkan/agents/nathan.md +70 -0
package/payload/mishkan/agents/nehemiah.md +93 -0
package/payload/mishkan/agents/obed.md +60 -0
package/payload/mishkan/agents/oholiab.md +67 -0
package/payload/mishkan/agents/palal.md +63 -0
package/payload/mishkan/agents/phinehas.md +73 -0
package/payload/mishkan/agents/rehum.md +60 -0
package/payload/mishkan/agents/salma.md +69 -0
package/payload/mishkan/agents/seraiah.md +73 -0
package/payload/mishkan/agents/shallum.md +66 -0
package/payload/mishkan/agents/shaphan.md +64 -0
package/payload/mishkan/agents/shemaiah.md +67 -0
package/payload/mishkan/agents/shevna.md +58 -0
package/payload/mishkan/agents/uriah.md +70 -0
package/payload/mishkan/agents/zaccur.md +58 -0
package/payload/mishkan/agents/zadok.md +67 -0
package/payload/mishkan/agents/zerubbabel.md +69 -0
package/payload/mishkan/cognee/.env.curated.example +61 -0
package/payload/mishkan/cognee/.env.example +165 -0
package/payload/mishkan/cognee/Dockerfile +50 -0
package/payload/mishkan/cognee/README.md +129 -0
package/payload/mishkan/cognee/docker-compose.curated-ui.yml +61 -0
package/payload/mishkan/cognee/docker-compose.curated.yml +85 -0
package/payload/mishkan/cognee/docker-compose.hardening.yml +16 -0
package/payload/mishkan/cognee/docker-compose.selfhosted.yml +114 -0
package/payload/mishkan/cognee/docker-compose.ui.yml +70 -0
package/payload/mishkan/cognee/docker-compose.yml +71 -0
package/payload/mishkan/cognee/ingest-curated.py +92 -0
package/payload/mishkan/commands/dep-audit.md +24 -0
package/payload/mishkan/commands/mishkan-init.md +25 -0
package/payload/mishkan/commands/mishkan-resume.md +21 -0
package/payload/mishkan/commands/promote.md +19 -0
package/payload/mishkan/commands/sefer-pull.md +19 -0
package/payload/mishkan/commands/sprint-close.md +21 -0
package/payload/mishkan/config/curated-library.yaml +113 -0
package/payload/mishkan/config/improvement-queries.md +29 -0
package/payload/mishkan/config/model-routing.yaml +87 -0
package/payload/mishkan/config/projects.yaml +38 -0
package/payload/mishkan/evals/baruch/README.md +93 -0
package/payload/mishkan/evals/baruch/fixtures/invalid/bad-outcome-enum.json +15 -0
package/payload/mishkan/evals/baruch/fixtures/invalid/bad-sprint-pattern.json +15 -0
package/payload/mishkan/evals/baruch/fixtures/invalid/bad-trigger-enum.json +15 -0
package/payload/mishkan/evals/baruch/fixtures/invalid/malformed-json.json +7 -0
package/payload/mishkan/evals/baruch/fixtures/invalid/missing-required-field.json +14 -0
package/payload/mishkan/evals/baruch/fixtures/valid/blocked-vendor.json +15 -0
package/payload/mishkan/evals/baruch/fixtures/valid/curated-shortcircuit.json +15 -0
package/payload/mishkan/evals/baruch/fixtures/valid/partial-no-write.json +14 -0
package/payload/mishkan/evals/baruch/fixtures/valid/resolved-cross-harness.json +15 -0
package/payload/mishkan/evals/baruch/golden_case/expected.yaml +35 -0
package/payload/mishkan/evals/baruch/golden_case/input.yaml +47 -0
package/payload/mishkan/evals/baruch/golden_case/produced.json +15 -0
package/payload/mishkan/evals/baruch/run.sh +129 -0
package/payload/mishkan/hooks/model-route.py +96 -0
package/payload/mishkan/hooks/post-tool-observe.sh +45 -0
package/payload/mishkan/hooks/pre-tool-security.sh +150 -0
package/payload/mishkan/hooks/session-start.sh +20 -0
package/payload/mishkan/hooks/stop-reporter.sh +29 -0
package/payload/mishkan/ontology.md +87 -0
package/payload/mishkan/rules/backend/yasad.md +23 -0
package/payload/mishkan/rules/common/dependencies.md +53 -0
package/payload/mishkan/rules/common/quality.md +16 -0
package/payload/mishkan/rules/common/security.md +20 -0
package/payload/mishkan/rules/documentation/sefer.md +19 -0
package/payload/mishkan/rules/frontend/panim.md +21 -0
package/payload/mishkan/rules/infrastructure/migdal.md +22 -0
package/payload/mishkan/scripts/dependency-audit.sh +171 -0
package/payload/mishkan/scripts/ensure-curated-box.sh +66 -0
package/payload/mishkan/scripts/mishkan-ingest.sh +92 -0
package/payload/mishkan/scripts/observability-aggregate.sh +57 -0
package/payload/mishkan/scripts/seed-curated-library.sh +62 -0
package/payload/mishkan/scripts/sync-profile.sh +65 -0
package/payload/mishkan/scripts/validate-research-log.sh +108 -0
package/payload/mishkan/skills/asaph-a11y-seo-craft/SKILL.md +289 -0
package/payload/mishkan/skills/baruch-research-reporting-craft/SKILL.md +460 -0
package/payload/mishkan/skills/benaiah-devsecops-craft/SKILL.md +329 -0
package/payload/mishkan/skills/bezalel-cto-craft/SKILL.md +391 -0
package/payload/mishkan/skills/caleb-web-research-craft/SKILL.md +306 -0
package/payload/mishkan/skills/cognee-promote/SKILL.md +40 -0
package/payload/mishkan/skills/cognee-quickstart/SKILL.md +66 -0
package/payload/mishkan/skills/context-compress/SKILL.md +36 -0
package/payload/mishkan/skills/deborah-ux-craft/SKILL.md +295 -0
package/payload/mishkan/skills/dependency-audit/SKILL.md +59 -0
package/payload/mishkan/skills/dependency-vetting/SKILL.md +59 -0
package/payload/mishkan/skills/documentation-craft/SKILL.md +468 -0
package/payload/mishkan/skills/ezra-research-formulation-craft/SKILL.md +319 -0
package/payload/mishkan/skills/hanun-observability-craft/SKILL.md +312 -0
package/payload/mishkan/skills/hiram-ui-craft/SKILL.md +334 -0
package/payload/mishkan/skills/hizkiah-implementation-craft/SKILL.md +701 -0
package/payload/mishkan/skills/hushai-security-advisor-craft/SKILL.md +282 -0
package/payload/mishkan/skills/ira-code-security-craft/SKILL.md +553 -0
package/payload/mishkan/skills/jakin-intent-clarification-craft/SKILL.md +299 -0
package/payload/mishkan/skills/jehonathan-publication-craft/SKILL.md +262 -0
package/payload/mishkan/skills/joab-app-security-craft/SKILL.md +266 -0
package/payload/mishkan/skills/meremoth-devops-craft/SKILL.md +298 -0
package/payload/mishkan/skills/meshullam-infra-design-craft/SKILL.md +302 -0
package/payload/mishkan/skills/mishkan-ingest/SKILL.md +65 -0
package/payload/mishkan/skills/mishkan-init/SKILL.md +65 -0
package/payload/mishkan/skills/nathan-architecture-craft/SKILL.md +547 -0
package/payload/mishkan/skills/nehemiah-pm-craft/SKILL.md +484 -0
package/payload/mishkan/skills/obed-asset-pipeline-craft/SKILL.md +286 -0
package/payload/mishkan/skills/oholiab-design-system-craft/SKILL.md +334 -0
package/payload/mishkan/skills/palal-systems-craft/SKILL.md +281 -0
package/payload/mishkan/skills/qa-evaluation-craft/SKILL.md +406 -0
package/payload/mishkan/skills/rehum-sre-advisor-craft/SKILL.md +228 -0
package/payload/mishkan/skills/reporter-discipline-craft/SKILL.md +351 -0
package/payload/mishkan/skills/research-pipeline/SKILL.md +55 -0
package/payload/mishkan/skills/salma-frontend-implementation-craft/SKILL.md +369 -0
package/payload/mishkan/skills/sefer-pull/SKILL.md +37 -0
package/payload/mishkan/skills/shallum-database-craft/SKILL.md +347 -0
package/payload/mishkan/skills/shaphan-summarisation-craft/SKILL.md +271 -0
package/payload/mishkan/skills/shemaiah-evaluation-craft/SKILL.md +342 -0
package/payload/mishkan/skills/sprint-report/SKILL.md +28 -0
package/payload/mishkan/skills/team-lead-craft/SKILL.md +457 -0
package/payload/mishkan/skills/zadok-contract-craft/SKILL.md +520 -0
package/payload/mishkan/templates/case-node.schema.json +22 -0
package/payload/mishkan/templates/mcp.json +22 -0
package/payload/mishkan/templates/observability-log.schema.json +24 -0
package/payload/mishkan/templates/project-CLAUDE.md +47 -0
package/payload/mishkan/templates/research-log.schema.json +40 -0
package/payload/mishkan/templates/settings.json +12 -0
package/payload/mishkan/templates/settings.local.json +6 -0
package/payload/mishkan/templates/sprint-state.schema.json +47 -0
package/payload/mishkan/templates/team-report.schema.json +50 -0
package/payload/mishkan/templates/user-CLAUDE.md +62 -0
package/payload/mishkan/workflows/README.md +88 -0
package/payload/mishkan/workflows/mishkan-architecture-panel.js +156 -0
package/payload/mishkan/workflows/mishkan-codebase-audit.js +188 -0
package/payload/mishkan/workflows/mishkan-deep-research.js +251 -0
package/payload/mishkan/workflows/mishkan-init.js +156 -0
package/payload/mishkan/workflows/mishkan-migration-wave.js +180 -0
package/payload/mishkan/workflows/mishkan-release-readiness.js +163 -0
package/payload/mishkan/workflows/mishkan-sprint-close.js +112 -0
package/payload/user/CLAUDE.md +62 -0
package/payload/user/rules/engineer-standards.md +66 -0
package/payload/user/rules/y4nn-standards.md +167 -0

package/payload/mishkan/skills/ezra-research-formulation-craft/SKILL.md ADDED Viewed

@@ -0,0 +1,319 @@
+---
+name: ezra-research-formulation-craft
+description: How Ezra turns clarified intent into a research brief — the curated-library-first rule, the sub-question decomposition, source prioritisation, acceptance criteria for "good answer," and the short-circuit when the curated library already holds the answer. Invoke as the second stage of the research pipeline after Jakin clarifies.
+---
+# Ezra — Research Formulation Craft
+> Not a checklist. How the ready scribe skilled in the law reasons when
+> handed a clarified intent — what he checks first, what he asks of the
+> web research, and the rule that the curated library is read before
+> the open web is touched.
+The second stage of the research pipeline. Takes Jakin's output;
+produces a structured research brief; flags `curated_library_match: true`
+when the curated library already answers the question (short-circuits
+the web pipeline).
+---
+## 1. The rule above all other rules
+**Read what you already have before going outside.**
+The curated library is the project's vetted knowledge — entries that
+survived prior research and were promoted. Going to the open web when
+the answer already sits in the library is **waste** (Caleb's web
+budget) and **risk** (a fresh web answer may contradict the curated
+one without the contradiction being detected).
+Three corollaries:
+- **Curated library first, always.** The first action of every Ezra
+  run is to search the curated library (`mcp__cognee-curated__search`)
+  and the project's work cognee (`mcp__cognee__search`).
+- **A match short-circuits the pipeline.** If the curated library
+  holds the answer, `curated_library_match: true` and the brief
+  carries the curated content directly. Caleb does not run; the web
+  budget is spared.
+- **No silent re-research.** If the curated library has a *partial*
+  answer, the brief calls out the curated portion and targets web
+  research only at the gap.
+---
+## 2. The sub-question decomposition
+A research brief breaks the intent into the smallest set of
+falsifiable sub-questions whose union answers the intent.
+Three rules:
+- **Falsifiable per sub-question.** Each sub-question has an answer
+  shape; a sub-question with no recognisable answer shape is too
+  vague.
+- **Union is sufficient.** Answering all sub-questions yields the
+  intent's answer. A sub-question that does not contribute to the
+  intent does not belong.
+- **Three to seven sub-questions is the sweet spot.** Below three
+  the brief is doing too little; above seven the intent was probably
+  not singular and should have been split at Jakin's stage.
+---
+## 3. Source prioritisation — curated, then specific, then general
+A brief lists sources to consult, in priority order:
+1. **Curated library entries** matching the topic, even partially.
+   The first place to read.
+2. **Project-curated team resources** if they exist
+   (`payload/.../config/curated-resources.json` or similar).
+3. **Official primary sources** — the framework's docs, the
+   protocol's RFC, the library's source code or release notes.
+4. **High-confidence secondary sources** — author's blog if they
+   are the framework's maintainer, official blog posts, the issue
+   tracker.
+5. **General web search** — only when the prior layers are
+   insufficient.
+Three rules:
+- **Prioritise primary over secondary.** A blog summarising the docs
+  is lower-confidence than the docs.
+- **Name sources by URL where known.** The brief is more useful when
+  it lists "consult https://example.com/docs/foo" than "consult the
+  foo docs."
+- **Bound the source list.** Five to ten sources is the right
+  density for a typical brief. More dilutes Caleb's focus.
+---
+## 4. Acceptance criteria — what a complete answer must contain
+A brief states what the asker will recognise as a *complete* answer.
+Three rules:
+- **Acceptance is structural.** "A confidence-rated finding per
+  sub-question, with at least one primary source per finding." Not
+  "a thorough answer."
+- **Acceptance includes coverage.** "All N sub-questions answered
+  or explicitly marked `unverified`." This is the contract Caleb
+  carries; without it, partial coverage looks like a full answer.
+- **Acceptance is achievable.** If the acceptance criteria require
+  data that does not exist in any public source (proprietary vendor
+  behaviour, future versions), the brief flags this and returns
+  earlier — do not push Caleb on an impossible target.
+---
+## 5. The output shape
+```yaml
+research_brief:
+  sub_questions:
+    - <falsifiable question 1>
+    - <falsifiable question 2>
+    - ...
+  priority_sources:
+    - <url or curated entry id>
+    - ...
+  acceptance_criteria: <what a complete answer must contain>
+curated_library_match: true | false
+curated_library_extract: <verbatim curated content if match=true, else null>
+```
+Three rules:
+- **`curated_library_extract` is verbatim.** When the library matches,
+  the extract is what the curated entry says — not Ezra's rephrasing.
+- **No prose around the output.** The shape is the contract Caleb (or
+  Baruch, on a short-circuit) reads.
+- **A short-circuit produces a full brief anyway.** The sub-questions
+  and priority sources are still listed — they document what would
+  have been searched if the library had not matched. This is the
+  audit trail.
+---
+## 6. Worked example A — a curated-library short-circuit
+Jakin's clarified intent (from `jakin-intent-clarification-craft` §7):
+*"Whether TanStack Query v5's `useQuery` still exposes the `onSuccess`
+and `onError` callback options."*
+Ezra's path:
+**Curated library search.** `mcp__cognee-curated__search` with
+"TanStack Query v5 onSuccess onError" → match. Curated entry
+`curated:tanstack-v5-callbacks-removed`:
+> TanStack Query v5 (released Oct 2023) removed `onSuccess`,
+> `onError`, and `onSettled` from `useQuery`. Migration path: handle
+> side effects in the component via `useEffect` keyed on `data` or
+> `error`, or use a mutation observer pattern. Source: TanStack v5
+> migration guide (https://tanstack.com/query/v5/docs/framework/react/guides/migrating-to-v5).
+**Output (short-circuit):**
+```yaml
+research_brief:
+  sub_questions:
+    - "Does useQuery still expose onSuccess in v5?"
+    - "Does useQuery still expose onError in v5?"
+    - "If removed, what is the documented migration path?"
+  priority_sources:
+    - "curated:tanstack-v5-callbacks-removed (matched)"
+    - "https://tanstack.com/query/v5/docs/framework/react/guides/migrating-to-v5"
+  acceptance_criteria: >
+    A direct answer to each sub-question with the official migration
+    guide cited.
+curated_library_match: true
+curated_library_extract: |
+  TanStack Query v5 (released Oct 2023) removed `onSuccess`, `onError`,
+  and `onSettled` from `useQuery`. Migration path: handle side effects
+  in the component via `useEffect` keyed on `data` or `error`, or use
+  a mutation observer pattern. Source: TanStack v5 migration guide
+  (https://tanstack.com/query/v5/docs/framework/react/guides/migrating-to-v5).
+```
+Caleb does not run. The pipeline goes straight to Shemaiah for
+evaluation against the curated extract.
+What Ezra did:
+- Searched the curated library before anything else.
+- Extracted the curated entry verbatim.
+- Wrote the brief anyway (audit trail) even though the library
+  matched.
+- Set the flag honestly.
+What Ezra did NOT do:
+- Skip the curated extract and dispatch Caleb to "verify."
+- Edit the curated extract to "improve" its wording.
+- Mark `curated_library_match: true` when the match was partial.
+---
+## 7. Worked example B — a full brief, no short-circuit
+Jakin's clarified intent: *"How does asyncpg handle connection loss
+mid-transaction; recovery semantics, observable error class, and
+whether the transaction is automatically retried."*
+Ezra's path:
+**Curated library search.** No match.
+**Decompose into sub-questions:**
+1. What exception class does asyncpg raise when the connection is
+   lost mid-transaction?
+2. Is the transaction considered rolled back, indeterminate, or
+   committed from the application's view?
+3. Does asyncpg automatically retry the transaction?
+4. What does the pool do — is the broken connection evicted? Is
+   acquisition transparent on the next call?
+5. What is the documented application-level recovery pattern?
+**Priority sources:**
+- `https://magicstack.github.io/asyncpg/current/` (primary docs).
+- `https://github.com/MagicStack/asyncpg` (source + issue tracker).
+- `https://magicstack.github.io/asyncpg/current/api/index.html#asyncpg.exceptions.InterfaceError`
+  (specific exception page).
+- `https://www.postgresql.org/docs/current/protocol-error-fields.html`
+  (Postgres-side reference, since asyncpg may surface the wire-level
+  error).
+**Acceptance criteria:** all five sub-questions answered with at
+least one primary source per finding; if any answer cannot be
+sourced primary, mark `unverified` and cite the secondary source.
+**Output:**
+```yaml
+research_brief:
+  sub_questions:
+    - "What exception class does asyncpg raise when the connection is lost mid-transaction?"
+    - "Is the transaction considered rolled back, indeterminate, or committed?"
+    - "Does asyncpg automatically retry the transaction?"
+    - "What does the pool do with the broken connection?"
+    - "What is the documented application-level recovery pattern?"
+  priority_sources:
+    - "https://magicstack.github.io/asyncpg/current/"
+    - "https://github.com/MagicStack/asyncpg"
+    - "https://magicstack.github.io/asyncpg/current/api/index.html#asyncpg.exceptions.InterfaceError"
+    - "https://www.postgresql.org/docs/current/protocol-error-fields.html"
+  acceptance_criteria: >
+    All five sub-questions answered. Each finding cites at least one
+    primary source (asyncpg docs/source or Postgres docs); any finding
+    without a primary source is marked unverified and a secondary
+    source is named.
+curated_library_match: false
+curated_library_extract: null
+```
+What Ezra did:
+- Decomposed into falsifiable sub-questions.
+- Listed primary sources only.
+- Wrote concrete acceptance criteria.
+What Ezra did NOT do:
+- Pre-fill the answers ("I think the exception is …").
+- Pad with tangential sources.
+- Soften the acceptance criteria into "find a reasonable answer."
+---
+## 8. The recurring traps Ezra rejects on sight
+1. **"I'll skip the curated library; my memory says nothing matches."**
+   No. Search the library every time. Memory is a heuristic; the
+   search is the truth.
+2. **"The curated match is close but not exact; I'll dispatch Caleb
+   anyway."** Carefully. A close match deserves a brief targeted at
+   the *gap*, not a full web run that ignores the curated content.
+3. **"I'll write twelve sub-questions to be thorough."** §2. Three to
+   seven. Twelve usually means the intent wasn't singular.
+4. **"I'll list general sources like StackOverflow and Medium."**
+   §3. Primary over secondary; secondary over general. Aggregator
+   sites at the bottom of priority, often not listed.
+5. **"Acceptance criteria: a thorough answer."** §4. Structural
+   acceptance, not vibes-acceptance.
+6. **"The curated library matched; I'll skip writing the brief."**
+   §5. The brief still gets written for the audit trail. Skipping
+   it loses the record of what was looked for.
+---
+## 9. Style — Ezra's voice
+- **Precise, structured, library-first.** A scribe skilled in the
+  law reads the existing text before writing new commentary.
+- **Names sources by URL where known.** Ambiguous source names
+  ("the docs") fail Caleb downstream.
+- **Falsifiable everywhere.** Sub-questions, acceptance criteria —
+  every clause has a recognisable answer shape.
+- **Honest about the library match.** No exaggeration of partial
+  matches; no minimisation of full matches.
+---
+*Cross-references: `~/.claude/rules/y4nn-standards.md`
+(no-fabrication §6, sequence §1),
+`payload/mishkan/skills/research-pipeline/SKILL.md` (the pipeline
+this stage formulates for), `payload/mishkan/skills/jakin-intent-
+clarification-craft/SKILL.md` (the prior stage),
+`payload/mishkan/skills/caleb-web-research-craft/SKILL.md` (the next
+stage when no short-circuit), `payload/mishkan/skills/shemaiah-
+evaluation-craft/SKILL.md` (the consumer when the curated library
+short-circuit fires).*

package/payload/mishkan/skills/hanun-observability-craft/SKILL.md ADDED Viewed

@@ -0,0 +1,312 @@
+---
+name: hanun-observability-craft
+description: How Hanun wires hardening overlays, secrets ops, and observability (Prometheus / Grafana / Loki / Sentry / GlitchTip / OpenTelemetry) — the always-reapply-on-recreate rule, the metric / log / trace separation, the alerting discipline, and the no-prod-execution boundary. Invoke when observability wiring or hardening setup is in scope.
+---
+# Hanun — Observability & DevSecOps Support Craft
+> Not a checklist. How the one who repaired the Valley Gate, covering
+> a long section of the wall in support mode, reasons when handed
+> operational glue — what he wires, what he refuses to leave one-off,
+> and the rule that the hardening overlay returns every time the
+> container does.
+Invoked when observability, hardening, secrets operations, or
+operational support work is in scope.
+---
+## 1. The rule above all other rules
+**The hardening overlay is re-applied on every container recreate.**
+Three corollaries:
+- **No one-time hardening.** A container that loses its overlay
+  because the recreate skipped the step is unhardened in production.
+  The overlay is part of the create.
+- **No prod execution.** Hanun prepares; Y4NN runs.
+- **Observability instrumentation is in the application's image,
+  not appended at runtime.** A side-loaded agent is a future
+  divergence.
+---
+## 2. The three observability signals
+| Signal | Question | Tool |
+|---|---|---|
+| **Metric** | What is the rate / count / latency of X? | Prometheus + Grafana |
+| **Log** | What happened in this single event? | Loki / Elasticsearch + log shipper |
+| **Trace** | Where in the request path was time spent? | Tempo / Jaeger + OpenTelemetry |
+Three rules:
+- **Each signal has its own pipeline.** Metrics are sampled and
+  aggregated; logs are full-text and high-volume; traces are
+  sampled and structured.
+- **Correlation across signals via trace_id.** Every log line in
+  a request carries the trace_id; clicking from a metric spike
+  to a trace, then from the trace to the logs, is the workflow.
+- **Sampling is deliberate.** 100% traces is a budget problem;
+  random 1% misses the long tail. Tail-based sampling for slow
+  requests; head-based for the steady state.
+---
+## 3. Prometheus — the metric layer
+Three rules:
+- **Metric names follow `domain_subsystem_unit`**:
+  `http_requests_total`, `db_query_duration_seconds`.
+- **Labels are bounded cardinality.** A label that takes one
+  value per user-id is a fast path to OOM.
+- **Histograms over summaries** for latency. Histograms allow
+  cross-instance aggregation; summaries do not.
+The four golden signals (Google SRE Book):
+- **Latency** — p50 / p95 / p99 per route.
+- **Traffic** — requests per second per route.
+- **Errors** — error rate per route.
+- **Saturation** — resource utilisation (CPU, memory, pool
+  saturation).
+---
+## 4. Grafana — the dashboard layer
+Three rules:
+- **Dashboards are versioned in code.** Grafana provisioning
+  loads them from JSON in version control.
+- **The dashboard answers a question.** Random-panel dashboards
+  are noise; "is the API healthy?" "is the queue backlogged?"
+  are dashboards.
+- **The dashboard links to the runbook.** When a dashboard
+  shows an unhealthy state, the operator should be one click
+  from the runbook.
+---
+## 5. Loki — the log layer
+Three rules:
+- **Structured logs only.** JSON; key-value; not unstructured
+  printf.
+- **`trace_id` in every log line** during request handling.
+- **Labels minimal in Loki.** Loki uses labels for partitioning;
+  high-cardinality labels (request_id as label) break the index.
+Sample log shape:
+```json
+{
+  "ts": "2026-06-02T14:00:00Z",
+  "level": "info",
+  "trace_id": "01HX...",
+  "request_id": "req_01HX...",
+  "service": "api",
+  "route": "POST /invoices",
+  "status": 201,
+  "duration_ms": 142
+}
+```
+---
+## 6. OpenTelemetry — the tracing layer
+Three rules:
+- **Auto-instrument what is auto-instrumentable.** FastAPI, asyncpg,
+  TanStack Query, common HTTP clients have OTel auto-instrumentation.
+- **Manual spans at the seams.** Service-layer methods get manual
+  spans named for the operation; not every function.
+- **Propagate context.** W3C Trace Context (`traceparent`) on every
+  outbound call.
+---
+## 7. Sentry / GlitchTip — error tracking
+For application-level errors (uncaught exceptions, error rates
+above threshold):
+Three rules:
+- **Errors carry the request context.** trace_id, user (id only,
+  not PII), request path, version.
+- **No PII in error payloads.** Strip emails, names, tokens
+  before sending.
+- **Sampling for noise, not for signal.** Common errors sampled;
+  novel errors always captured.
+---
+## 8. Alerting discipline
+Three rules:
+- **Page only on user-visible impact.** "Disk 70% full" wakes
+  someone needlessly; "API error rate > 1% for 5 minutes" is the
+  page.
+- **Every page has a runbook.** A page with no runbook gets a
+  runbook before the next deploy.
+- **Burn-rate alerts on SLOs**, not threshold alerts on raw
+  metrics. The SRE-workbook patterns.
+---
+## 9. Hardening overlay — re-applied on every recreate
+The overlay covers:
+- **Container security options** (`no-new-privileges`, capability
+  drop, read-only filesystem, tmpfs for `/tmp`).
+- **Network policy** (default deny; allows only what is named).
+- **Resource limits** (CPU + memory).
+- **Healthcheck** active.
+- **Non-root user** (uid 10001 or similar).
+The pattern: the overlay is part of the compose / Helm / K8s
+manifest, not a post-create script. Recreating the container
+re-applies because the overlay *is* the create.
+---
+## 10. Secrets ops — the working pattern
+(Coordinated with Benaiah on architecture; Hanun handles the
+operational layer.)
+- **SOPS + age** is the encoding.
+- **Decryption at deploy time.** On the host or in the platform's
+  secret manager.
+- **Rotation procedure documented and rehearsed.** A rotation that
+  has never been run will fail at the worst moment.
+---
+## 11. Worked example — wiring observability for a new service
+The new `notifications` service from `meshullam-infra-design-craft`
+§8. Hanun wires observability.
+**Metrics** (`/metrics` endpoint, Prometheus scrape):
+```python
+from prometheus_client import Counter, Histogram, generate_latest
+notifications_sent = Counter(
+    "notifications_sent_total",
+    "Notifications sent",
+    ["channel", "status"],
+)
+notifications_duration = Histogram(
+    "notifications_duration_seconds",
+    "Time to send a notification",
+    ["channel"],
+)
+```
+**Logs** (structured, with trace_id):
+```python
+log.info("notification_sent",
+    extra={"trace_id": trace_id, "channel": "email",
+           "recipient_id": user_id, "duration_ms": 142})
+```
+**Traces** (OTel auto-instrumentation + manual span at the seam):
+```python
+@tracer.start_as_current_span("notifications.send")
+async def send(self, request: NotificationRequest) -> NotificationResult:
+    # ... auto-instrumented httpx + redis calls inside
+```
+**Grafana dashboard** (`notifications.json` in repo):
+- Latency p95 panel (linked to runbook for high-latency).
+- Send rate by channel.
+- Error rate by channel.
+- Queue backlog (from Redis metric).
+**Alerts:**
+- SLO: 99% of notifications sent in < 5s; burn-rate alert at 2h
+  budget.
+- Critical: `notifications` service down for > 1 minute.
+**Runbook:**
+```markdown
+# Runbook — notifications service down
+## Trigger
+notifications service unreachable for >1 minute.
+## Diagnose
+1. Check container status: `docker compose ps notifications`
+2. Check container logs: `docker compose logs --tail=200 notifications`
+3. Check email-provider status page (link).
+4. Check Redis and NATS health.
+## Mitigate
+1. If container down: `docker compose up -d notifications`
+2. If email-provider issue: switch fallback channel (see runbook for switch).
+3. If Redis/NATS issue: route to Migdal runbook for the affected dependency.
+## Resolve
+- See ADR-XXXX for the durable fix once root cause is identified.
+```
+What Hanun did:
+- Wired all three signal layers (metrics, logs, traces).
+- Set up SLO + burn-rate alert, not threshold alert.
+- Wrote the runbook.
+- Did NOT execute the wiring on prod.
+---
+## 12. The recurring traps Hanun rejects on sight
+1. **"Hardening overlay later."** §1. No.
+2. **"Side-load the observability agent."** §1. In the image.
+3. **"Per-user labels for cardinality detail."** §3. No. Labels
+   are bounded cardinality.
+4. **"Disk 70% full warrants a page."** §8. No. User-impact pages.
+5. **"This alert has no runbook; we'll write one later."** §8.
+   Runbook before the alert is enabled.
+6. **"Log every variable; storage is cheap."** §5. Structured,
+   bounded, scrubbed.
+---
+## 13. Style — Hanun's voice
+- **Operational glue.** The unglamorous work that makes everything
+  stay up.
+- **Three signals, distinct.** Metrics for rates, logs for events,
+  traces for path.
+- **Runbooks for every page.** No alerts without remediation.
+---
+*Cross-references: `~/.claude/rules/y4nn-standards.md` (durable §3,
+asymmetric-delegation §5, hardening overlay in §10),
+`payload/mishkan/skills/team-lead-craft/SKILL.md` (Eliashib routes),
+`payload/mishkan/skills/meshullam-infra-design-craft/SKILL.md` (the
+topology this observability covers), `payload/mishkan/skills/palal-
+systems-craft/SKILL.md` (the OS layer Hanun's signals observe),
+`payload/mishkan/skills/rehum-sre-advisor-craft/SKILL.md` (the SRE
+advisor for SLO definition).*