codex-genesis-harness 0.1.6 → 0.1.8

Files changed (136)
  1. package/.codebase/COMPRESSED_CONTEXT.md +80 -0
  2. package/.codebase/CURRENT_STATE.md +35 -8
  3. package/.codebase/DEPENDENCY_GRAPH.md +14 -1
  4. package/.codebase/IMPLEMENTATION_HANDOFF.md +34 -336
  5. package/.codebase/KNOWN_PROBLEMS.md +54 -3
  6. package/.codebase/MODULE_INDEX.md +8 -0
  7. package/.codebase/PIPELINE_FLOW.md +7 -5
  8. package/.codebase/RECOVERY_POINTS.md +15 -431
  9. package/.codebase/TECH_DEBT.md +6 -0
  10. package/.codebase/TEST_MATRIX.md +4 -3
  11. package/.codebase/VISUAL_GRAPH.md +127 -0
  12. package/.codebase/beads.json +16 -0
  13. package/.codebase/context-policy.json +68 -0
  14. package/.codebase/memories/lessons_learned.md +21 -0
  15. package/.codebase/memories/preferences.md +17 -0
  16. package/.codebase/state.json +45 -24
  17. package/.codex/skills/genesis-ai-provider/SKILL.md +1 -1
  18. package/.codex/skills/genesis-api-contract/SKILL.md +1 -1
  19. package/.codex/skills/genesis-api-sync/SKILL.md +1 -1
  20. package/.codex/skills/genesis-architecture/SKILL.md +6 -1
  21. package/.codex/skills/genesis-codebase-map/SKILL.md +1 -1
  22. package/.codex/skills/genesis-debug-guide/SKILL.md +11 -5
  23. package/.codex/skills/genesis-design-spec/SKILL.md +3 -3
  24. package/.codex/skills/genesis-docs-automation/SKILL.md +52 -973
  25. package/.codex/skills/genesis-executing-plans/SKILL.md +54 -0
  26. package/.codex/skills/genesis-executing-plans/agents/openai.yaml +6 -0
  27. package/.codex/skills/genesis-executing-plans/checklists/.gitkeep +0 -0
  28. package/.codex/skills/genesis-executing-plans/examples/.gitkeep +0 -0
  29. package/.codex/skills/genesis-executing-plans/templates/.gitkeep +0 -0
  30. package/.codex/skills/genesis-harness/SKILL.md +64 -1384
  31. package/.codex/skills/genesis-harness/scripts/check-docs-sync.sh +3 -3
  32. package/.codex/skills/genesis-harness/scripts/init-planning.sh +1 -1
  33. package/.codex/skills/genesis-harness-engineering/SKILL.md +1 -1
  34. package/.codex/skills/genesis-new-design/SKILL.md +6 -2
  35. package/.codex/skills/genesis-new-design/agents/openai.yaml +2 -0
  36. package/.codex/skills/genesis-observability-automation/SKILL.md +69 -303
  37. package/.codex/skills/genesis-observability-automation/references/common-mistakes-and-recovery.md +84 -0
  38. package/.codex/skills/genesis-observability-automation/references/workflow-phases.md +78 -0
  39. package/.codex/skills/genesis-performance-profiling/SKILL.md +1 -22
  40. package/.codex/skills/genesis-performance-profiling/agents/openai.yaml +1 -1
  41. package/.codex/skills/genesis-pipeline-orchestration/SKILL.md +1 -1
  42. package/.codex/skills/genesis-planning/SKILL.md +31 -1
  43. package/.codex/skills/genesis-release/SKILL.md +29 -1
  44. package/.codex/skills/genesis-research-first/SKILL.md +6 -0
  45. package/.codex/skills/genesis-spec-propagation/SKILL.md +52 -504
  46. package/.codex/skills/genesis-test-driven-development/SKILL.md +55 -0
  47. package/.codex/skills/genesis-test-driven-development/agents/openai.yaml +6 -0
  48. package/.codex/skills/genesis-test-driven-development/checklists/.gitkeep +0 -0
  49. package/.codex/skills/genesis-test-driven-development/examples/.gitkeep +0 -0
  50. package/.codex/skills/genesis-test-driven-development/templates/.gitkeep +0 -0
  51. package/.codex/skills/{ui-ux-test-skill → genesis-ui-ux-test}/SKILL.md +1 -1
  52. package/.codex/skills/genesis-upgrade-design/SKILL.md +4 -2
  53. package/.codex/skills/genesis-upgrade-design/agents/openai.yaml +2 -0
  54. package/.codex/skills/genesis-using-git-worktrees/SKILL.md +54 -0
  55. package/.codex/skills/genesis-using-git-worktrees/agents/openai.yaml +6 -0
  56. package/.codex/skills/genesis-using-git-worktrees/checklists/.gitkeep +0 -0
  57. package/.codex/skills/genesis-using-git-worktrees/examples/.gitkeep +0 -0
  58. package/.codex/skills/genesis-using-git-worktrees/templates/.gitkeep +0 -0
  59. package/.codex/skills/genesis-verification-before-completion/SKILL.md +53 -0
  60. package/.codex/skills/genesis-verification-before-completion/agents/openai.yaml +6 -0
  61. package/.codex/skills/genesis-verification-before-completion/checklists/.gitkeep +0 -0
  62. package/.codex/skills/genesis-verification-before-completion/examples/.gitkeep +0 -0
  63. package/.codex/skills/genesis-verification-before-completion/templates/.gitkeep +0 -0
  64. package/.codex/skills/spec-impact-engine/SKILL.md +77 -500
  65. package/.codex/skills/spec-impact-engine/checklists/checklist.md +10 -0
  66. package/.codex-plugin/plugin.json +3 -4
  67. package/CHANGELOG.md +17 -0
  68. package/README.EN.md +33 -22
  69. package/README.VI.md +36 -24
  70. package/README.md +46 -8
  71. package/VERSION +1 -1
  72. package/bin/genesis-harness.js +1337 -7
  73. package/contracts/features/registry-schema.json +15 -0
  74. package/contracts/observability/agent-run-schema.json +34 -0
  75. package/contracts/observability/failure-schema.json +35 -0
  76. package/contracts/ui/auth/login-screen-contract.json +43 -0
  77. package/features/REGISTRY.md +63 -0
  78. package/features/SCOPE-template.md +65 -0
  79. package/fixtures/planning/MOCKUP_PROMPT_TEMPLATE.md +16 -0
  80. package/observability/agent-runs/sample-run.json +13 -0
  81. package/observability/decision-logs/sample-decision.md +43 -0
  82. package/observability/failures/sample-failure.json +12 -0
  83. package/package.json +9 -3
  84. package/playwright/e2e/app-template.spec.js +37 -0
  85. package/playwright/e2e/auth/login-screen.spec.js +65 -0
  86. package/playwright/e2e/web-template.spec.js +28 -0
  87. package/scripts/check-scope.sh +100 -0
  88. package/scripts/cold-start-check.js +133 -0
  89. package/scripts/install.sh +6 -6
  90. package/scripts/prompt_sentinel.js +35 -4
  91. package/scripts/run-evals.sh +137 -26
  92. package/scripts/scratch_parser.js +49 -0
  93. package/scripts/spec_visual_sync.js +1 -1
  94. package/scripts/test_generator.js +2 -2
  95. package/scripts/uninstall.sh +6 -6
  96. package/scripts/verify.sh +21 -66
  97. package/tests/integration/cli-smoke.test.js +103 -0
  98. package/tests/unit/feature_registry.test.js +152 -0
  99. package/tests/unit/prompt_sentinel.test.js +1 -1
  100. package/tests/unit/spec_visual_sync.test.js +1 -1
  101. package/tests/unit/test_generator.test.js +1 -1
  102. package/.codex/skills/genesis-docs/SKILL.md +0 -46
  103. package/.codex/skills/genesis-docs/agents/openai.yaml +0 -7
  104. package/.codex/skills/genesis-mvp-planning/SKILL.md +0 -114
  105. package/.codex/skills/genesis-mvp-planning/agents/openai.yaml +0 -6
  106. package/.codex/skills/genesis-release-orchestration/SKILL.md +0 -653
  107. package/.codex/skills/genesis-release-orchestration/agents/openai.yaml +0 -7
  108. package/.codex/skills/genesis-research/SKILL.md +0 -46
  109. package/.codex/skills/genesis-research/agents/openai.yaml +0 -7
  110. package/playwright/e2e/e2e-template.md +0 -4
  111. /package/.codex/skills/{genesis-docs/checklists/checklist.md → genesis-docs-automation/checklists/manual-docs-checklist.md} +0 -0
  112. /package/.codex/skills/{genesis-docs/examples/example.md → genesis-docs-automation/examples/manual-docs-example.md} +0 -0
  113. /package/.codex/skills/{genesis-docs → genesis-docs-automation}/templates/docs-update-template.md +0 -0
  114. /package/.codex/skills/{genesis-state-machine/SKILL.md → genesis-harness/references/state-machine.md} +0 -0
  115. /package/.codex/skills/{genesis-mvp-planning → genesis-planning}/checklists/mvp-readiness.md +0 -0
  116. /package/.codex/skills/{genesis-mvp-planning → genesis-planning}/examples/5-phase-roadmap-example.md +0 -0
  117. /package/.codex/skills/{genesis-mvp-planning → genesis-planning}/templates/phase-1-core.md +0 -0
  118. /package/.codex/skills/{genesis-mvp-planning → genesis-planning}/templates/phase-2-auth.md +0 -0
  119. /package/.codex/skills/{genesis-mvp-planning → genesis-planning}/templates/phase-3-features.md +0 -0
  120. /package/.codex/skills/{genesis-mvp-planning → genesis-planning}/templates/phase-4-integrations.md +0 -0
  121. /package/.codex/skills/{genesis-mvp-planning → genesis-planning}/templates/phase-5-readiness.md +0 -0
  122. /package/.codex/skills/{genesis-release-orchestration → genesis-release}/checklists/post-deployment-verification.md +0 -0
  123. /package/.codex/skills/{genesis-release-orchestration → genesis-release}/checklists/pre-release-validation.md +0 -0
  124. /package/.codex/skills/{genesis-release-orchestration/examples/example.md → genesis-release/examples/orchestration-example.md} +0 -0
  125. /package/.codex/skills/{genesis-release-orchestration → genesis-release}/observability/release-tracking.md +0 -0
  126. /package/.codex/skills/{genesis-release-orchestration → genesis-release}/playbooks/canary-deployment-orchestration.md +0 -0
  127. /package/.codex/skills/{genesis-release-orchestration → genesis-release}/playbooks/semantic-versioning-automation.md +0 -0
  128. /package/.codex/skills/{genesis-release-orchestration → genesis-release}/templates/deployment-strategy-template.md +0 -0
  129. /package/.codex/skills/{genesis-release-orchestration → genesis-release}/templates/release-runbook-template.md +0 -0
  130. /package/.codex/skills/{genesis-research → genesis-research-first}/checklists/checklist.md +0 -0
  131. /package/.codex/skills/{genesis-research/examples/example.md → genesis-research-first/examples/manual-research-example.md} +0 -0
  132. /package/.codex/skills/{genesis-research → genesis-research-first}/templates/research-note-template.md +0 -0
  133. /package/.codex/skills/{ui-ux-test-skill → genesis-ui-ux-test}/agents/openai.yaml +0 -0
  134. /package/.codex/skills/{ui-ux-test-skill → genesis-ui-ux-test}/checklists/checklist.md +0 -0
  135. /package/.codex/skills/{ui-ux-test-skill → genesis-ui-ux-test}/examples/example.md +0 -0
  136. /package/.codex/skills/{ui-ux-test-skill → genesis-ui-ux-test}/templates/playwright-test-template.md +0 -0
@@ -11,8 +11,9 @@ if ! git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
11
11
  fi
12
12
 
13
13
  changed="$(git diff --name-only HEAD 2>/dev/null || git diff --name-only)"
14
- docs_changed="$(printf '%s\n' "$changed" | grep -E '^(\.planning/|docs/|README\.md|AGENTS\.md)' || true)"
15
- code_changed="$(printf '%s\n' "$changed" | grep -Ev '^(\.planning/|docs/|README\.md|AGENTS\.md)$' || true)"
14
+ docs_pattern='^(\.planning/|\.codebase/|\.codex/SKILLS_INDEX\.md|docs/|README(\.[A-Z]{2})?\.md|AGENTS\.md|CHANGELOG\.md)'
15
+ docs_changed="$(printf '%s\n' "$changed" | grep -E "$docs_pattern" || true)"
16
+ code_changed="$(printf '%s\n' "$changed" | grep -Ev "$docs_pattern" || true)"
16
17
 
17
18
  if [ -n "$code_changed" ] && [ -z "$docs_changed" ]; then
18
19
  echo "Code changed but no planning/docs files changed."
@@ -21,4 +22,3 @@ if [ -n "$code_changed" ] && [ -z "$docs_changed" ]; then
21
22
  fi
22
23
 
23
24
  echo "Docs sync check passed or no code changes detected."
24
-
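The hunk above replaces two hard-coded patterns with a shared `docs_pattern`. As a rough illustration, the sketch below runs that pattern against a hypothetical list of changed paths. The file list is an assumption for demonstration; only the pattern and the two `grep` calls come from the diff.

```shell
#!/usr/bin/env bash
# Pattern copied from the hunk above.
docs_pattern='^(\.planning/|\.codebase/|\.codex/SKILLS_INDEX\.md|docs/|README(\.[A-Z]{2})?\.md|AGENTS\.md|CHANGELOG\.md)'

# Hypothetical changed-file list (stands in for `git diff --name-only`).
changed='README.VI.md
.codebase/state.json
bin/genesis-harness.js
CHANGELOG.md
scripts/verify.sh'

# Same classification as the script: matches are docs, non-matches are code.
docs_changed="$(printf '%s\n' "$changed" | grep -E "$docs_pattern" || true)"
code_changed="$(printf '%s\n' "$changed" | grep -Ev "$docs_pattern" || true)"

printf 'docs:\n%s\n' "$docs_changed"
printf 'code:\n%s\n' "$code_changed"
```

In the old script the `code_changed` pattern ended in `$`, so a path such as `.planning/notes.md` (a hypothetical name) did not match the exclusion and was still counted as code. Using one unanchored pattern for both calls keeps the two lists complementary.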
@@ -3,6 +3,7 @@ set -euo pipefail
3
3
 
4
4
  confirmed="${PROJECT_BRIEF_CONFIRMED:-0}"
5
5
  root="."
6
+ script_source="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
6
7
 
7
8
  while [ "$#" -gt 0 ]; do
8
9
  case "$1" in
@@ -741,7 +742,6 @@ EOF
741
742
  done
742
743
 
743
744
  mkdir -p .planning/scripts
744
- script_source="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
745
745
  for script in "$script_source"/*.sh; do
746
746
  cp "$script" ".planning/scripts/$(basename "$script")"
747
747
  chmod +x ".planning/scripts/$(basename "$script")"
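The two hunks above move the `script_source` assignment from just before the copy loop to the top of the script. The diff does not say why. One effect of resolving the directory early is that the result no longer depends on the working directory at the point of use, which matters if the script is invoked through a relative path and changes directory in between (whether `init-planning.sh` does so is not visible in these hunks). A minimal sketch of that effect, using a hypothetical helper script `where.sh` in a temporary directory:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical layout: a tiny script that resolves its own directory
# once before and once after changing the working directory.
work="$(mktemp -d)"
mkdir -p "$work/tools" "$work/project"
cat > "$work/tools/where.sh" <<'EOF'
#!/usr/bin/env bash
early="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$1"
late="$(cd "$(dirname "${BASH_SOURCE[0]}")" 2>/dev/null && pwd || echo unresolved)"
echo "early=$early"
echo "late=$late"
EOF

# Invoke through a relative path, as a user might from a repository root.
cd "$work"
output="$(bash tools/where.sh project)"
echo "$output"

cd /
rm -rf "$work"
```

The early lookup yields the absolute `tools` directory, while the late lookup fails because the relative `tools/where.sh` path no longer resolves from inside `project`.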
@@ -1,5 +1,5 @@
1
1
  ---
2
- name: harness-engineering-skill
2
+ name: genesis-harness-engineering
3
3
  description: "Evolve the Codex harness itself: verification loops, repository memory, test-first scaffolds, resumability, observability, and autonomous workflow reliability. Use for changes to this repository's skill system or harness architecture."
4
4
  ---
5
5
 
@@ -17,7 +17,7 @@ Use when building a new web page, app screen, dashboard, tool, landing page, or
17
17
  Do not use for redesigning existing UI without first preserving behavior; use `genesis-upgrade-design` instead.
18
18
 
19
19
  ## Inputs required
20
- Product intent, target users, primary workflow, stack details, route or entry point, state list, and visual constraints.
20
+ Product intent, target users, primary workflow, stack details, route or entry point, state list, visual constraints, and the generated visual contract image `mockup.png` (which MUST be loaded via the `view_file` tool to inspect layouts, color choices, and dimensions before coding).
21
21
 
22
22
  ## Outputs required
23
23
  Implemented UI, UI contract, fixtures, responsive states, visual verification, and docs or memory updates.
@@ -55,6 +55,7 @@ If visual output fails, capture screenshot evidence, update the fixture or contr
55
55
 
56
56
  2. Define the design intent from the request:
57
57
  - Identify audience, product category, primary task, density, tone, and constraints.
58
+ - Proactively inspect `mockup.png` using the `view_file` tool to absorb the visual direction, color system, and placement details.
58
59
  - Choose one clear visual direction and commit to it across typography, color, spacing, surfaces, iconography, and motion.
59
60
  - For tools, dashboards, and operational apps, prioritize scanning, repeated use, compact controls, and predictable navigation over decorative hero layouts.
60
61
 
@@ -71,7 +72,10 @@ If visual output fails, capture screenshot evidence, update the fixture or contr
71
72
  5. Verify the result:
72
73
  - Run the project checks that prove the UI compiles.
73
74
  - Start the local dev server when needed.
74
- - Capture screenshots for visual work and inspect desktop/mobile layouts for overlap, clipping, unreadable text, blank canvases, and broken assets.
75
+ - **Automated Visual Verification Loop**:
76
+ 1. Use the `@modelcontextprotocol/server-puppeteer` tool to automatically navigate to the local dev server (e.g., `http://localhost:3000`) and capture a screenshot of the current UI.
77
+ 2. **Target Matching**: Use your Vision capabilities to compare your screenshot against `mockup.png`. Iterate and update the code until the UI matches 100%.
78
+ 3. **Aesthetic Audit**: Analyze your screenshot for aesthetic flaws (alignment, spacing, typography, bad AI-generated slop, inconsistent styles) and autonomously update the code to fix them. Repeat the screenshot capture until the UI looks premium.
75
79
 
76
80
  ## Design Rules
77
81
 
@@ -2,3 +2,5 @@ interface:
2
2
  display_name: "Genesis New Design"
3
3
  short_description: "— Thiết kế giao diện Web hiện đại, cao cấp"
4
4
  default_prompt: "Use $genesis-new-design to design and build a premium frontend web experience."
5
+ policy:
6
+ allow_implicit_invocation: true
@@ -7,123 +7,65 @@ description: "Automate observability architecture, monitoring dashboard config,
7
7
 
8
8
  ## Purpose
9
9
 
10
- The `genesis-observability-automation` skill automates the full lifecycle of observability for software services. It generates observability architecture diagrams (metrics/logs/traces topology), produces monitoring dashboard configurations for Grafana, Datadog, and CloudWatch, creates SLO-based alerting policies with escalation chains, automates health check configuration (readiness/liveness probes and SLA validation), and generates incident response runbooks for P0/P1/P2/P3 severity triage, resolution, and post-mortem.
10
+ Automates the full lifecycle of observability for software services: generates architecture diagrams (metrics/logs/traces topology), monitoring dashboard configs, SLO-based alerting policies, health check configuration, and incident response runbooks.
11
11
 
12
- This skill transforms observability from an afterthought into an engineering discipline. Every phase produces production-ready artifacts that integrate with standard monitoring stacks — no manual dashboard clicking, no ad-hoc alert configuration. Observability is code, version-controlled and reviewed like any other engineering artifact.
13
-
14
- **Core philosophy**: You cannot operate what you cannot see. You cannot respond to incidents you cannot detect. Observability must be designed before production launch, not retrofitted after the first outage. Every service must expose the three pillars (metrics, logs, traces) before it ships.
12
+ **Core philosophy**: You cannot operate what you cannot see. Observability must be designed before production launch. Every service must expose three pillars (metrics, logs, traces) before shipping.
15
13
 
16
14
  ---
17
15
 
18
16
  ## When to use
19
17
 
20
- Use `genesis-observability-automation` when:
21
-
22
- - A new service is approaching production and needs observability infrastructure before launch. Run this skill as part of the production readiness checklist.
23
- - An existing service is suffering repeated incidents due to lack of visibility (team finds out about problems from users, not monitors).
24
- - You are migrating monitoring stacks (e.g., from custom scripts to Prometheus + Grafana, or from on-prem to Datadog).
25
- - A post-mortem action item is "we need better monitoring" or "we need runbooks" — this skill produces both.
26
- - Sprint planning includes observability-related tickets (dashboard, alert, runbook) and you need to generate them efficiently.
27
- - An SRE or on-call rotation is being established and needs standard runbooks and escalation chains.
28
- - An audit or compliance review requires documented incident response procedures.
29
- - You need to validate that an existing service's observability meets a defined maturity level before certifying it as production-ready.
30
- - A new team member is joining on-call and needs structured runbooks to operate the service safely.
18
+ - A new service approaching production needs observability before launch.
19
+ - A service suffering repeated incidents due to lack of visibility.
20
+ - Migrating monitoring stacks (e.g., from scripts to Prometheus + Grafana).
21
+ - Post-mortem action items include "we need better monitoring" or "we need runbooks."
22
+ - Sprint planning includes observability tickets (dashboard, alert, runbook).
23
+ - An SRE or on-call rotation is being established and needs standard runbooks.
31
24
 
32
25
  ---
33
26
 
34
27
  ## When NOT to use
35
28
 
36
- Do NOT use `genesis-observability-automation` when:
37
-
38
- - The service is a prototype or demo that will never go to production. Observability infrastructure has a maintenance cost; do not invest it in throwaway code.
39
- - You only need a quick manual alert on a single metric. Use the monitoring tool's UI directly for one-off alerts.
40
- - The service already has mature, well-maintained observability and you only need to add one metric. Add the metric directly rather than regenerating the full architecture.
41
- - You need to diagnose a currently active incident. Use the existing runbooks and monitoring tools. This skill generates runbooks — it does not replace them during an emergency.
42
- - The monitoring stack has not been decided yet. Run `genesis-planning` first to select the monitoring stack, then return to this skill.
43
- - You need network-layer observability (packet capture, flow logs) — this skill covers application-layer observability. Use a dedicated network observability tool (e.g., Wireshark, VPC flow logs) for network-layer issues.
29
+ - Prototype or demo that will never go to production.
30
+ - Quick manual alert on a single metric (use the monitoring tool UI directly).
31
+ - Service already has mature observability and only needs one added metric.
32
+ - During an active incident (use existing runbooks; this skill generates runbooks, it does not replace them).
33
+ - Monitoring stack not decided yet (run `genesis-planning` first).
44
34
 
45
35
  ---
46
36
 
47
37
  ## Inputs required
48
38
 
49
- Before invoking this skill, gather or confirm the following inputs:
50
-
51
- ### Service inputs
52
- - **Service name**: The canonical name of the service (used in all generated config names, e.g., `users-api`).
53
- - **Service language/runtime**: Node.js, Python, Go, Java (determines which instrumentation libraries to include).
54
- - **Service type**: REST API, gRPC service, background worker, streaming service, batch job (determines which RED metrics apply).
55
- - **Service endpoints or operations**: Complete list of endpoints/operations to monitor (with HTTP methods if applicable).
56
- - **Deployment platform**: Kubernetes, ECS, Lambda, Heroku, bare metal (determines probe types and service discovery config).
57
-
58
- ### Monitoring stack inputs
59
- - **Metrics stack**: Prometheus + Grafana | Datadog | CloudWatch | New Relic | None (select one).
60
- - **Logging stack**: ELK (Elasticsearch + Logstash + Kibana) | Loki + Grafana | Datadog Logs | CloudWatch Logs | None.
61
- - **Tracing stack**: Jaeger | Zipkin | AWS X-Ray | Datadog APM | OpenTelemetry (select one).
62
- - **Alerting tool**: PagerDuty | OpsGenie | Slack | VictorOps | Email (escalation chain target).
63
- - **On-call rotation**: List of team members in the on-call rotation, in order of escalation.
64
-
65
- ### SLO/SLA inputs
66
- - **Availability SLO** (e.g., 99.9%): What uptime percentage is required? Determines error budget.
67
- - **Latency SLO** (e.g., p95 < 200 ms): What response time is acceptable for 95% of requests?
68
- - **Error rate SLO** (e.g., < 0.1%): What error rate is acceptable?
69
- - **Throughput minimum** (e.g., ≥ 100 RPS): Minimum throughput for the service to be considered operational.
70
-
71
- ### Incident response inputs
72
- - **Escalation chain**: Primary on-call → Secondary on-call → Engineering manager → VP Engineering (with contact info / PagerDuty IDs).
73
- - **Communication channels**: Incident Slack channel, status page URL, customer communication channel.
74
- - **Service dependencies**: What external services does this service depend on? (Used in runbook dependency checks.)
75
- - **Rollback procedure**: How is the service rolled back? (kubectl rollout undo, feature flag, etc.)
76
- - **Business impact**: What is the customer impact if this service is down? (Used for severity classification.)
39
+ | Category | Required inputs |
40
+ |---|---|
41
+ | Service | name, language/runtime, type, endpoints/operations, deployment platform |
42
+ | Monitoring stack | metrics (Prometheus/Datadog/CloudWatch), logging, tracing, alerting tool |
43
+ | SLO/SLA | availability %, latency SLO, error rate SLO, throughput minimum |
44
+ | Incident response | escalation chain, communication channels, service dependencies, rollback procedure |
77
45
 
78
46
  ---
79
47
 
80
48
  ## Outputs required
81
49
 
82
- ### Phase 1 outputs
83
- - `observability-architecture.md`: Complete observability topology diagram showing metrics, logs, and traces collection paths, storage, and visualization layers.
84
- - `instrumentation-guide.md`: Service-specific instrumentation instructions (which libraries to add, what to instrument, structured logging format).
85
-
86
- ### Phase 2 outputs
87
- - `dashboards/service-overview.json`: Grafana dashboard JSON (or Datadog dashboard JSON) with RED metrics panels.
88
- - `dashboards/service-details.json`: Detailed drill-down dashboard with per-endpoint latency histograms, error breakdowns, and resource utilization.
89
- - `dashboards/slo-tracking.json`: SLO/error budget burn rate dashboard.
90
-
91
- ### Phase 3 outputs
92
- - `alerts/alert-rules.yml`: Prometheus alerting rules (or Datadog monitor configs) with SLO-based thresholds.
93
- - `alerts/escalation-chain.yml`: PagerDuty/OpsGenie escalation policy config.
94
- - `alerts/alert-silence-template.md`: Template for silencing alerts during planned maintenance.
95
-
96
- ### Phase 4 outputs
97
- - `health-checks/readiness-probe.yml`: Kubernetes readiness probe configuration.
98
- - `health-checks/liveness-probe.yml`: Kubernetes liveness probe configuration.
99
- - `health-checks/health-endpoint-spec.md`: Specification for the `/health`, `/readiness`, and `/metrics` endpoints.
50
+ | Phase | Outputs |
51
+ |---|---|
52
+ | Phase 1 | `observability-architecture.md`, `instrumentation-guide.md` |
53
+ | Phase 2 | `dashboards/service-overview.json`, `service-details.json`, `slo-tracking.json` |
54
+ | Phase 3 | `alerts/alert-rules.yml`, `escalation-chain.yml`, `alert-silence-template.md` |
55
+ | Phase 4 | `health-checks/readiness-probe.yml`, `liveness-probe.yml`, `health-endpoint-spec.md` |
56
+ | Phase 5 | `runbooks/p0-runbook.md`, `p1-runbook.md`, `p2-runbook.md`, `post-mortem-template.md`, `INCIDENT_LOG.md` |
100
57
 
101
- ### Phase 5 outputs
102
- - `runbooks/p0-runbook.md`: Production down — all-hands incident runbook.
103
- - `runbooks/p1-runbook.md`: Production degraded — on-call incident runbook.
104
- - `runbooks/p2-runbook.md`: Partial degradation — business hours runbook.
105
- - `runbooks/post-mortem-template.md`: Blameless post-mortem template with 5-whys.
106
- - `INCIDENT_LOG.md`: Running log of all incidents (initialize with this skill, append after each incident).
58
+ See `references/workflow-phases.md` for per-phase implementation detail.
107
59
 
108
60
  ---
109
61
 
110
62
  ## Required tests
111
63
 
112
- ### Architecture tests
113
- - [ ] `test/observability/instrumentation.test.js`: Verifies service exports metrics endpoint at `/metrics` with required metric names (RED metrics + process metrics).
114
- - [ ] `test/observability/health-endpoint.test.js`: Verifies `/health`, `/readiness`, and `/liveness` endpoints return correct schemas and status codes under normal conditions.
115
- - [ ] `test/observability/structured-logging.test.js`: Verifies that all log output is valid JSON with required fields (timestamp, level, service, trace_id).
116
-
117
- ### Dashboard tests
118
- - [ ] `test/observability/dashboard-schema.test.js`: Validates generated Grafana dashboard JSON against Grafana's schema (all panels have valid datasource refs, correct query syntax).
119
- - [ ] `test/observability/dashboard-completeness.test.js`: Verifies dashboard has all required panels (rate, errors, duration, saturation).
120
-
121
- ### Alert tests
122
- - [ ] `test/observability/alert-rules-valid.test.js`: Validates Prometheus alert rules YAML with `promtool check rules alert-rules.yml`.
123
- - [ ] `test/observability/alert-threshold-coverage.test.js`: Verifies alert rules cover all required SLO burn rate windows (1h, 6h, 24h, 3d).
124
-
125
- ### Runbook tests
126
- - [ ] `test/observability/runbook-completeness.test.js`: Verifies each runbook has all required sections (severity definition, triage steps, escalation, resolution, post-mortem).
64
+ - `test/observability/instrumentation.test.js` — verifies `/metrics` endpoint exports RED metrics
65
+ - `test/observability/health-endpoint.test.js`: verifies `/health`, `/readiness`, `/liveness` schemas
66
+ - `test/observability/structured-logging.test.js`: verifies all logs are valid JSON with required fields
67
+ - `test/observability/dashboard-schema.test.js`: validates Grafana dashboard JSON schema
68
+ - `test/observability/alert-rules-valid.test.js` — validates Prometheus alert rules with `promtool`
127
69
 
128
70
  All tests must pass against fixtures in `fixtures/observability/`.
129
71
 
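The removed input description in this hunk notes that the availability SLO determines the error budget. As a worked example of that arithmetic, the sketch below converts the 99.9% example into allowed downtime. The 30-day window is an assumption; the diff does not specify a window length.

```shell
#!/usr/bin/env bash
# Error budget in minutes = (100 - SLO%) / 100 * minutes in the window.
slo_percent=99.9   # example value from the skill text
window_days=30     # assumed window; not specified in the diff

budget_minutes="$(awk -v slo="$slo_percent" -v days="$window_days" \
  'BEGIN { printf "%.1f", (100 - slo) / 100 * days * 24 * 60 }')"

echo "Allowed downtime: ${budget_minutes} minutes per ${window_days} days"
```

With these inputs the budget comes to 43.2 minutes, which is the quantity the burn-rate alert windows mentioned in the removed test list would be measured against.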
@@ -131,252 +73,76 @@ All tests must pass against fixtures in `fixtures/observability/`.
131
73
 
132
74
  ## Required fixtures
133
75
 
134
- - `fixtures/observability/monitoring-config-expected.json`: Prometheus scrape config + Grafana dashboard spec with correct structure.
135
- - `fixtures/observability/alert-policy-expected.json`: SLO-based alert rules with correct threshold calculations.
136
- - `fixtures/observability/incident-runbook-expected.json`: P1 incident runbook with correct structure and all required sections.
76
+ - `fixtures/observability/monitoring-config-expected.json`
77
+ - `fixtures/observability/alert-policy-expected.json`
78
+ - `fixtures/observability/incident-runbook-expected.json`
137
79
 
138
80
  ---
139
81
 
140
82
  ## Required contract updates
141
83
 
142
- Update the following when this skill's outputs change:
84
+ Update when outputs change:
85
+ - `contracts/observability/dashboard-schema.contract.json`
86
+ - `contracts/observability/alert-rule-schema.contract.json`
87
+ - `contracts/observability/health-endpoint.contract.json`
88
+ - `contracts/observability/agent-run-schema.json` (harness-level observability schema)
143
89
 
144
- - `contracts/observability/dashboard-schema.contract.json`: JSON Schema for generated dashboard configs. Update when new panel types are added.
145
- - `contracts/observability/alert-rule-schema.contract.json`: JSON Schema for alert rule configs. Update when threshold calculation logic changes.
146
- - `contracts/observability/health-endpoint.contract.json`: API contract for `/health`, `/readiness`, `/liveness` endpoints. Update when health check format changes.
147
-
148
- Contract update procedure:
149
- 1. Bump `version` field.
150
- 2. Set `changed_at` to current ISO timestamp.
151
- 3. Add `changelog` entry.
152
- 4. Re-run fixture tests.
153
- 5. Notify any consumers of the contract (teams using the generated configs).
90
+ Contract update procedure: bump `version`, set `changed_at`, add `changelog` entry, re-run fixture tests.
154
91
 
155
92
  ---
156
93
 
157
94
  ## Required codebase map updates
158
95
 
159
96
  After completing observability setup:
160
-
161
- ### `.codebase/CURRENT_STATE.md`
162
- - Add: `Observability: [service] instrumented [date]. Stack: [Prometheus|Datadog|CloudWatch].`
163
- - Add: `Runbooks: P0/P1/P2 runbooks generated for [service].`
164
-
165
- ### `.codebase/MODULE_INDEX.md`
166
- - Add entries for dashboard JSON files, alert rule files, and runbook files.
167
- - Add entries for any new health check endpoints added to the service.
168
-
169
- ### `observability/INCIDENT_LOG.md`
170
- - Initialize with the service name, observability architecture summary, and `No incidents yet` placeholder.
97
+ - `.codebase/CURRENT_STATE.md`: add `Observability: [service] instrumented [date]`
98
+ - `.codebase/MODULE_INDEX.md`: add entries for dashboard JSON, alert rule, runbook files
99
+ - `observability/INCIDENT_LOG.md`: initialize with service name + `No incidents yet`
171
100
 
172
101
  ---
173
102
 
174
103
  ## Token saving rules
175
104
 
176
- 1. **Reference existing dashboards**: If a dashboard for this service exists, diff it against requirements — do not regenerate the whole dashboard.
177
- 2. **Generate only needed runbooks**: Generate runbooks for the severity levels that apply. A simple internal tool only needs P1/P2 runbooks, not a P0 all-hands procedure.
178
- 3. **Reuse alert templates**: Base new alert rules on the existing template in `templates/alerting-policy-template.md`. Fill in thresholds, do not rewrite the structure.
179
- 4. **Summarize topology, don't draw ASCII art in prompts**: Reference `observability-architecture.md` by name, not by embedding it in subsequent prompts.
180
- 5. **Batch all dashboard panels**: Generate all Grafana panels in one pass. Do not loop back to add individual panels.
181
- 6. **Skip tracing config if no tracing stack selected**: If the team has not adopted distributed tracing, skip Phase 1 tracing topology and Phase 2 trace-based panels.
182
- 7. **Use compact JSON for dashboard fixtures**: Minify dashboard JSON in fixtures to reduce token consumption in test comparison.
105
+ 1. If a dashboard exists, diff it against requirements — do not regenerate the whole dashboard.
106
+ 2. Generate only runbooks for severity levels that apply (simple internal tool: P1/P2 only).
107
+ 3. Base new alert rules on `templates/alerting-policy-template.md`; fill in thresholds, do not rewrite the structure.
108
+ 4. Reference `observability-architecture.md` by name in prompts, not by embedding it.
109
+ 5. Generate all Grafana panels in one pass; do not loop back for individual panels.
110
+ 6. Skip tracing config if no tracing stack selected.
111
+ 7. Use compact JSON for dashboard fixtures to reduce token consumption.
183
112
 
184
113
  ---
185
114
 
186
115
  ## Acceptance criteria
187
116
 
188
- Observability setup is COMPLETE and ACCEPTED when ALL of the following are true:
189
-
190
- ### Instrumentation
191
- - [ ] Service exports `/metrics` endpoint (Prometheus format) or sends metrics to the configured metrics backend.
192
- - [ ] All RED metrics are present: `requests_total`, `request_duration_seconds`, `request_errors_total`.
193
- - [ ] All logs are structured JSON with: `timestamp`, `level`, `service`, `trace_id`, `span_id`, `message`.
194
- - [ ] Distributed traces are being collected (if tracing stack is configured).
195
-
196
- ### Dashboards
197
- - [ ] Service overview dashboard exists and is deployed to the monitoring stack.
198
- - [ ] Dashboard shows: Rate (RPS), Errors (error rate %), Duration (p50/p95/p99), Saturation (CPU, memory, connection pool).
199
- - [ ] SLO tracking panel shows current error budget remaining.
200
- - [ ] Dashboard is linked from the service's README or internal wiki.
201
-
202
- ### Alerts
203
- - [ ] SLO burn rate alerts exist for fast burn (1h/6h windows) and slow burn (24h/3d windows).
204
- - [ ] All alerts have: `severity` label, `runbook_url` annotation, `description` annotation.
205
- - [ ] Escalation chain is configured in the alerting tool (PagerDuty/OpsGenie).
206
- - [ ] At least one alert has been test-fired to verify the escalation chain works end-to-end.
207
-
208
- ### Health checks
209
- - [ ] `/health` endpoint exists and returns `{"status": "ok"}` with HTTP 200 when healthy.
210
- - [ ] `/readiness` probe is configured in Kubernetes (or equivalent).
211
- - [ ] `/liveness` probe is configured with appropriate failure thresholds.
212
-
213
- ### Runbooks
214
- - [ ] P1 runbook exists and covers: detection, triage, escalation, resolution, post-mortem.
215
- - [ ] Runbook is linked in all alert `runbook_url` annotations.
216
- - [ ] Runbook has been reviewed by the on-call team and is accessible without production access (stored in wiki or repo).
217
- - [ ] Post-mortem template is ready for use.
218
-
219
- ---
220
-
221
- ## Common mistakes
117
+ **Instrumentation**: `/metrics` endpoint exports `requests_total`, `request_duration_seconds`, `request_errors_total`. Logs are structured JSON with `timestamp`, `level`, `service`, `trace_id`.
222
118
 
223
- ### Mistake 1: Alerting on symptoms without causes
224
- **Problem**: Alert fires on "CPU > 80%" but CPU being high is a symptom, not a cause. On-call engineer doesn't know what to do.
225
- **Fix**: Alert on user-facing symptoms (error rate, latency) and provide runbooks that help diagnose the underlying cause. Pair symptom alerts with diagnostic links.
119
+ **Dashboards**: RED metrics (Rate, Errors, Duration) + Saturation panels. SLO tracking panel shows error budget remaining.
226
120
 
227
- ### Mistake 2: Alert fatigue from too many low-quality alerts
228
- **Problem**: 50 alerts firing every week, most of which are noise. On-call engineers start ignoring alerts ("boy who cried wolf").
229
- **Fix**: Start with only SLO-based alerts. Achieve < 5 actionable alerts per week. Every alert must have a runbook and a clear action. Regularly review false positive rates.
121
+ **Alerts**: SLO burn rate alerts for 1h/6h/24h/3d windows. All alerts have `severity`, `runbook_url`, `description`. Escalation chain test-fired end-to-end.
230
122
 
231
- ### Mistake 3: Dashboards without context
232
- **Problem**: Dashboard shows a graph going up but no reference line to know if that's good or bad.
233
- **Fix**: Every metric panel must have a reference line or annotation showing the SLA target, the previous week's baseline, or an absolute threshold. "Is this normal?" should be answerable from the dashboard alone.
123
+ **Health checks**: `/health` returns 200 OK. `/readiness` checks actual dependencies. K8s probes configured.
234
124
 
235
- ### Mistake 4: Missing the "long tail" in alerting windows
236
- **Problem**: Alert only fires when error rate > 5% for 5 minutes. A slow 0.5% burn for 48 hours exhausts the entire monthly error budget without triggering any alert.
237
- **Fix**: Implement multi-window alerting: fast burn (≥ 2% in 1 hour), medium burn (≥ 5% in 6 hours), slow burn (≥ 10% in 3 days). Cover both fast and slow failures.
125
+ **Runbooks**: P1 runbook covers detection/triage/escalation/resolution/post-mortem. Linked from alert `runbook_url`. Reviewed by on-call team.
238
126
 
239
- ### Mistake 5: Runbooks that only the original author can follow
240
- **Problem**: Runbook says "check the logs" without specifying where, how, or what to look for. New on-call engineer is lost.
241
- **Fix**: Write runbooks for the most junior person on the rotation. Include exact commands to run, exact queries to execute, and exact thresholds that indicate each diagnosis. Link to the monitoring dashboard directly.
127
+ ---
242
128
 
243
- ### Mistake 6: Health checks that always return 200
244
- **Problem**: `/health` endpoint returns HTTP 200 even when the database is unreachable. Kubernetes load balancer continues routing traffic to a broken pod.
245
- **Fix**: Health check must verify actual service dependencies. `/readiness` should check DB connectivity, cache connectivity, and any critical downstream dependencies. Return 503 if any dependency is unhealthy.
129
+ ## Common mistakes
246
130
 
247
- ### Mistake 7: Observability only in production
248
- **Problem**: Monitoring is only set up for production. Issues are invisible in staging and only discovered after production deployment.
249
- **Fix**: Deploy the same observability stack in staging. Run integration tests against the `/metrics` endpoint. Validate alert rules in staging before production.
131
+ See `references/common-mistakes-and-recovery.md` for the full list of 8 mistakes with fixes.
250
132
 
251
- ### Mistake 8: Missing trace context in logs
252
- **Problem**: Logs don't include `trace_id` or `span_id`. When investigating an incident, there's no way to correlate a specific user request across microservices.
253
- **Fix**: Inject `trace_id` and `span_id` into all log lines using OpenTelemetry or manual propagation. This is the #1 enabler of fast incident resolution in distributed systems.
133
+ **Top 3**:
134
+ - M1: Alerting on resource symptoms (high CPU) instead of user-facing signals (error rate, latency)
135
+ - M4: Missing multi-window alerting; a slow burn exhausts the budget without triggering an alert
136
+ - M6: Health checks always returning 200 even when dependencies are down
254
137
 
255
138
  ---
256
139
 
257
140
  ## Recovery workflow
258
141
 
259
- ### Recovery 1: Metrics not appearing in dashboard
260
- ```
261
- Symptom: Dashboard shows "No data" for all panels.
262
- Step 1: Verify service is running: kubectl get pods -n [namespace]
263
- Step 2: Check /metrics endpoint directly: curl http://service-host:9090/metrics | grep http_requests_total
264
- Step 3: Check Prometheus scrape config: kubectl describe servicemonitor [name] -n monitoring
265
- Step 4: Check Prometheus targets: open Prometheus UI → Status → Targets → look for service in targets list
266
- Step 5: If service is in targets but metrics missing: check instrumentation code (is metrics library initialized before first request?)
267
- Step 6: If service is NOT in targets: check Prometheus scrape config selector labels match service labels
268
- ```
269
-
270
- ### Recovery 2: Alert not firing when it should
271
- ```
272
- Symptom: Error rate is clearly high but no alert fired.
273
- Step 1: Check Prometheus alert rule with: promtool check rules alert-rules.yml
274
- Step 2: Evaluate the alert expression manually in Prometheus: Prometheus UI → Graph → paste alert expression
275
- Step 3: Check alert state: Prometheus UI → Alerts → find the alert rule → check if it's in "Inactive" state
276
- Step 4: If alert is firing but not notifying: check Alertmanager config and routing rules
277
- Step 5: Check Alertmanager status: kubectl exec -n monitoring alertmanager-pod -- amtool alert
278
- Step 6: Test escalation manually: send a test alert through PagerDuty/OpsGenie UI
279
- ```
280
-
281
- ### Recovery 3: Health check causing false pod restarts
282
- ```
283
- Symptom: Pods are being killed by liveness probe even though service is working.
284
- Step 1: Check liveness probe config: kubectl describe pod [pod-name] | grep -A 10 Liveness
285
- Step 2: Check if probe timeout is too short (default is 1s — increase if health check queries DB)
286
- Step 3: Check if failure threshold is too low (default is 3 consecutive failures — may be too aggressive)
287
- Step 4: Check /health endpoint response time under load: does it exceed liveness probe timeout?
288
- Step 5: Fix: increase timeoutSeconds, increase failureThreshold, or optimize the health check endpoint
289
- Step 6: Recommended safe config: initialDelaySeconds: 30, timeoutSeconds: 5, failureThreshold: 5
290
- ```
291
-
292
- ### Recovery 4: Runbook is wrong or out of date
293
- ```
294
- Symptom: On-call engineer followed runbook but steps don't work or are inaccurate.
295
- Step 1: Immediately annotate the incorrect step with [OUTDATED: <brief note>] so next person is warned.
296
- Step 2: After incident is resolved, open a PR to correct the runbook.
297
- Step 3: Re-run the corrected runbook steps in a test environment to verify they work.
298
- Step 4: Add incident to post-mortem action items: "Update runbook [name] step [N]."
299
- Step 5: Assign runbook review to the on-call engineer who caught the error.
300
- ```
301
-
302
- ---
303
-
304
- ## Workflow Detail: Phase-by-Phase Execution
305
-
306
- ### Phase 1: Observability Architecture Generation
307
-
308
- **Goal**: Design and document the complete observability topology before writing any configuration.
309
-
310
- **Architecture components to define:**
311
-
312
- | Pillar | Component | Purpose |
313
- |--------|-----------|---------|
314
- | Metrics | Prometheus / Datadog Agent | Scrape and store numeric time-series |
315
- | Metrics | Grafana / Datadog Dashboards | Visualize and alert on metrics |
316
- | Logs | Structured logging library | Produce machine-readable log events |
317
- | Logs | Log aggregator (Loki/ELK/CloudWatch) | Collect and index logs |
318
- | Logs | Kibana/Grafana/Datadog | Search and visualize logs |
319
- | Traces | OpenTelemetry SDK | Instrument service for tracing |
320
- | Traces | Jaeger/Zipkin/Datadog APM | Collect and visualize traces |
321
-
322
- **Service instrumentation requirements (by language):**
323
- - Node.js: `prom-client` (metrics), `winston` or `pino` (structured logs), `@opentelemetry/sdk-node` (traces).
324
- - Python: `prometheus_client` (metrics), `structlog` or `python-json-logger` (logs), `opentelemetry-sdk` (traces).
325
- - Go: `prometheus/client_golang` (metrics), `zap` or `logrus` (logs), `go.opentelemetry.io/otel` (traces).
326
-
327
- ### Phase 2: Dashboard Generation
328
-
329
- **Required panels for every service dashboard:**
330
-
331
- RED metrics (the minimum viable dashboard for any service):
332
- - **Rate**: Requests per second (total and per endpoint).
333
- - **Errors**: Error rate percentage (4xx and 5xx separately).
334
- - **Duration**: Response time as a histogram with p50, p95, p99 lines.
335
-
336
- SATURATION metrics (resource utilization):
337
- - **CPU**: Process CPU utilization %.
338
- - **Memory**: Heap and RSS memory.
339
- - **Connection pool**: Active connections vs. pool limit.
340
- - **Queue depth**: (For background workers) — job queue length.
341
-
342
- See `templates/monitoring-dashboard-template.md` for complete Grafana JSON scaffold.
343
-
344
- ### Phase 3: Alerting Policy Generation
345
-
346
- **SLO-based alert threshold calculation:**
347
-
348
- For a 99.9% availability SLO (monthly error budget = 43.8 minutes):
349
-
350
- ```
351
- Fast burn alert (1h window):
352
- Threshold: error_rate > 2% for 1 hour
353
- Reason: 2% error rate burns 2% of monthly budget per hour = exhausted in 50 hours
354
- Action: Page on-call immediately (P1)
355
-
356
- Medium burn alert (6h window):
357
- Threshold: error_rate > 0.5% for 6 hours
358
- Reason: 0.5% × 6h = 3% of monthly budget consumed
359
- Action: Page on-call (P2 — business hours response acceptable)
360
-
361
- Slow burn alert (3d window):
362
- Threshold: error_rate > 0.1% for 72 hours
363
- Reason: 0.1% × 72h = 7.2% of monthly budget consumed silently
364
- Action: Slack notification + ticket creation (investigate next sprint)
365
- ```
366
-
367
- See `templates/alerting-policy-template.md` for complete Prometheus alerting rules.
368
-
369
- ### Phase 4: Health Check Automation
370
-
371
- **Standard health endpoint specification:**
372
- - `GET /health` → 200 OK always (used by load balancers for basic routing).
373
- - `GET /readiness` → 200 if all dependencies healthy, 503 if any dependency unhealthy.
374
- - `GET /liveness` → 200 if process is alive and event loop is not stuck, 503 if deadlocked.
375
- - `GET /metrics` → Prometheus text format metrics.
376
-
377
- ### Phase 5: Incident Response Runbook Generation
378
-
379
- **Runbook structure requirements:**
380
- Every runbook must have: Severity definition, Detection signals, Triage steps (ordered, with commands), Escalation triggers, Resolution steps, Rollback procedure, Communication templates, Post-mortem checklist.
142
+ See `references/common-mistakes-and-recovery.md` for the full recovery playbooks R1–R4.
381
143
 
382
- See `playbooks/incident-triage-playbook.md` for complete P0/P1/P2/P3 runbooks.
144
+ **Quick reference**:
145
+ - R1: Metrics missing → check `/metrics` endpoint, Prometheus targets, scrape config labels
146
+ - R2: Alert not firing → `promtool check rules`, check Alertmanager routing
147
+ - R3: False pod restarts → increase `timeoutSeconds` and `failureThreshold` in liveness probe
148
+ - R4: Runbook outdated → annotate `[OUTDATED]`, fix post-incident, re-run in test environment
@@ -0,0 +1,84 @@
1
+ # Observability — Common Mistakes & Recovery Playbook
2
+
3
+ Detailed documentation of common mistakes and how to recover from them.
4
+ Referenced by `genesis-observability-automation/SKILL.md` → `## Common mistakes` and `## Recovery workflow`.
5
+
6
+ ---
7
+
8
+ ## Common Mistakes
9
+
10
+ ### M1: Alerting on symptoms without causes
11
+ The alert fires on "CPU > 80%", but high CPU is a symptom, not a cause.
12
+ **Fix**: Alert on user-facing symptoms (error rate, latency). Pair them with diagnostic runbooks.
13
+
14
+ ### M2: Alert fatigue from too many low-quality alerts
15
+ 50 alerts/week, most of them noise. On-call engineers start ignoring them.
16
+ **Fix**: Use only SLO-based alerts. Target < 5 actionable alerts/week.
17
+
18
+ ### M3: Dashboards without context
19
+ The graph goes up, but there is no way to tell whether that is good or bad.
20
+ **Fix**: Every metric panel must have a reference line or an SLA target annotation.
21
+
22
+ ### M4: Missing "long tail" in alerting windows
23
+ The alert only fires when the error rate is > 5% for 5 minutes. A slow 0.5% burn over 48 hours exhausts the budget silently.
24
+ **Fix**: Multi-window alerting: fast burn (1h), medium burn (6h), slow burn (3d).
25
+
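A minimal sketch of how the three windows in M4 could be turned into thresholds. It assumes a 30-day (720 h) budget period and a 99.9% SLO, and it borrows the budget shares (2% in 1 h, 5% in 6 h, 10% in 3 d) from the previous, longer version of this skill; `BurnWindow`, `burn_rate_threshold`, and `error_rate_threshold` are hypothetical names, not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    name: str
    hours: float
    budget_fraction: float  # share of the period's error budget spent in this window


# Assumed windows and budget shares (illustrative, tune per service).
WINDOWS = [
    BurnWindow("fast", 1, 0.02),
    BurnWindow("medium", 6, 0.05),
    BurnWindow("slow", 72, 0.10),
]


def burn_rate_threshold(window: BurnWindow, period_hours: float = 720.0) -> float:
    """Burn rate (multiple of the budgeted error rate) that spends
    `budget_fraction` of the budget within `window.hours`."""
    return window.budget_fraction * period_hours / window.hours


def error_rate_threshold(slo: float, window: BurnWindow,
                         period_hours: float = 720.0) -> float:
    """Error ratio to alert on: burn rate times the budgeted error ratio (1 - SLO)."""
    return burn_rate_threshold(window, period_hours) * (1.0 - slo)


if __name__ == "__main__":
    for w in WINDOWS:
        print(w.name, round(burn_rate_threshold(w), 2),
              round(error_rate_threshold(0.999, w), 5))
```

Under these assumptions the slow window alerts at a burn rate of about 1, which is what catches a sustained low-level burn that a short, high-threshold rule never sees.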
26
+ ### M5: Runbooks only the author can follow
27
+ The runbook says "check the logs" without specifying where or how.
28
+ **Fix**: Write for the most junior person on the team. Include exact commands, queries, and thresholds.
29
+
30
+ ### M6: Health checks that always return 200
31
+ `/health` returns 200 even when the DB is unreachable. K8s keeps routing traffic to the broken pod.
32
+ **Fix**: The health check must verify actual dependencies. Return 503 if any dependency is unhealthy.
33
+
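The fix for M6 can be sketched independently of any web framework: run one probe per dependency and map the combined result to 200 or 503. The `readiness` function and the probe names below are hypothetical; a real service would wire the returned status and body into its own `/readiness` handler.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each dependency probe and return (HTTP status, JSON-serializable body).

    A probe that raises is treated as unhealthy, so an unreachable
    dependency yields 503 instead of crashing the handler.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "unavailable", "dependencies": results}
    return (200 if healthy else 503), body


if __name__ == "__main__":
    def db_ping() -> bool:
        # Stand-in for a real connectivity check (assumption for illustration).
        raise ConnectionError("database unreachable")

    print(readiness({"db": db_ping, "cache": lambda: True}))
```

Keeping the probes as plain callables makes it easy to give each one its own short timeout, which matters because a slow probe can trip the liveness settings discussed in R3.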
34
+ ### M7: Observability only in production
35
+ Monitoring exists only in prod, so issues are invisible in staging.
36
+ **Fix**: Deploy the same observability stack in staging. Run integration tests against the `/metrics` endpoint.
37
+
38
+ ### M8: Missing trace context in logs
39
+ Logs lack `trace_id`/`span_id`, so a request cannot be correlated across microservices.
40
+ **Fix**: Inject `trace_id` and `span_id` into every log line via OpenTelemetry.
41
+
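One way to picture the M8 fix using only the Python standard library: a logging filter copies the current trace context onto every record, and a formatter emits the structured JSON fields. This is a sketch under assumptions; `trace_id_var` and `span_id_var` are hypothetical context variables standing in for whatever the tracing SDK (for example OpenTelemetry) exposes as the active span.

```python
import json
import logging
from contextvars import ContextVar

# Hypothetical holders for the active trace context (assumption: request
# middleware or the tracing SDK sets these at the start of each request).
trace_id_var: ContextVar[str] = ContextVar("trace_id", default="")
span_id_var: ContextVar[str] = ContextVar("span_id", default="")


class TraceContextFilter(logging.Filter):
    """Attach the current trace_id and span_id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        record.span_id = span_id_var.get()
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
            "message": record.getMessage(),
        })


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter(service="example-service"))
    log = logging.getLogger("demo")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")
    span_id_var.set("00f067aa0ba902b7")
    log.info("order created")
```

Attaching the filter to the handler rather than to a single logger means records from third-party libraries routed through that handler pick up the same fields.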
42
+ ---
43
+
44
+ ## Recovery Playbook
45
+
46
+ ### R1: Metrics not appearing in dashboard
47
+ ```
48
+ Symptom: Dashboard shows "No data" for all panels.
49
+ 1. Verify service running: kubectl get pods -n [namespace]
50
+ 2. Check /metrics directly: curl http://service:9090/metrics | grep http_requests_total
51
+ 3. Check Prometheus scrape config: kubectl describe servicemonitor [name] -n monitoring
52
+ 4. Check Prometheus targets: Prometheus UI → Status → Targets
53
+ 5. If in targets but missing: check instrumentation code init order
54
+ 6. If NOT in targets: check scrape config selector labels
55
+ ```
56
+
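Step 2 of R1 (checking `/metrics` directly) can also be automated, for instance in a staging integration test. The sketch below parses Prometheus text exposition output and reports which required metric names are absent; `metric_names` and `missing_metrics` are hypothetical helpers, and fetching the text from the endpoint is left to the caller.

```python
from typing import Iterable, List, Set

# Histograms and summaries expose samples under suffixed names.
_SUFFIXES = ("_bucket", "_sum", "_count")


def metric_names(exposition_text: str) -> Set[str]:
    """Collect metric names from Prometheus text-format output."""
    names: Set[str] = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        token = line.split("{", 1)[0].split()[0]
        names.add(token)
        for suffix in _SUFFIXES:
            if token.endswith(suffix):
                names.add(token[: -len(suffix)])  # also record the base name
    return names


def missing_metrics(exposition_text: str, required: Iterable[str]) -> List[str]:
    """Return the required metric names that do not appear in the output."""
    return sorted(set(required) - metric_names(exposition_text))


if __name__ == "__main__":
    sample = (
        "# TYPE requests_total counter\n"
        'requests_total{method="GET"} 10\n'
        'request_duration_seconds_bucket{le="0.5"} 7\n'
        "request_duration_seconds_sum 3.2\n"
        "request_duration_seconds_count 10\n"
    )
    print(missing_metrics(
        sample,
        ["requests_total", "request_duration_seconds", "request_errors_total"],
    ))
```

An empty result corresponds to the "in targets but missing" branch being ruled out at the service level, which narrows the search to scrape configuration.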
57
+ ### R2: Alert not firing when it should
58
+ ```
59
+ Symptom: Error rate clearly high but no alert.
60
+ 1. Validate rules: promtool check rules alert-rules.yml
61
+ 2. Evaluate expression manually: Prometheus UI → Graph
62
+ 3. Check alert state: Prometheus UI → Alerts
63
+ 4. If firing but not notifying: check Alertmanager routing
64
+ 5. Check Alertmanager: kubectl exec alertmanager -- amtool alert
65
+ 6. Test escalation manually via PagerDuty/OpsGenie UI
66
+ ```
67
+
68
+ ### R3: Health check causing false pod restarts
69
+ ```
70
+ Symptom: Pods killed by liveness probe though service works.
71
+ 1. Check probe: kubectl describe pod [pod] | grep -A 10 Liveness
72
+ 2. Increase timeout if probe queries DB (default 1s often too short)
73
+ 3. Increase failureThreshold if too aggressive (default 3)
74
+ 4. Recommended safe config: initialDelaySeconds:30, timeoutSeconds:5, failureThreshold:5
75
+ ```
76
+
77
+ ### R4: Runbook out of date
78
+ ```
79
+ Symptom: On-call followed runbook but steps don't work.
80
+ 1. Annotate incorrect step immediately: [OUTDATED: <brief note>]
81
+ 2. After incident resolved, open PR to correct
82
+ 3. Re-run corrected steps in test environment
83
+ 4. Add to post-mortem: "Update runbook [name] step [N]"
84
+ ```