codex-genesis-harness 0.1.7 → 0.1.8
- package/.codebase/COMPRESSED_CONTEXT.md +80 -0
- package/.codebase/CURRENT_STATE.md +37 -11
- package/.codebase/DEPENDENCY_GRAPH.md +14 -1
- package/.codebase/IMPLEMENTATION_HANDOFF.md +34 -336
- package/.codebase/KNOWN_PROBLEMS.md +54 -3
- package/.codebase/MODULE_INDEX.md +8 -0
- package/.codebase/PIPELINE_FLOW.md +7 -5
- package/.codebase/RECOVERY_POINTS.md +17 -78
- package/.codebase/TECH_DEBT.md +6 -0
- package/.codebase/TEST_MATRIX.md +4 -3
- package/.codebase/VISUAL_GRAPH.md +127 -0
- package/.codebase/context-policy.json +68 -0
- package/.codebase/memories/lessons_learned.md +21 -0
- package/.codebase/memories/preferences.md +17 -0
- package/.codebase/state.json +45 -24
- package/.codex/skills/genesis-architecture/SKILL.md +5 -0
- package/.codex/skills/genesis-debug-guide/SKILL.md +10 -4
- package/.codex/skills/genesis-docs-automation/SKILL.md +52 -973
- package/.codex/skills/genesis-executing-plans/SKILL.md +54 -0
- package/.codex/skills/genesis-executing-plans/agents/openai.yaml +6 -0
- package/.codex/skills/genesis-executing-plans/checklists/.gitkeep +0 -0
- package/.codex/skills/genesis-executing-plans/examples/.gitkeep +0 -0
- package/.codex/skills/genesis-executing-plans/templates/.gitkeep +0 -0
- package/.codex/skills/genesis-harness/SKILL.md +64 -1385
- package/.codex/skills/genesis-harness/scripts/check-docs-sync.sh +3 -3
- package/.codex/skills/genesis-harness/scripts/init-planning.sh +1 -1
- package/.codex/skills/genesis-new-design/SKILL.md +4 -1
- package/.codex/skills/genesis-new-design/agents/openai.yaml +2 -0
- package/.codex/skills/genesis-observability-automation/SKILL.md +69 -303
- package/.codex/skills/genesis-observability-automation/references/common-mistakes-and-recovery.md +84 -0
- package/.codex/skills/genesis-observability-automation/references/workflow-phases.md +78 -0
- package/.codex/skills/genesis-performance-profiling/SKILL.md +1 -22
- package/.codex/skills/genesis-performance-profiling/agents/openai.yaml +1 -1
- package/.codex/skills/genesis-planning/SKILL.md +6 -1
- package/.codex/skills/genesis-release/SKILL.md +5 -0
- package/.codex/skills/genesis-research-first/SKILL.md +6 -0
- package/.codex/skills/genesis-spec-propagation/SKILL.md +52 -504
- package/.codex/skills/genesis-test-driven-development/SKILL.md +55 -0
- package/.codex/skills/genesis-test-driven-development/agents/openai.yaml +6 -0
- package/.codex/skills/genesis-test-driven-development/checklists/.gitkeep +0 -0
- package/.codex/skills/genesis-test-driven-development/examples/.gitkeep +0 -0
- package/.codex/skills/genesis-test-driven-development/templates/.gitkeep +0 -0
- package/.codex/skills/genesis-upgrade-design/SKILL.md +4 -2
- package/.codex/skills/genesis-upgrade-design/agents/openai.yaml +2 -0
- package/.codex/skills/genesis-using-git-worktrees/SKILL.md +54 -0
- package/.codex/skills/genesis-using-git-worktrees/agents/openai.yaml +6 -0
- package/.codex/skills/genesis-using-git-worktrees/checklists/.gitkeep +0 -0
- package/.codex/skills/genesis-using-git-worktrees/examples/.gitkeep +0 -0
- package/.codex/skills/genesis-using-git-worktrees/templates/.gitkeep +0 -0
- package/.codex/skills/genesis-verification-before-completion/SKILL.md +53 -0
- package/.codex/skills/genesis-verification-before-completion/agents/openai.yaml +6 -0
- package/.codex/skills/genesis-verification-before-completion/checklists/.gitkeep +0 -0
- package/.codex/skills/genesis-verification-before-completion/examples/.gitkeep +0 -0
- package/.codex/skills/genesis-verification-before-completion/templates/.gitkeep +0 -0
- package/.codex/skills/spec-impact-engine/SKILL.md +77 -500
- package/.codex/skills/spec-impact-engine/checklists/checklist.md +10 -0
- package/.codex-plugin/plugin.json +3 -4
- package/CHANGELOG.md +4 -1
- package/README.EN.md +32 -17
- package/README.VI.md +35 -19
- package/README.md +48 -10
- package/VERSION +1 -1
- package/bin/genesis-harness.js +735 -5
- package/contracts/features/registry-schema.json +15 -0
- package/contracts/observability/agent-run-schema.json +34 -0
- package/contracts/observability/failure-schema.json +35 -0
- package/contracts/ui/auth/login-screen-contract.json +43 -0
- package/features/REGISTRY.md +63 -0
- package/features/SCOPE-template.md +65 -0
- package/fixtures/planning/MOCKUP_PROMPT_TEMPLATE.md +16 -0
- package/observability/agent-runs/sample-run.json +13 -0
- package/observability/decision-logs/sample-decision.md +43 -0
- package/observability/failures/sample-failure.json +12 -0
- package/package.json +9 -3
- package/playwright/e2e/app-template.spec.js +37 -0
- package/playwright/e2e/auth/login-screen.spec.js +65 -0
- package/playwright/e2e/web-template.spec.js +28 -0
- package/scripts/check-scope.sh +100 -0
- package/scripts/cold-start-check.js +133 -0
- package/scripts/install.sh +4 -0
- package/scripts/prompt_sentinel.js +35 -4
- package/scripts/run-evals.sh +119 -3
- package/scripts/scratch_parser.js +49 -0
- package/scripts/spec_visual_sync.js +1 -1
- package/scripts/test_generator.js +2 -2
- package/scripts/uninstall.sh +4 -0
- package/scripts/verify.sh +16 -1
- package/tests/integration/cli-smoke.test.js +103 -0
- package/tests/unit/feature_registry.test.js +152 -0
- package/tests/unit/prompt_sentinel.test.js +1 -1
- package/tests/unit/spec_visual_sync.test.js +1 -1
- package/tests/unit/test_generator.test.js +1 -1
- package/playwright/e2e/e2e-template.md +0 -4
@@ -11,8 +11,9 @@ if ! git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
 fi
 
 changed="$(git diff --name-only HEAD 2>/dev/null || git diff --name-only)"
-
-
+docs_pattern='^(\.planning/|\.codebase/|\.codex/SKILLS_INDEX\.md|docs/|README(\.[A-Z]{2})?\.md|AGENTS\.md|CHANGELOG\.md)'
+docs_changed="$(printf '%s\n' "$changed" | grep -E "$docs_pattern" || true)"
+code_changed="$(printf '%s\n' "$changed" | grep -Ev "$docs_pattern" || true)"
 
 if [ -n "$code_changed" ] && [ -z "$docs_changed" ]; then
 echo "Code changed but no planning/docs files changed."
@@ -21,4 +22,3 @@ if [ -n "$code_changed" ] && [ -z "$docs_changed" ]; then
 fi
 
 echo "Docs sync check passed or no code changes detected."
-
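The added lines in this hunk split the changed paths into docs and code using one shared pattern. A minimal sketch of that classification follows. It assumes a hypothetical list of changed paths in place of real `git diff --name-only` output; the sample paths are illustrative only.

```shell
#!/usr/bin/env bash
# Pattern copied from the added line in the hunk.
docs_pattern='^(\.planning/|\.codebase/|\.codex/SKILLS_INDEX\.md|docs/|README(\.[A-Z]{2})?\.md|AGENTS\.md|CHANGELOG\.md)'

# Hypothetical stand-in for the output of `git diff --name-only`.
changed='README.VI.md
bin/genesis-harness.js
.codebase/state.json
scripts/verify.sh'

# Paths matching the pattern count as docs; everything else counts as code.
# `|| true` keeps the assignment from failing when grep finds no match.
docs_changed="$(printf '%s\n' "$changed" | grep -E "$docs_pattern" || true)"
code_changed="$(printf '%s\n' "$changed" | grep -Ev "$docs_pattern" || true)"

printf 'docs:\n%s\ncode:\n%s\n' "$docs_changed" "$code_changed"
```

The pattern is anchored only at the start of each path, so everything under `docs/` counts as documentation. Files under `.codex/skills/` fall on the code side, because only `.codex/SKILLS_INDEX.md` is listed.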
@@ -3,6 +3,7 @@ set -euo pipefail
 
 confirmed="${PROJECT_BRIEF_CONFIRMED:-0}"
 root="."
+script_source="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 
 while [ "$#" -gt 0 ]; do
 case "$1" in
@@ -741,7 +742,6 @@ EOF
 done
 
 mkdir -p .planning/scripts
-script_source="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 for script in "$script_source"/*.sh; do
 cp "$script" ".planning/scripts/$(basename "$script")"
 chmod +x ".planning/scripts/$(basename "$script")"
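These two hunks move the `script_source` assignment from just before the copy loop to the top of the script, ahead of argument parsing. One plausible reason is that resolving the directory once, early, gives an absolute path that stays valid whatever the script does to its working directory later. That is an assumption, since the rest of the script is not shown. The sketch below wraps the same `cd`/`dirname`/`pwd` idiom in a hypothetical helper, `resolve_dir`, so it can be tried with any path.

```shell
#!/usr/bin/env bash
# resolve_dir is a hypothetical helper, not part of the package.
# It prints the absolute directory containing the given path, using the same
# idiom the script applies to "${BASH_SOURCE[0]}".
resolve_dir() {
  # The subshell keeps the caller's working directory unchanged.
  (cd "$(dirname "$1")" && pwd)
}

# A relative input still resolves to an absolute directory.
here="$(resolve_dir "./example.sh")"
printf 'resolved: %s\n' "$here"
```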
@@ -72,7 +72,10 @@ If visual output fails, capture screenshot evidence, update the fixture or contr
 5. Verify the result:
 - Run the project checks that prove the UI compiles.
 - Start the local dev server when needed.
--
+- **Automated Visual Verification Loop**:
+1. Use the `@modelcontextprotocol/server-puppeteer` tool to automatically navigate to the local dev server (e.g., `http://localhost:3000`) and capture a screenshot of the current UI.
+2. **Target Matching**: Use your Vision capabilities to compare your screenshot against `mockup.png`. Iterate and update the code until the UI matches 100%.
+3. **Aesthetic Audit**: Analyze your screenshot for aesthetic flaws (alignment, spacing, typography, bad AI-generated slop, inconsistent styles) and autonomously update the code to fix them. Repeat the screenshot capture until the UI looks premium.
 
 ## Design Rules
 
@@ -7,123 +7,65 @@ description: "Automate observability architecture, monitoring dashboard config,
 
 ## Purpose
 
-
+Automates the full lifecycle of observability for software services: generates architecture diagrams (metrics/logs/traces topology), monitoring dashboard configs, SLO-based alerting policies, health check configuration, and incident response runbooks.
 
-
-
-**Core philosophy**: You cannot operate what you cannot see. You cannot respond to incidents you cannot detect. Observability must be designed before production launch, not retrofitted after the first outage. Every service must expose the three pillars (metrics, logs, traces) before it ships.
+**Core philosophy**: You cannot operate what you cannot see. Observability must be designed before production launch. Every service must expose three pillars (metrics, logs, traces) before shipping.
 
 ---
 
 ## When to use
 
-
-
--
--
--
--
-- Sprint planning includes observability-related tickets (dashboard, alert, runbook) and you need to generate them efficiently.
-- An SRE or on-call rotation is being established and needs standard runbooks and escalation chains.
-- An audit or compliance review requires documented incident response procedures.
-- You need to validate that an existing service's observability meets a defined maturity level before certifying it as production-ready.
-- A new team member is joining on-call and needs structured runbooks to operate the service safely.
+- A new service approaching production needs observability before launch.
+- A service suffering repeated incidents due to lack of visibility.
+- Migrating monitoring stacks (e.g., from scripts to Prometheus + Grafana).
+- Post-mortem action items include "we need better monitoring" or "we need runbooks."
+- Sprint planning includes observability tickets (dashboard, alert, runbook).
+- An SRE or on-call rotation is being established and needs standard runbooks.
 
 ---
 
 ## When NOT to use
 
-
-
--
--
--
-- You need to diagnose a currently active incident. Use the existing runbooks and monitoring tools. This skill generates runbooks — it does not replace them during an emergency.
-- The monitoring stack has not been decided yet. Run `genesis-planning` first to select the monitoring stack, then return to this skill.
-- You need network-layer observability (packet capture, flow logs) — this skill covers application-layer observability. Use a dedicated network observability tool (e.g., Wireshark, VPC flow logs) for network-layer issues.
+- Prototype or demo that will never go to production.
+- Quick manual alert on a single metric (use the monitoring tool UI directly).
+- Service already has mature observability and only needs one added metric.
+- During an active incident (use existing runbooks — this skill generates them, not replaces them).
+- Monitoring stack not decided yet (run `genesis-planning` first).
 
 ---
 
 ## Inputs required
 
-
-
-
-
-
-
-- **Service endpoints or operations**: Complete list of endpoints/operations to monitor (with HTTP methods if applicable).
-- **Deployment platform**: Kubernetes, ECS, Lambda, Heroku, bare metal (determines probe types and service discovery config).
-
-### Monitoring stack inputs
-- **Metrics stack**: Prometheus + Grafana | Datadog | CloudWatch | New Relic | None (select one).
-- **Logging stack**: ELK (Elasticsearch + Logstash + Kibana) | Loki + Grafana | Datadog Logs | CloudWatch Logs | None.
-- **Tracing stack**: Jaeger | Zipkin | AWS X-Ray | Datadog APM | OpenTelemetry (select one).
-- **Alerting tool**: PagerDuty | OpsGenie | Slack | VictorOps | Email (escalation chain target).
-- **On-call rotation**: List of team members in the on-call rotation, in order of escalation.
-
-### SLO/SLA inputs
-- **Availability SLO** (e.g., 99.9%): What uptime percentage is required? Determines error budget.
-- **Latency SLO** (e.g., p95 < 200 ms): What response time is acceptable for 95% of requests?
-- **Error rate SLO** (e.g., < 0.1%): What error rate is acceptable?
-- **Throughput minimum** (e.g., ≥ 100 RPS): Minimum throughput for the service to be considered operational.
-
-### Incident response inputs
-- **Escalation chain**: Primary on-call → Secondary on-call → Engineering manager → VP Engineering (with contact info / PagerDuty IDs).
-- **Communication channels**: Incident Slack channel, status page URL, customer communication channel.
-- **Service dependencies**: What external services does this service depend on? (Used in runbook dependency checks.)
-- **Rollback procedure**: How is the service rolled back? (kubectl rollout undo, feature flag, etc.)
-- **Business impact**: What is the customer impact if this service is down? (Used for severity classification.)
+| Category | Required inputs |
+|---|---|
+| Service | name, language/runtime, type, endpoints/operations, deployment platform |
+| Monitoring stack | metrics (Prometheus/Datadog/CloudWatch), logging, tracing, alerting tool |
+| SLO/SLA | availability %, latency SLO, error rate SLO, throughput minimum |
+| Incident response | escalation chain, communication channels, service dependencies, rollback procedure |
 
 ---
 
 ## Outputs required
 
-
-
-- `instrumentation-guide.md
-
-
-
-
-- `dashboards/slo-tracking.json`: SLO/error budget burn rate dashboard.
-
-### Phase 3 outputs
-- `alerts/alert-rules.yml`: Prometheus alerting rules (or Datadog monitor configs) with SLO-based thresholds.
-- `alerts/escalation-chain.yml`: PagerDuty/OpsGenie escalation policy config.
-- `alerts/alert-silence-template.md`: Template for silencing alerts during planned maintenance.
-
-### Phase 4 outputs
-- `health-checks/readiness-probe.yml`: Kubernetes readiness probe configuration.
-- `health-checks/liveness-probe.yml`: Kubernetes liveness probe configuration.
-- `health-checks/health-endpoint-spec.md`: Specification for the `/health`, `/readiness`, and `/metrics` endpoints.
+| Phase | Outputs |
+|---|---|
+| Phase 1 | `observability-architecture.md`, `instrumentation-guide.md` |
+| Phase 2 | `dashboards/service-overview.json`, `service-details.json`, `slo-tracking.json` |
+| Phase 3 | `alerts/alert-rules.yml`, `escalation-chain.yml`, `alert-silence-template.md` |
+| Phase 4 | `health-checks/readiness-probe.yml`, `liveness-probe.yml`, `health-endpoint-spec.md` |
+| Phase 5 | `runbooks/p0-runbook.md`, `p1-runbook.md`, `p2-runbook.md`, `post-mortem-template.md`, `INCIDENT_LOG.md` |
 
-
-- `runbooks/p0-runbook.md`: Production down — all-hands incident runbook.
-- `runbooks/p1-runbook.md`: Production degraded — on-call incident runbook.
-- `runbooks/p2-runbook.md`: Partial degradation — business hours runbook.
-- `runbooks/post-mortem-template.md`: Blameless post-mortem template with 5-whys.
-- `INCIDENT_LOG.md`: Running log of all incidents (initialize with this skill, append after each incident).
+See `references/workflow-phases.md` for per-phase implementation detail.
 
 ---
 
 ## Required tests
 
-
--
--
--
-
-### Dashboard tests
-- [ ] `test/observability/dashboard-schema.test.js`: Validates generated Grafana dashboard JSON against Grafana's schema (all panels have valid datasource refs, correct query syntax).
-- [ ] `test/observability/dashboard-completeness.test.js`: Verifies dashboard has all required panels (rate, errors, duration, saturation).
-
-### Alert tests
-- [ ] `test/observability/alert-rules-valid.test.js`: Validates Prometheus alert rules YAML with `promtool check rules alert-rules.yml`.
-- [ ] `test/observability/alert-threshold-coverage.test.js`: Verifies alert rules cover all required SLO burn rate windows (1h, 6h, 24h, 3d).
-
-### Runbook tests
-- [ ] `test/observability/runbook-completeness.test.js`: Verifies each runbook has all required sections (severity definition, triage steps, escalation, resolution, post-mortem).
+- `test/observability/instrumentation.test.js` — verifies `/metrics` endpoint exports RED metrics
+- `test/observability/health-endpoint.test.js` — verifies `/health`, `/readiness`, `/liveness` schemas
+- `test/observability/structured-logging.test.js` — verifies all logs are valid JSON with required fields
+- `test/observability/dashboard-schema.test.js` — validates Grafana dashboard JSON schema
+- `test/observability/alert-rules-valid.test.js` — validates Prometheus alert rules with `promtool`
 
 All tests must pass against fixtures in `fixtures/observability/`.
 
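The SLO/SLA inputs in this hunk include an availability percentage, which sets the error budget that the alerting outputs work from. As a rough sketch, the allowed downtime is (100 - SLO) percent of the measurement window. `error_budget_minutes` is a hypothetical helper, not part of the package, and the window length in days is an assumption supplied by the caller.

```shell
#!/usr/bin/env bash
# error_budget_minutes is a hypothetical helper, not part of the package.
# Usage: error_budget_minutes <availability_percent> <window_days>
error_budget_minutes() {
  # Budget = unavailable fraction * window length in minutes.
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

error_budget_minutes 99.9 30      # 30-day window
error_budget_minutes 99.9 30.44   # average calendar month
```

With an average month of about 30.44 days, a 99.9% target works out to roughly 43.8 minutes of budget; a flat 30-day window gives 43.2 minutes.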
@@ -131,252 +73,76 @@ All tests must pass against fixtures in `fixtures/observability/`.
|
|
|
131
73
|
|
|
132
74
|
## Required fixtures
|
|
133
75
|
|
|
134
|
-
- `fixtures/observability/monitoring-config-expected.json
|
|
135
|
-
- `fixtures/observability/alert-policy-expected.json
|
|
136
|
-
- `fixtures/observability/incident-runbook-expected.json
|
|
76
|
+
- `fixtures/observability/monitoring-config-expected.json`
|
|
77
|
+
- `fixtures/observability/alert-policy-expected.json`
|
|
78
|
+
- `fixtures/observability/incident-runbook-expected.json`
|
|
137
79
|
|
|
138
80
|
---
|
|
139
81
|
|
|
140
82
|
## Required contract updates
|
|
141
83
|
|
|
142
|
-
Update
|
|
84
|
+
Update when outputs change:
|
|
85
|
+
- `contracts/observability/dashboard-schema.contract.json`
|
|
86
|
+
- `contracts/observability/alert-rule-schema.contract.json`
|
|
87
|
+
- `contracts/observability/health-endpoint.contract.json`
|
|
88
|
+
- `contracts/observability/agent-run-schema.json` (harness-level observability schema)
|
|
143
89
|
|
|
144
|
-
|
|
145
|
-
- `contracts/observability/alert-rule-schema.contract.json`: JSON Schema for alert rule configs. Update when threshold calculation logic changes.
|
|
146
|
-
- `contracts/observability/health-endpoint.contract.json`: API contract for `/health`, `/readiness`, `/liveness` endpoints. Update when health check format changes.
|
|
147
|
-
|
|
148
|
-
Contract update procedure:
|
|
149
|
-
1. Bump `version` field.
|
|
150
|
-
2. Set `changed_at` to current ISO timestamp.
|
|
151
|
-
3. Add `changelog` entry.
|
|
152
|
-
4. Re-run fixture tests.
|
|
153
|
-
5. Notify any consumers of the contract (teams using the generated configs).
|
|
90
|
+
Contract update procedure: bump `version`, set `changed_at`, add `changelog` entry, re-run fixture tests.
|
|
154
91
|
|
|
155
92
|
---
|
|
156
93
|
|
|
157
94
|
## Required codebase map updates
|
|
158
95
|
|
|
159
96
|
After completing observability setup:
|
|
160
|
-
|
|
161
|
-
|
|
162
|
-
-
|
|
163
|
-
- Add: `Runbooks: P0/P1/P2 runbooks generated for [service].`
|
|
164
|
-
|
|
165
|
-
### `.codebase/MODULE_INDEX.md`
|
|
166
|
-
- Add entries for dashboard JSON files, alert rule files, and runbook files.
|
|
167
|
-
- Add entries for any new health check endpoints added to the service.
|
|
168
|
-
|
|
169
|
-
### `observability/INCIDENT_LOG.md`
|
|
170
|
-
- Initialize with the service name, observability architecture summary, and `No incidents yet` placeholder.
|
|
97
|
+
- `.codebase/CURRENT_STATE.md`: add `Observability: [service] instrumented [date]`
|
|
98
|
+
- `.codebase/MODULE_INDEX.md`: add entries for dashboard JSON, alert rule, runbook files
|
|
99
|
+
- `observability/INCIDENT_LOG.md`: initialize with service name + `No incidents yet`
|
|
171
100
|
|
|
172
101
|
---
|
|
173
102
|
|
|
174
103
|
## Token saving rules
|
|
175
104
|
|
|
176
|
-
1.
|
|
177
|
-
2.
|
|
178
|
-
3.
|
|
179
|
-
4.
|
|
180
|
-
5.
|
|
181
|
-
6.
|
|
182
|
-
7.
|
|
105
|
+
1. If a dashboard exists, diff it against requirements — do not regenerate the whole dashboard.
|
|
106
|
+
2. Generate only runbooks for severity levels that apply (simple internal tool: P1/P2 only).
|
|
107
|
+
3. Base new alert rules on `templates/alerting-policy-template.md` — fill thresholds, not structure.
|
|
108
|
+
4. Reference `observability-architecture.md` by name in prompts, not by embedding it.
|
|
109
|
+
5. Generate all Grafana panels in one pass — do not loop back for individual panels.
|
|
110
|
+
6. Skip tracing config if no tracing stack selected.
|
|
111
|
+
7. Use compact JSON for dashboard fixtures to reduce token consumption.
|
|
183
112
|
|
|
184
113
|
---
|
|
185
114
|
|
|
186
115
|
## Acceptance criteria
|
|
187
116
|
|
|
188
|
-
|
|
189
|
-
|
|
190
|
-
### Instrumentation
|
|
191
|
-
- [ ] Service exports `/metrics` endpoint (Prometheus format) or sends metrics to the configured metrics backend.
|
|
192
|
-
- [ ] All RED metrics are present: `requests_total`, `request_duration_seconds`, `request_errors_total`.
|
|
193
|
-
- [ ] All logs are structured JSON with: `timestamp`, `level`, `service`, `trace_id`, `span_id`, `message`.
|
|
194
|
-
- [ ] Distributed traces are being collected (if tracing stack is configured).
|
|
195
|
-
|
|
196
|
-
### Dashboards
|
|
197
|
-
- [ ] Service overview dashboard exists and is deployed to the monitoring stack.
|
|
198
|
-
- [ ] Dashboard shows: Rate (RPS), Errors (error rate %), Duration (p50/p95/p99), Saturation (CPU, memory, connection pool).
|
|
199
|
-
- [ ] SLO tracking panel shows current error budget remaining.
|
|
200
|
-
- [ ] Dashboard is linked from the service's README or internal wiki.
|
|
201
|
-
|
|
202
|
-
### Alerts
|
|
203
|
-
- [ ] SLO burn rate alerts exist for fast burn (1h/6h windows) and slow burn (24h/3d windows).
|
|
204
|
-
- [ ] All alerts have: `severity` label, `runbook_url` annotation, `description` annotation.
|
|
205
|
-
- [ ] Escalation chain is configured in the alerting tool (PagerDuty/OpsGenie).
|
|
206
|
-
- [ ] At least one alert has been test-fired to verify the escalation chain works end-to-end.
|
|
207
|
-
|
|
208
|
-
### Health checks
|
|
209
|
-
- [ ] `/health` endpoint exists and returns `{"status": "ok"}` with HTTP 200 when healthy.
|
|
210
|
-
- [ ] `/readiness` probe is configured in Kubernetes (or equivalent).
|
|
211
|
-
- [ ] `/liveness` probe is configured with appropriate failure thresholds.
|
|
212
|
-
|
|
213
|
-
### Runbooks
|
|
214
|
-
- [ ] P1 runbook exists and covers: detection, triage, escalation, resolution, post-mortem.
|
|
215
|
-
- [ ] Runbook is linked in all alert `runbook_url` annotations.
|
|
216
|
-
- [ ] Runbook has been reviewed by the on-call team and is accessible without production access (stored in wiki or repo).
|
|
217
|
-
- [ ] Post-mortem template is ready for use.
|
|
218
|
-
|
|
219
|
-
---
|
|
220
|
-
|
|
221
|
-
## Common mistakes
|
|
117
|
+
**Instrumentation**: `/metrics` endpoint exports `requests_total`, `request_duration_seconds`, `request_errors_total`. Logs are structured JSON with `timestamp`, `level`, `service`, `trace_id`.
|
|
222
118
|
|
|
223
|
-
|
|
224
|
-
**Problem**: Alert fires on "CPU > 80%" but CPU being high is a symptom, not a cause. On-call engineer doesn't know what to do.
|
|
225
|
-
**Fix**: Alert on user-facing symptoms (error rate, latency) and provide runbooks that help diagnose the underlying cause. Pair symptom alerts with diagnostic links.
|
|
119
|
+
**Dashboards**: RED metrics (Rate, Errors, Duration) + Saturation panels. SLO tracking panel shows error budget remaining.
|
|
226
120
|
|
|
227
|
-
|
|
228
|
-
**Problem**: 50 alerts firing every week, most of which are noise. On-call engineers start ignoring alerts ("boy who cried wolf").
|
|
229
|
-
**Fix**: Start with only SLO-based alerts. Achieve < 5 actionable alerts per week. Every alert must have a runbook and a clear action. Regularly review false positive rates.
|
|
121
|
+
**Alerts**: SLO burn rate alerts for 1h/6h/24h/3d windows. All alerts have `severity`, `runbook_url`, `description`. Escalation chain test-fired end-to-end.
|
|
230
122
|
|
|
231
|
-
|
|
232
|
-
**Problem**: Dashboard shows a graph going up but no reference line to know if that's good or bad.
|
|
233
|
-
**Fix**: Every metric panel must have a reference line or annotation showing the SLA target, the previous week's baseline, or an absolute threshold. "Is this normal?" should be answerable from the dashboard alone.
|
|
123
|
+
**Health checks**: `/health` → 200 OK. `/readiness` → checks actual dependencies. K8s probes configured.
|
|
234
124
|
|
|
235
|
-
|
|
236
|
-
**Problem**: Alert only fires when error rate > 5% for 5 minutes. A slow 0.5% burn for 48 hours exhausts the entire monthly error budget without triggering any alert.
|
|
237
|
-
**Fix**: Implement multi-window alerting: fast burn (≥ 2% in 1 hour), medium burn (≥ 5% in 6 hours), slow burn (≥ 10% in 3 days). Cover both fast and slow failures.
|
|
125
|
+
**Runbooks**: P1 runbook covers detection/triage/escalation/resolution/post-mortem. Linked from alert `runbook_url`. Reviewed by on-call team.
|
|
238
126
|
|
|
239
|
-
|
|
240
|
-
**Problem**: Runbook says "check the logs" without specifying where, how, or what to look for. New on-call engineer is lost.
|
|
241
|
-
**Fix**: Write runbooks for the most junior person on the rotation. Include exact commands to run, exact queries to execute, and exact thresholds that indicate each diagnosis. Link to the monitoring dashboard directly.
|
|
127
|
+
---
|
|
242
128
|
|
|
243
|
-
|
|
244
|
-
**Problem**: `/health` endpoint returns HTTP 200 even when the database is unreachable. Kubernetes load balancer continues routing traffic to a broken pod.
|
|
245
|
-
**Fix**: Health check must verify actual service dependencies. `/readiness` should check DB connectivity, cache connectivity, and any critical downstream dependencies. Return 503 if any dependency is unhealthy.
|
|
129
|
+
## Common mistakes
|
|
246
130
|
|
|
247
|
-
|
|
248
|
-
**Problem**: Monitoring is only set up for production. Issues are invisible in staging and only discovered after production deployment.
|
|
249
|
-
**Fix**: Deploy the same observability stack in staging. Run integration tests against the `/metrics` endpoint. Validate alert rules in staging before production.
|
|
131
|
+
See `references/common-mistakes-and-recovery.md` for the full list of 8 mistakes with fixes.
|
|
250
132
|
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
|
|
133
|
+
**Top 3**:
|
|
134
|
+
- M1: Alerting on symptoms (CPU high) not user-facing signals (error rate, latency)
|
|
135
|
+
- M4: Missing multi-window alerting — slow burn exhausts budget without triggering alert
|
|
136
|
+
- M6: Health checks always returning 200 even when dependencies are down
|
|
254
137
|
|
|
255
138
|
---
|
|
256
139
|
|
|
257
140
|
## Recovery workflow
|
|
258
141
|
|
|
259
|
-
|
|
260
|
-
```
|
|
261
|
-
Symptom: Dashboard shows "No data" for all panels.
|
|
262
|
-
Step 1: Verify service is running: kubectl get pods -n [namespace]
|
|
263
|
-
Step 2: Check /metrics endpoint directly: curl http://service-host:9090/metrics | grep http_requests_total
|
|
264
|
-
Step 3: Check Prometheus scrape config: kubectl describe servicemonitor [name] -n monitoring
|
|
265
|
-
Step 4: Check Prometheus targets: open Prometheus UI → Status → Targets → look for service in targets list
|
|
266
|
-
Step 5: If service is in targets but metrics missing: check instrumentation code (is metrics library initialized before first request?)
|
|
267
|
-
Step 6: If service is NOT in targets: check Prometheus scrape config selector labels match service labels
|
|
268
|
-
```
|
|
269
|
-
|
|
270
|
-
### Recovery 2: Alert not firing when it should
|
|
271
|
-
```
|
|
272
|
-
Symptom: Error rate is clearly high but no alert fired.
|
|
273
|
-
Step 1: Check Prometheus alert rule with: promtool check rules alert-rules.yml
|
|
274
|
-
Step 2: Evaluate the alert expression manually in Prometheus: Prometheus UI → Graph → paste alert expression
|
|
275
|
-
Step 3: Check alert state: Prometheus UI → Alerts → find the alert rule → check if it's in "Inactive" state
|
|
276
|
-
Step 4: If alert is firing but not notifying: check Alertmanager config and routing rules
|
|
277
|
-
Step 5: Check Alertmanager status: kubectl exec -n monitoring alertmanager-pod -- amtool alert
|
|
278
|
-
Step 6: Test escalation manually: send a test alert through PagerDuty/OpsGenie UI
|
|
279
|
-
```
|
|
280
|
-
|
|
281
|
-
### Recovery 3: Health check causing false pod restarts
|
|
282
|
-
```
|
|
283
|
-
Symptom: Pods are being killed by liveness probe even though service is working.
|
|
284
|
-
Step 1: Check liveness probe config: kubectl describe pod [pod-name] | grep -A 10 Liveness
|
|
285
|
-
Step 2: Check if probe timeout is too short (default is 1s — increase if health check queries DB)
|
|
286
|
-
Step 3: Check if failure threshold is too low (default is 3 consecutive failures — may be too aggressive)
|
|
287
|
-
Step 4: Check /health endpoint response time under load: does it exceed liveness probe timeout?
|
|
288
|
-
Step 5: Fix: increase timeoutSeconds, increase failureThreshold, or optimize the health check endpoint
|
|
289
|
-
Step 6: Recommended safe config: initialDelaySeconds: 30, timeoutSeconds: 5, failureThreshold: 5
|
|
290
|
-
```
|
|
291
|
-
|
|
292
|
-
### Recovery 4: Runbook is wrong or out of date
|
|
293
|
-
```
|
|
294
|
-
Symptom: On-call engineer followed runbook but steps don't work or are inaccurate.
|
|
295
|
-
Step 1: Immediately annotate the incorrect step with [OUTDATED: <brief note>] so next person is warned.
|
|
296
|
-
Step 2: After incident is resolved, open a PR to correct the runbook.
|
|
297
|
-
Step 3: Re-run the corrected runbook steps in a test environment to verify they work.
|
|
298
|
-
Step 4: Add incident to post-mortem action items: "Update runbook [name] step [N]."
|
|
299
|
-
Step 5: Assign runbook review to the on-call engineer who caught the error.
|
|
300
|
-
```
|
|
301
|
-
|
|
302
|
-
---
|
|
303
|
-
|
|
304
|
-
## Workflow Detail: Phase-by-Phase Execution
|
|
305
|
-
|
|
306
|
-
### Phase 1: Observability Architecture Generation
|
|
307
|
-
|
|
308
|
-
**Goal**: Design and document the complete observability topology before writing any configuration.
|
|
309
|
-
|
|
310
|
-
**Architecture components to define:**
|
|
311
|
-
|
|
312
|
-
| Pillar | Component | Purpose |
|
|
313
|
-
|--------|-----------|---------|
|
|
314
|
-
| Metrics | Prometheus / Datadog Agent | Scrape and store numeric time-series |
|
|
315
|
-
| Metrics | Grafana / Datadog Dashboards | Visualize and alert on metrics |
|
|
316
|
-
| Logs | Structured logging library | Produce machine-readable log events |
|
|
317
|
-
| Logs | Log aggregator (Loki/ELK/CloudWatch) | Collect and index logs |
|
|
318
|
-
| Logs | Kibana/Grafana/Datadog | Search and visualize logs |
|
|
319
|
-
| Traces | OpenTelemetry SDK | Instrument service for tracing |
|
|
320
|
-
| Traces | Jaeger/Zipkin/Datadog APM | Collect and visualize traces |
|
|
321
|
-
|
|
322
|
-
**Service instrumentation requirements (by language):**
|
|
323
|
-
- Node.js: `prom-client` (metrics), `winston` or `pino` (structured logs), `@opentelemetry/sdk-node` (traces).
|
|
324
|
-
- Python: `prometheus_client` (metrics), `structlog` or `python-json-logger` (logs), `opentelemetry-sdk` (traces).
|
|
325
|
-
- Go: `prometheus/client_golang` (metrics), `zap` or `logrus` (logs), `go.opentelemetry.io/otel` (traces).
|
|
326
|
-
|
|
327
|
-
### Phase 2: Dashboard Generation
|
|
328
|
-
|
|
329
|
-
**Required panels for every service dashboard:**
|
|
330
|
-
|
|
331
|
-
RED metrics (the minimum viable dashboard for any service):
|
|
332
|
-
- **Rate**: Requests per second (total and per endpoint).
|
|
333
|
-
- **Errors**: Error rate percentage (4xx and 5xx separately).
|
|
334
|
-
- **Duration**: Response time as a histogram with p50, p95, p99 lines.
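As a rough illustration of what the three RED panels compute, the sketch below derives rate, error percentages, and latency percentiles from a batch of request samples. The `Request` type, the `red_summary` name, and the nearest-rank percentile method are assumptions made for this example; a real dashboard would normally get these values from Prometheus or Datadog queries rather than application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # response time in milliseconds


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one observation window."""
    if not requests:
        raise ValueError("no requests in window")
    total = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    count_4xx = sum(1 for r in requests if 400 <= r.status < 500)
    count_5xx = sum(1 for r in requests if 500 <= r.status < 600)
    return {
        "rate_rps": total / window_seconds,        # Rate
        "error_4xx_pct": 100 * count_4xx / total,  # Errors, client side
        "error_5xx_pct": 100 * count_5xx / total,  # Errors, server side
        "p50_ms": percentile(durations, 50),       # Duration
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Keeping 4xx and 5xx as separate fields mirrors the requirement above that the two error classes are reported separately.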
|
|
335
|
-
|
|
336
|
-
SATURATION metrics (resource utilization):
|
|
337
|
-
- **CPU**: Process CPU utilization %.
|
|
338
|
-
- **Memory**: Heap and RSS memory.
|
|
339
|
-
- **Connection pool**: Active connections vs. pool limit.
|
|
340
|
-
- **Queue depth**: (For background workers) — job queue length.
|
|
341
|
-
|
|
342
|
-
See `templates/monitoring-dashboard-template.md` for complete Grafana JSON scaffold.
|
|
343
|
-
|
|
344
|
-
### Phase 3: Alerting Policy Generation
|
|
345
|
-
|
|
346
|
-
**SLO-based alert threshold calculation:**
|
|
347
|
-
|
|
348
|
-
For a 99.9% availability SLO (monthly error budget = 43.8 minutes):
|
|
349
|
-
|
|
350
|
-
```
|
|
351
|
-
Fast burn alert (1h window):
|
|
352
|
-
Threshold: error_rate > 2% for 1 hour
|
|
353
|
-
Reason: 2% error rate is 20× the 0.1% allowed rate, so it burns about 2.7% of the monthly budget per hour = exhausted in about 36.5 hours
|
|
354
|
-
Action: Page on-call immediately (P1)
|
|
355
|
-
|
|
356
|
-
Medium burn alert (6h window):
|
|
357
|
-
Threshold: error_rate > 0.5% for 6 hours
|
|
358
|
-
Reason: 0.5% is 5× the allowed rate; 5 × 6h ÷ 730h ≈ 4.1% of monthly budget consumed
|
|
359
|
-
Action: Page on-call (P2 — business hours response acceptable)
|
|
360
|
-
|
|
361
|
-
Slow burn alert (3d window):
|
|
362
|
-
Threshold: error_rate > 0.1% for 72 hours
|
|
363
|
-
Reason: 0.1% is 1× the allowed rate; 72h ÷ 730h ≈ 9.9% of monthly budget consumed silently
|
|
364
|
-
Action: Slack notification + ticket creation (investigate next sprint)
|
|
365
|
-
```
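The budget arithmetic behind these tiers can be checked with a few lines of Python. This is a minimal sketch under two assumptions that are not stated in the original: a month is taken as 730 hours (which is what makes the 99.9% budget come out at 43.8 minutes), and "burn rate" means the observed error rate divided by the error rate the SLO allows. The function names are illustrative only.

```python
HOURS_PER_MONTH = 730  # assumption: 365 days / 12 months


def budget_minutes(slo):
    """Monthly error budget in minutes for an availability SLO such as 0.999."""
    return (1 - slo) * HOURS_PER_MONTH * 60


def burn_rate(error_rate, slo):
    """How many times faster than allowed the budget is being spent."""
    return error_rate / (1 - slo)


def budget_consumed(error_rate, slo, window_hours):
    """Fraction of the monthly budget used if error_rate holds for window_hours."""
    return burn_rate(error_rate, slo) * window_hours / HOURS_PER_MONTH


def hours_to_exhaustion(error_rate, slo):
    """Hours until the whole monthly budget is gone at a constant error_rate."""
    return HOURS_PER_MONTH / burn_rate(error_rate, slo)
```

Under this definition, a sustained 2% error rate against a 99.9% SLO is roughly a 20x burn and empties the budget in about 36.5 hours, while 0.5% held for 6 hours uses about 4% of it.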
|
|
366
|
-
|
|
367
|
-
See `templates/alerting-policy-template.md` for complete Prometheus alerting rules.
|
|
368
|
-
|
|
369
|
-
### Phase 4: Health Check Automation
|
|
370
|
-
|
|
371
|
-
**Standard health endpoint specification:**
|
|
372
|
-
- `GET /health` → 200 OK always (used by load balancers for basic routing).
|
|
373
|
-
- `GET /readiness` → 200 if all dependencies healthy, 503 if any dependency unhealthy.
|
|
374
|
-
- `GET /liveness` → 200 if process is alive and event loop is not stuck, 503 if deadlocked.
|
|
375
|
-
- `GET /metrics` → Prometheus text format metrics.
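The endpoint contract above can be expressed as a small, framework-independent routing function. The sketch below is only an illustration: `health_response`, the `dependencies_ok` mapping, and the `event_loop_ok` flag are hypothetical names, and how a service actually detects a stuck event loop or an unhealthy dependency is left out.

```python
from http import HTTPStatus


def health_response(path, dependencies_ok, event_loop_ok, metrics_text=""):
    """Return (status, body) for the four standard endpoints.

    dependencies_ok: mapping of dependency name -> bool (hypothetical input).
    event_loop_ok:   False when the process is considered deadlocked.
    """
    if path == "/health":
        # Basic routing check: always 200 while the process can answer.
        return HTTPStatus.OK, "ok"
    if path == "/readiness":
        failed = sorted(name for name, ok in dependencies_ok.items() if not ok)
        if failed:
            return HTTPStatus.SERVICE_UNAVAILABLE, "unhealthy: " + ", ".join(failed)
        return HTTPStatus.OK, "ready"
    if path == "/liveness":
        if not event_loop_ok:
            return HTTPStatus.SERVICE_UNAVAILABLE, "deadlocked"
        return HTTPStatus.OK, "alive"
    if path == "/metrics":
        # Body is expected to be Prometheus text format produced elsewhere.
        return HTTPStatus.OK, metrics_text
    return HTTPStatus.NOT_FOUND, "not found"
```

Keeping the decision logic in a pure function like this makes the 200/503 rules easy to unit test before wiring them into whichever HTTP framework the service uses.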
|
|
376
|
-
|
|
377
|
-
### Phase 5: Incident Response Runbook Generation
|
|
378
|
-
|
|
379
|
-
**Runbook structure requirements:**
|
|
380
|
-
Every runbook must have: Severity definition, Detection signals, Triage steps (ordered, with commands), Escalation triggers, Resolution steps, Rollback procedure, Communication templates, Post-mortem checklist.
|
|
142
|
+
See `references/common-mistakes-and-recovery.md` for the full recovery playbooks R1–R4.
|
|
381
143
|
|
|
382
|
-
|
|
144
|
+
**Quick reference**:
|
|
145
|
+
- R1: Metrics missing → check `/metrics` endpoint, Prometheus targets, scrape config labels
|
|
146
|
+
- R2: Alert not firing → `promtool check rules`, check Alertmanager routing
|
|
147
|
+
- R3: False pod restarts → increase `timeoutSeconds` and `failureThreshold` in liveness probe
|
|
148
|
+
- R4: Runbook outdated → annotate `[OUTDATED]`, fix post-incident, re-run in test environment
|
package/.codex/skills/genesis-observability-automation/references/common-mistakes-and-recovery.md
ADDED
|
@@ -0,0 +1,84 @@
|
|
|
1
|
+
# Observability — Common Mistakes & Recovery Playbook
|
|
2
|
+
|
|
3
|
+
Detailed reference for common mistakes and how to recover from them.
|
|
4
|
+
Referenced from `genesis-observability-automation/SKILL.md` → `## Common mistakes` and `## Recovery workflow`.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## Common Mistakes
|
|
9
|
+
|
|
10
|
+
### M1: Alerting on symptoms without causes
|
|
11
|
+
An alert fires on "CPU > 80%", but high CPU is a symptom, not a cause.
|
|
12
|
+
**Fix**: Alert on user-facing symptoms (error rate, latency). Pair them with diagnostic runbooks.
|
|
13
|
+
|
|
14
|
+
### M2: Alert fatigue from too many low-quality alerts
|
|
15
|
+
50 alerts/week, most of them noise. On-call engineers start ignoring them.
|
|
16
|
+
**Fix**: Use only SLO-based alerts. Target fewer than 5 actionable alerts/week.
|
|
17
|
+
|
|
18
|
+
### M3: Dashboards without context
|
|
19
|
+
The graph goes up, but nobody knows whether that is good or bad.
|
|
20
|
+
**Fix**: Every metric panel must have a reference line or an SLA target annotation.
|
|
21
|
+
|
|
22
|
+
### M4: Missing "long tail" in alerting windows
|
|
23
|
+
The alert fires only when the error rate exceeds 5% for 5 minutes. A slow burn of 0.5% over 48 hours exhausts the budget silently.
|
|
24
|
+
**Fix**: Multi-window alerting: fast burn (1h), medium burn (6h), slow burn (3d).
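To make the multi-window idea concrete, the sketch below evaluates three windows over a series of hourly error rates and reports which tiers would fire. The window lengths come from the fix above; the thresholds (2%, 0.5%, 0.1%) and the choice to compare the window average against the threshold are assumptions for illustration, and real rules would normally be written as Prometheus alerting expressions rather than Python.

```python
# (tier name, window in hours, error-rate threshold); thresholds are assumed values
WINDOWS = [
    ("fast", 1, 0.02),
    ("medium", 6, 0.005),
    ("slow", 72, 0.001),
]


def firing_alerts(hourly_error_rates):
    """Return the tiers whose window-average error rate exceeds the threshold.

    hourly_error_rates: one average error rate per hour, most recent last.
    """
    firing = []
    for name, hours, threshold in WINDOWS:
        window = hourly_error_rates[-hours:]
        if len(window) < hours:
            continue  # not enough history to judge this window yet
        average = sum(window) / len(window)
        if average > threshold:
            firing.append(name)
    return firing
```

A steady 0.2% error rate for three days trips only the slow tier, which is exactly the case a single short-window alert would miss.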
|
|
25
|
+
|
|
26
|
+
### M5: Runbooks only the author can follow
|
|
27
|
+
The runbook says "check the logs" without saying where or how.
|
|
28
|
+
**Fix**: Write for the most junior person on the team. Include exact commands, queries, thresholds.
|
|
29
|
+
|
|
30
|
+
### M6: Health checks that always return 200
|
|
31
|
+
`/health` returns 200 even when the DB is unreachable. K8s keeps routing traffic to the broken pod.
|
|
32
|
+
**Fix**: Health checks must verify actual dependencies. Return 503 if a dependency is unhealthy.
|
|
33
|
+
|
|
34
|
+
### M7: Observability only in production
|
|
35
|
+
Monitoring exists only in prod, so issues are invisible in staging.
|
|
36
|
+
**Fix**: Deploy the same observability stack in staging. Run integration tests against the `/metrics` endpoint.
|
|
37
|
+
|
|
38
|
+
### M8: Missing trace context in logs
|
|
39
|
+
Logs have no `trace_id`/`span_id`, so a request cannot be correlated across microservices.
|
|
40
|
+
**Fix**: Inject `trace_id` and `span_id` into every log line via OpenTelemetry.
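One way to get both IDs onto every log line is a logging filter that stamps each record before it is formatted. The sketch below uses only the Python standard library so that it runs anywhere; the `current_trace` context variable is a hypothetical stand-in for the active span, and in a real service the two values would be read from the OpenTelemetry SDK instead of being set by hand.

```python
import contextvars
import json
import logging

# Hypothetical stand-in for the active trace context: (trace_id, span_id).
current_trace = contextvars.ContextVar("current_trace", default=("", ""))


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every record that passes through."""

    def filter(self, record):
        record.trace_id, record.span_id = current_trace.get()
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


def build_logger(name, stream):
    """Create a logger that writes trace-aware JSON lines to stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

Attaching the filter to the handler rather than to individual call sites means application code keeps calling `logger.info(...)` as usual and still gets correlated output.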
|
|
41
|
+
|
|
42
|
+
---
|
|
43
|
+
|
|
44
|
+
## Recovery Playbook
|
|
45
|
+
|
|
46
|
+
### R1: Metrics not appearing in dashboard
|
|
47
|
+
```
|
|
48
|
+
Symptom: Dashboard shows "No data" for all panels.
|
|
49
|
+
1. Verify service running: kubectl get pods -n [namespace]
|
|
50
|
+
2. Check /metrics directly: curl http://service:9090/metrics | grep http_requests_total
|
|
51
|
+
3. Check Prometheus scrape config: kubectl describe servicemonitor [name] -n monitoring
|
|
52
|
+
4. Check Prometheus targets: Prometheus UI → Status → Targets
|
|
53
|
+
5. If in targets but missing: check instrumentation code init order
|
|
54
|
+
6. If NOT in targets: check scrape config selector labels
|
|
55
|
+
```
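Step 2 of R1 can also be automated in a staging test. The sketch below is a stricter version of the `grep` check: it scans Prometheus text-format output for a sample line of a given metric while ignoring `# HELP` and `# TYPE` comments. The function name is an assumption for this example, and fetching the text from the service is deliberately left out.

```python
def metric_present(exposition_text, metric_name):
    """True if at least one sample line for metric_name exists.

    exposition_text: body of a /metrics response in Prometheus text format.
    """
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric_name:
            return True
    return False
```

Unlike a plain `grep`, this does not report success when the metric is declared in a comment but no sample has been emitted yet, which is a common sign of an instrumentation init-order problem.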
|
|
56
|
+
|
|
57
|
+
### R2: Alert not firing when it should
|
|
58
|
+
```
|
|
59
|
+
Symptom: Error rate clearly high but no alert.
|
|
60
|
+
1. Validate rules: promtool check rules alert-rules.yml
|
|
61
|
+
2. Evaluate expression manually: Prometheus UI → Graph
|
|
62
|
+
3. Check alert state: Prometheus UI → Alerts
|
|
63
|
+
4. If firing but not notifying: check Alertmanager routing
|
|
64
|
+
5. Check Alertmanager: kubectl exec alertmanager -- amtool alert
|
|
65
|
+
6. Test escalation manually via PagerDuty/OpsGenie UI
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
### R3: Health check causing false pod restarts
|
|
69
|
+
```
|
|
70
|
+
Symptom: Pods killed by liveness probe though service works.
|
|
71
|
+
1. Check probe: kubectl describe pod [pod] | grep -A 10 Liveness
|
|
72
|
+
2. Increase timeout if probe queries DB (default 1s often too short)
|
|
73
|
+
3. Increase failureThreshold if too aggressive (default 3)
|
|
74
|
+
4. Recommended safe config: initialDelaySeconds:30, timeoutSeconds:5, failureThreshold:5
|
|
75
|
+
```
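The effect of the probe settings can be estimated before changing a manifest. The sketch below models a liveness probe with the Kubernetes defaults and gives an approximate figure for how long a pod must keep failing before it is restarted. `periodSeconds` is not part of the recommendation above, so the default of 10 is assumed, and the calculation is a simplification that ignores probe scheduling jitter.

```python
from dataclasses import dataclass


@dataclass
class LivenessProbe:
    # Defaults mirror the Kubernetes probe defaults.
    initial_delay_seconds: int = 0
    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3


def seconds_of_failure_before_restart(probe):
    """Approximate time of consecutive failures that triggers a restart."""
    return probe.failure_threshold * probe.period_seconds


def single_probe_fails(probe, health_latency_seconds):
    """True if a health check this slow would be counted as a failure."""
    return health_latency_seconds > probe.timeout_seconds


DEFAULT = LivenessProbe()
RECOMMENDED = LivenessProbe(
    initial_delay_seconds=30,
    timeout_seconds=5,
    failure_threshold=5,
)
```

With the defaults, a health check that takes 2.5 seconds because it queries the database counts as a failure on every probe; with the recommended values the same check passes, and the pod tolerates roughly 50 seconds of real failures instead of 30.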
|
|
76
|
+
|
|
77
|
+
### R4: Runbook out of date
|
|
78
|
+
```
|
|
79
|
+
Symptom: On-call followed runbook but steps don't work.
|
|
80
|
+
1. Annotate incorrect step immediately: [OUTDATED: <brief note>]
|
|
81
|
+
2. After incident resolved, open PR to correct
|
|
82
|
+
3. Re-run corrected steps in test environment
|
|
83
|
+
4. Add to post-mortem: "Update runbook [name] step [N]"
|
|
84
|
+
```
|