hatch3r 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (132)
  1. package/LICENSE +21 -0
  2. package/README.md +437 -0
  3. package/agents/hatch3r-a11y-auditor.md +126 -0
  4. package/agents/hatch3r-architect.md +160 -0
  5. package/agents/hatch3r-ci-watcher.md +123 -0
  6. package/agents/hatch3r-context-rules.md +97 -0
  7. package/agents/hatch3r-dependency-auditor.md +164 -0
  8. package/agents/hatch3r-devops.md +138 -0
  9. package/agents/hatch3r-docs-writer.md +97 -0
  10. package/agents/hatch3r-implementer.md +162 -0
  11. package/agents/hatch3r-learnings-loader.md +108 -0
  12. package/agents/hatch3r-lint-fixer.md +104 -0
  13. package/agents/hatch3r-perf-profiler.md +123 -0
  14. package/agents/hatch3r-researcher.md +642 -0
  15. package/agents/hatch3r-reviewer.md +81 -0
  16. package/agents/hatch3r-security-auditor.md +119 -0
  17. package/agents/hatch3r-test-writer.md +134 -0
  18. package/commands/hatch3r-agent-customize.md +146 -0
  19. package/commands/hatch3r-api-spec.md +49 -0
  20. package/commands/hatch3r-benchmark.md +50 -0
  21. package/commands/hatch3r-board-fill.md +504 -0
  22. package/commands/hatch3r-board-init.md +315 -0
  23. package/commands/hatch3r-board-pickup.md +672 -0
  24. package/commands/hatch3r-board-refresh.md +198 -0
  25. package/commands/hatch3r-board-shared.md +369 -0
  26. package/commands/hatch3r-bug-plan.md +410 -0
  27. package/commands/hatch3r-codebase-map.md +1182 -0
  28. package/commands/hatch3r-command-customize.md +94 -0
  29. package/commands/hatch3r-context-health.md +112 -0
  30. package/commands/hatch3r-cost-tracking.md +139 -0
  31. package/commands/hatch3r-dep-audit.md +171 -0
  32. package/commands/hatch3r-feature-plan.md +379 -0
  33. package/commands/hatch3r-healthcheck.md +307 -0
  34. package/commands/hatch3r-hooks.md +282 -0
  35. package/commands/hatch3r-learn.md +217 -0
  36. package/commands/hatch3r-migration-plan.md +51 -0
  37. package/commands/hatch3r-onboard.md +56 -0
  38. package/commands/hatch3r-project-spec.md +1153 -0
  39. package/commands/hatch3r-recipe.md +179 -0
  40. package/commands/hatch3r-refactor-plan.md +426 -0
  41. package/commands/hatch3r-release.md +328 -0
  42. package/commands/hatch3r-roadmap.md +556 -0
  43. package/commands/hatch3r-rule-customize.md +114 -0
  44. package/commands/hatch3r-security-audit.md +370 -0
  45. package/commands/hatch3r-skill-customize.md +93 -0
  46. package/commands/hatch3r-workflow.md +377 -0
  47. package/dist/cli/hooks-ZOTFDEA3.js +59 -0
  48. package/dist/cli/index.d.ts +2 -0
  49. package/dist/cli/index.js +3584 -0
  50. package/github-agents/hatch3r-docs-agent.md +46 -0
  51. package/github-agents/hatch3r-lint-agent.md +41 -0
  52. package/github-agents/hatch3r-security-agent.md +54 -0
  53. package/github-agents/hatch3r-test-agent.md +66 -0
  54. package/hooks/hatch3r-ci-failure.md +10 -0
  55. package/hooks/hatch3r-file-save.md +11 -0
  56. package/hooks/hatch3r-post-merge.md +10 -0
  57. package/hooks/hatch3r-pre-commit.md +11 -0
  58. package/hooks/hatch3r-pre-push.md +10 -0
  59. package/hooks/hatch3r-session-start.md +10 -0
  60. package/mcp/mcp.json +62 -0
  61. package/package.json +84 -0
  62. package/prompts/hatch3r-bug-triage.md +155 -0
  63. package/prompts/hatch3r-code-review.md +131 -0
  64. package/prompts/hatch3r-pr-description.md +173 -0
  65. package/rules/hatch3r-accessibility-standards.md +77 -0
  66. package/rules/hatch3r-accessibility-standards.mdc +75 -0
  67. package/rules/hatch3r-agent-orchestration.md +160 -0
  68. package/rules/hatch3r-api-design.md +176 -0
  69. package/rules/hatch3r-api-design.mdc +176 -0
  70. package/rules/hatch3r-browser-verification.md +73 -0
  71. package/rules/hatch3r-browser-verification.mdc +73 -0
  72. package/rules/hatch3r-ci-cd.md +70 -0
  73. package/rules/hatch3r-ci-cd.mdc +68 -0
  74. package/rules/hatch3r-code-standards.md +102 -0
  75. package/rules/hatch3r-code-standards.mdc +100 -0
  76. package/rules/hatch3r-component-conventions.md +102 -0
  77. package/rules/hatch3r-component-conventions.mdc +102 -0
  78. package/rules/hatch3r-data-classification.md +85 -0
  79. package/rules/hatch3r-data-classification.mdc +83 -0
  80. package/rules/hatch3r-dependency-management.md +17 -0
  81. package/rules/hatch3r-dependency-management.mdc +15 -0
  82. package/rules/hatch3r-error-handling.md +17 -0
  83. package/rules/hatch3r-error-handling.mdc +15 -0
  84. package/rules/hatch3r-feature-flags.md +112 -0
  85. package/rules/hatch3r-feature-flags.mdc +112 -0
  86. package/rules/hatch3r-git-conventions.md +47 -0
  87. package/rules/hatch3r-git-conventions.mdc +45 -0
  88. package/rules/hatch3r-i18n.md +90 -0
  89. package/rules/hatch3r-i18n.mdc +90 -0
  90. package/rules/hatch3r-learning-consult.md +29 -0
  91. package/rules/hatch3r-learning-consult.mdc +27 -0
  92. package/rules/hatch3r-migrations.md +17 -0
  93. package/rules/hatch3r-migrations.mdc +15 -0
  94. package/rules/hatch3r-observability.md +165 -0
  95. package/rules/hatch3r-observability.mdc +165 -0
  96. package/rules/hatch3r-performance-budgets.md +109 -0
  97. package/rules/hatch3r-performance-budgets.mdc +109 -0
  98. package/rules/hatch3r-secrets-management.md +76 -0
  99. package/rules/hatch3r-secrets-management.mdc +74 -0
  100. package/rules/hatch3r-security-patterns.md +211 -0
  101. package/rules/hatch3r-security-patterns.mdc +211 -0
  102. package/rules/hatch3r-testing.md +89 -0
  103. package/rules/hatch3r-testing.mdc +87 -0
  104. package/rules/hatch3r-theming.md +51 -0
  105. package/rules/hatch3r-theming.mdc +51 -0
  106. package/rules/hatch3r-tooling-hierarchy.md +92 -0
  107. package/rules/hatch3r-tooling-hierarchy.mdc +79 -0
  108. package/skills/hatch3r-a11y-audit/SKILL.md +131 -0
  109. package/skills/hatch3r-agent-customize/SKILL.md +75 -0
  110. package/skills/hatch3r-api-spec/SKILL.md +66 -0
  111. package/skills/hatch3r-architecture-review/SKILL.md +96 -0
  112. package/skills/hatch3r-bug-fix/SKILL.md +129 -0
  113. package/skills/hatch3r-ci-pipeline/SKILL.md +76 -0
  114. package/skills/hatch3r-command-customize/SKILL.md +67 -0
  115. package/skills/hatch3r-context-health/SKILL.md +76 -0
  116. package/skills/hatch3r-cost-tracking/SKILL.md +65 -0
  117. package/skills/hatch3r-dep-audit/SKILL.md +82 -0
  118. package/skills/hatch3r-feature/SKILL.md +129 -0
  119. package/skills/hatch3r-gh-agentic-workflows/SKILL.md +150 -0
  120. package/skills/hatch3r-incident-response/SKILL.md +86 -0
  121. package/skills/hatch3r-issue-workflow/SKILL.md +139 -0
  122. package/skills/hatch3r-logical-refactor/SKILL.md +73 -0
  123. package/skills/hatch3r-migration/SKILL.md +76 -0
  124. package/skills/hatch3r-perf-audit/SKILL.md +114 -0
  125. package/skills/hatch3r-pr-creation/SKILL.md +85 -0
  126. package/skills/hatch3r-qa-validation/SKILL.md +86 -0
  127. package/skills/hatch3r-recipe/SKILL.md +67 -0
  128. package/skills/hatch3r-refactor/SKILL.md +86 -0
  129. package/skills/hatch3r-release/SKILL.md +93 -0
  130. package/skills/hatch3r-rule-customize/SKILL.md +70 -0
  131. package/skills/hatch3r-skill-customize/SKILL.md +67 -0
  132. package/skills/hatch3r-visual-refactor/SKILL.md +89 -0
@@ -0,0 +1,165 @@
---
description: Logging, metrics, and tracing conventions for the project
alwaysApply: false
---
# Observability

## Structured Logging

- Use structured JSON logging. No `console.log` in production code.
- Log levels: `error` (failures), `warn` (degraded), `info` (state changes), `debug` (dev only).
- Every log entry includes `correlationId` and `userId` (if available).
- Never log secrets, PII, tokens, passwords, or sensitive content.
- Instrument key operations with timing metrics. Serverless functions log execution time and outcome.
- Client-side: log errors to a sink (e.g., error reporting service), not just `console.error`.
- Prefer event-based metrics over polling. Trace user flows end-to-end with `correlationId`.
- Respect performance budgets: logging must not add > 10ms latency to hot paths.
- Include `service`, `environment`, and `version` fields in every log entry for filtering.
- Use log sampling for high-volume debug logs in production (e.g., 1% sample rate).

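The logging rules above can be sketched as a minimal structured logger. This is a hypothetical helper, not part of the package; the `BASE` fields and redaction list are illustrative, and a real project would use a library such as pino with config-driven values:

```javascript
// Minimal structured JSON logger sketch. `service`, `environment`, and
// `version` would normally come from config; hard-coded here for clarity.
const BASE = { service: "api", environment: "production", version: "1.0.0" };
const REDACTED_KEYS = ["password", "token", "secret", "apiKey"];

function logEntry(level, message, context = {}) {
  // Drop sensitive keys rather than logging them.
  const safe = Object.fromEntries(
    Object.entries(context).filter(([k]) => !REDACTED_KEYS.includes(k))
  );
  return JSON.stringify({
    ...BASE,
    level,
    message,
    timestamp: new Date().toISOString(),
    ...safe,
  });
}

// Every entry carries correlationId (and userId) where available.
const line = logEntry("info", "user signed in", {
  correlationId: "abc-123",
  userId: "u42",
  password: "hunter2", // filtered out before serialization
});
```

Filtering at the serialization boundary, as above, is one way to make the "never log secrets" rule a default rather than a per-call-site discipline.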
## Distributed Tracing

- Use OpenTelemetry SDK for all tracing instrumentation. Initialize the TracerProvider once at application startup before any instrumented libraries load.
- Propagate trace context via W3C Trace Context headers (`traceparent`, `tracestate`) across all service boundaries, queues, and async workflows.
- Span naming conventions:

| Span Type   | Pattern                      | Example                     |
| ----------- | ---------------------------- | --------------------------- |
| HTTP server | `HTTP {method} {route}`      | `HTTP GET /api/users/:id`   |
| HTTP client | `HTTP {method} {host}{path}` | `HTTP POST api.stripe.com/` |
| DB query    | `{db.system} {operation}`    | `firestore getDoc`          |
| Queue       | `{queue} {operation}`        | `tasks-queue publish`       |
| Internal    | `{module}.{function}`        | `auth.verifyToken`          |

- Required span attributes: `service.name`, `service.version`, `deployment.environment`. Add domain-specific attributes (e.g., `user.id`, `tenant.id`) where relevant.
- Parent-child span relationships: every outbound call (HTTP, DB, queue) creates a child span of the current context. Never create orphan spans.
- Sampling strategies: use `ParentBased(TraceIdRatioBased(0.1))` in production (10% sample rate). Always sample errors and slow requests (> p95 latency) at 100%.
- Use the OpenTelemetry Collector as a gateway between applications and backends to enable batching, retrying, and vendor-neutral export.
- Keep span event count low (< 32 per span). For high-volume events, use correlated logs or `SpanLink` instead.

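The `traceparent` propagation in bullet two can be illustrated without the SDK. In practice the OpenTelemetry propagator handles this, but the W3C wire format itself is simple; the helpers below are a sketch, not the real propagator API:

```javascript
// W3C Trace Context: traceparent = version-traceid-parentid-flags
// e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
function buildTraceparent(traceId, spanId, sampled) {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header
  );
  if (!m) return null; // malformed: start a new trace instead of guessing
  return { version: m[1], traceId: m[2], spanId: m[3], sampled: m[4] === "01" };
}

// Outbound call: forward the current trace id with the child span's id,
// so the downstream service continues the same trace.
const header = buildTraceparent(
  "4bf92f3577b34da6a3ce929d0e0e4736",
  "00f067aa0ba902b7",
  true
);
const ctx = parseTraceparent(header);
```

Rejecting malformed headers (returning `null` and starting a fresh trace) is the spec-recommended behavior and avoids corrupting downstream trace graphs.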
## Metrics

- Use OpenTelemetry Metrics SDK. Expose Prometheus-compatible `/metrics` endpoint for scraping where applicable.
- Metric naming: `{service}.{domain}.{metric}_{unit}` in snake_case. Example: `api.auth.login_duration_ms`.
- Instrument types and when to use:

| Instrument    | Use Case                          | Example                     |
| ------------- | --------------------------------- | --------------------------- |
| Counter       | Monotonically increasing totals   | `http.requests_total`       |
| Histogram     | Distributions (latency, size)     | `http.request_duration_ms`  |
| Gauge         | Point-in-time values              | `db.connection_pool_active` |
| UpDownCounter | Values that increase and decrease | `queue.messages_pending`    |

- Histogram buckets for latency: `[5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]` ms.
- Cardinality management: never use unbounded values (user IDs, request paths with params) as metric labels. Cap label cardinality to < 100 unique values per metric.
- Custom business metrics: track domain-significant events (sign-ups, purchases, feature usage) as counters with relevant dimensions.

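The naming scheme and bucket layout above can be sketched as two small helpers. These are hypothetical utilities for illustration, not the OTel SDK's API (real histograms take explicit bucket boundaries at instrument creation):

```javascript
// Latency buckets from the rule above (upper bounds, in ms).
const BUCKETS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000];

// {service}.{domain}.{metric}_{unit} naming helper.
function metricName(service, domain, metric, unit) {
  return `${service}.${domain}.${metric}_${unit}`;
}

// Index of the first bucket whose upper bound holds the observed value;
// values beyond the last bound fall into the +Inf overflow bucket.
function bucketIndex(valueMs) {
  const i = BUCKETS.findIndex((bound) => valueMs <= bound);
  return i === -1 ? BUCKETS.length : i;
}
```

Centralizing the name builder makes the convention mechanically enforceable instead of relying on reviewers catching `apiAuthLoginDuration`-style drift.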
## SLO / SLI Definitions

- Define SLIs as ratios of good events to total events, measured from the user's perspective.
- Standard SLIs:

| SLI          | Definition                                  | Measurement Source     |
| ------------ | ------------------------------------------- | ---------------------- |
| Availability | Requests returning non-5xx / total requests | Load balancer logs     |
| Latency      | Requests completing < threshold / total     | Tracing p99            |
| Error rate   | Failed operations / total operations        | Application metrics    |
| Freshness    | Data updated within SLA / total records     | Background job metrics |

- SLO targets: set per-service. Typical starting points: 99.9% availability (43 min/month budget), p99 latency < 500ms.
- Error budgets: `budget = 1 - SLO_target`. Track remaining budget on a rolling 30-day window.
- Burn rate alerts: use multi-window approach (short + long window). Fast-burn alert: 2% budget consumed in 1 hour. Slow-burn alert: 5% consumed in 6 hours. Alert only when both windows confirm.
- Burn rate alerts: use multi-window approach (short + long window). Fast-burn alert: 2% budget consumed in 1 hour. Slow-burn alert: 5% consumed in 6 hours. Alert only when both windows confirm.

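The budget and fast-burn rules above can be sketched numerically. This is one interpretation of the multi-window check (a short confirming window scaled proportionally from the 1-hour threshold); real SRE tooling typically expresses the same idea as burn-rate multipliers:

```javascript
// Error budget for a rolling window: budget = 1 - SLO target.
function errorBudget(sloTarget) {
  return 1 - sloTarget;
}

// Multi-window fast-burn check: alert only when BOTH windows confirm.
// Each argument is the fraction of the 30-day budget consumed in that window.
function fastBurnAlert(consumed1h, consumed5m) {
  const threshold1h = 0.02; // 2% of budget gone in 1 hour
  const threshold5m = 0.02 * (5 / 60); // proportional share for 5 minutes
  return consumed1h >= threshold1h && consumed5m >= threshold5m;
}
```

Requiring the short window to agree is what prevents a burst an hour ago, already recovered, from paging anyone now.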
## Alerting

| Severity | Criteria                              | Response Time     | Notification         |
| -------- | ------------------------------------- | ----------------- | -------------------- |
| P1       | Service down, data loss risk          | 15 min            | Page on-call + Slack |
| P2       | Degraded performance, SLO at risk     | 1 hour            | Page on-call         |
| P3       | Non-critical issue, workaround exists | Next business day | Slack channel        |
| P4       | Cosmetic / low-impact                 | Sprint backlog    | Ticket only          |

- Every alert must link to a runbook with: symptoms, likely causes, diagnostic steps, remediation actions.
- Alert fatigue prevention: tune thresholds to < 5 actionable alerts per on-call shift. Suppress duplicate alerts within a 10-minute dedup window.
- Route alerts by service ownership. Use escalation policies: if P1/P2 unacknowledged in 15 min, escalate to secondary.
- Review alert quality monthly: snooze/delete alerts with < 20% action rate.

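The 10-minute dedup window can be sketched as a small suppression gate. This is a hypothetical in-memory version; alerting platforms implement the same logic with persistent state:

```javascript
// Suppress duplicate alerts inside a 10-minute dedup window.
const DEDUP_WINDOW_MS = 10 * 60 * 1000;
const lastFired = new Map(); // alert key -> timestamp (ms) of last fire

function shouldFire(alertKey, nowMs) {
  const last = lastFired.get(alertKey);
  if (last !== undefined && nowMs - last < DEDUP_WINDOW_MS) {
    return false; // duplicate within the window: suppress
  }
  lastFired.set(alertKey, nowMs);
  return true;
}
```

Note that suppressed duplicates do not extend the window here; a persistently firing condition re-alerts once per window rather than being silenced forever.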
## Structured Error Reporting

- Integrate Sentry (or equivalent) for automated error capture in both server and client environments.
- Configure release tracking: tag errors with `release` (git SHA or semver) and upload source maps for readable stack traces.
- Enable breadcrumbs: capture the last 50 user actions, network requests, and console messages leading to an error.
- Error grouping: use custom fingerprints for domain-specific errors to prevent over-grouping. Default fingerprinting is acceptable for unhandled exceptions.
- Enrich error context with `correlationId`, `userId`, environment, and relevant business state. Never attach PII or secrets.
- Set sample rates: 100% for errors, 10% for transactions in production. Adjust based on volume and budget.

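The sample-rate rule in the last bullet can be sketched as a decision function. This is illustrative only — real SDKs such as Sentry's take this as configuration (`sampleRate`, `tracesSampleRate`) rather than a hand-rolled predicate:

```javascript
// Keep every error event; keep roughly 10% of transactions in production.
// `random` is injectable so the policy is testable deterministically.
function shouldSend(event, random = Math.random()) {
  if (event.type === "error") return true; // 100% of errors
  if (event.type === "transaction") return random < 0.1; // ~10% of transactions
  return false;
}
```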
## Dashboard Standards

- Required dashboards per service:

| Dashboard        | Contents                                                        |
| ---------------- | --------------------------------------------------------------- |
| Service Health   | Request rate, error rate, latency p50/p95/p99, saturation       |
| Business Metrics | Key domain counters, conversion funnels, feature adoption       |
| Dependencies     | Upstream/downstream latency, error rates, circuit breaker state |
| Infrastructure   | CPU, memory, disk, connection pools, queue depth                |

- Dashboard-as-code: define dashboards in version-controlled JSON/YAML (Grafana provisioning, Terraform, or equivalent). No manual dashboard creation in production.
- Every dashboard panel includes: descriptive title, unit labels, threshold lines for SLO targets, and a link to the relevant runbook or alert.
- Review dashboards quarterly: remove unused panels, update thresholds, verify data source accuracy.

## OpenTelemetry Semantic Conventions

Follow the [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/) (v1.29+) for consistent attribute naming across all telemetry signals. Semantic conventions ensure interoperability between instrumentation libraries, collectors, and observability backends.

### Standard Attribute Namespaces

| Namespace | Scope | Key Attributes |
|-----------|-------|----------------|
| `http.*` | HTTP client and server spans | `http.request.method`, `http.response.status_code`, `http.route`, `url.full`, `url.scheme` |
| `db.*` | Database client spans | `db.system` (e.g., `postgresql`, `mongodb`), `db.operation.name`, `db.collection.name`, `db.query.text` (sanitized) |
| `rpc.*` | RPC client and server spans | `rpc.system` (e.g., `grpc`, `jsonrpc`), `rpc.service`, `rpc.method`, `rpc.grpc.status_code` |
| `messaging.*` | Message queue spans | `messaging.system` (e.g., `kafka`, `rabbitmq`), `messaging.operation.type` (`publish`, `receive`, `process`), `messaging.destination.name` |
| `faas.*` | Serverless/FaaS invocations | `faas.trigger` (`http`, `pubsub`, `timer`), `faas.invoked_name`, `faas.coldstart` |
| `cloud.*` | Cloud provider context | `cloud.provider`, `cloud.region`, `cloud.availability_zone`, `cloud.account.id` |
| `k8s.*` | Kubernetes context | `k8s.namespace.name`, `k8s.pod.name`, `k8s.deployment.name`, `k8s.container.name` |

- Use the semantic convention attribute names exactly as specified. Do not invent custom alternatives for concepts already covered by the conventions.
- When semantic conventions are marked "Experimental," prefer them over project-specific names to ease future migration to stable conventions.

### Resource Semantic Conventions

Every telemetry-producing service must declare resource attributes at startup:

| Attribute | Stability | Requirement | Description |
|-----------|-----------|-------------|-------------|
| `service.name` | Stable | Required | Logical name of the service (e.g., `api-gateway`, `auth-service`) |
| `service.version` | Stable | Recommended | Semantic version of the service (e.g., `1.4.2`) |
| `deployment.environment.name` | Stable | Recommended | Deployment environment (e.g., `production`, `staging`, `development`) |
| `service.instance.id` | Experimental | Recommended | Unique instance identifier (e.g., pod name, container ID) |
| `service.namespace` | Experimental | Optional | Namespace for grouping related services |
| `telemetry.sdk.name` | Stable | Auto | Set by the SDK (e.g., `opentelemetry`) |
| `telemetry.sdk.language` | Stable | Auto | Set by the SDK (e.g., `nodejs`, `python`) |
| `telemetry.sdk.version` | Stable | Auto | Set by the SDK |

- Configure `service.name` and `service.version` via environment variables (`OTEL_SERVICE_NAME`, `OTEL_RESOURCE_ATTRIBUTES`) or programmatically at SDK initialization.
- Do not use the default `unknown_service` value in any deployed environment. Every service must have an explicit name.

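The environment-variable path above can be sketched as follows. Per the OTel SDK environment-variable spec, `OTEL_RESOURCE_ATTRIBUTES` is a comma-separated `key=value` list, and `OTEL_SERVICE_NAME` takes precedence over any `service.name` it contains; this parser is a simplified illustration of that behavior, not the SDK implementation:

```javascript
// Build a resource-attribute map from environment variables.
function resourceFromEnv(env) {
  const attrs = {};
  for (const pair of (env.OTEL_RESOURCE_ATTRIBUTES || "").split(",")) {
    const idx = pair.indexOf("=");
    if (idx > 0) attrs[pair.slice(0, idx).trim()] = pair.slice(idx + 1).trim();
  }
  // OTEL_SERVICE_NAME wins over a service.name from the attribute list.
  if (env.OTEL_SERVICE_NAME) attrs["service.name"] = env.OTEL_SERVICE_NAME;
  // The SDK would fall back to unknown_service; the rule above forbids
  // shipping that default, so treat it as a deploy-time check failure.
  if (!attrs["service.name"]) attrs["service.name"] = "unknown_service";
  return attrs;
}
```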
### Span Status Codes

| Code | When to Set |
|------|-------------|
| `UNSET` | Default. The span completed without the instrumentation indicating an error. |
| `OK` | Explicitly set only when the application considers the operation successful and wants to override any lower-level error signal. Use sparingly. |
| `ERROR` | The operation failed. Set when an exception is caught, an HTTP response is 5xx, or a business-logic error occurs that should be visible in error rate metrics. |

- Set span status to `ERROR` for server-side errors (5xx) and unhandled exceptions. Do not set `ERROR` for client errors (4xx) on the server span — those are valid responses, not server failures.
- Attach the exception to the span as a span event (`exception.type`, `exception.message`, `exception.stacktrace`) when setting status to `ERROR`.
- Use `OK` only when you want to suppress error signals from child spans. In most cases, leaving status as `UNSET` is correct.

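The server-span rule above reduces to a small mapping. A sketch (the real SDK call would be `span.setStatus({ code: SpanStatusCode.ERROR })`; this function just encodes the decision):

```javascript
// Status for a server span: 5xx and unhandled exceptions are ERROR;
// 4xx client errors stay UNSET (valid responses, not server failures).
function serverSpanStatus(httpStatus, threwUnhandled = false) {
  if (threwUnhandled || httpStatus >= 500) return "ERROR";
  return "UNSET"; // do not set OK just because the request succeeded
}
```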
### Attribute Naming Guidelines

- Use dot-separated namespaces: `http.request.method`, not `httpRequestMethod` or `http_request_method`.
- Attribute values should be low-cardinality. Never use unbounded values (full URLs with query params, raw SQL, user-generated content) as attribute values.
- For high-cardinality identifiers (user IDs, request IDs), use span attributes sparingly and rely on correlated logs for detail.
- Prefer semantic convention attributes over custom attributes. When custom attributes are necessary, prefix them with your organization or project namespace (e.g., `myapp.feature.flag_key`).
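The first guideline can be enforced with a lint-style check. This regex is a loose sketch of the convention (dot-separated, lowercase, with underscores allowed inside later segments, as in `http.response.status_code`), not an official validator:

```javascript
// Accepts dot-namespaced lowercase keys; rejects camelCase and
// top-level snake_case names.
function isValidAttributeKey(key) {
  return /^[a-z0-9]+(\.[a-z0-9_]+)+$/.test(key);
}
```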
@@ -0,0 +1,165 @@
1
+ ---
2
+ description: Logging, metrics, and tracing conventions for the project
3
+ alwaysApply: false
4
+ ---
5
+ # Observability
6
+
7
+ ## Structured Logging
8
+
9
+ - Use structured JSON logging. No `console.log` in production code.
10
+ - Log levels: `error` (failures), `warn` (degraded), `info` (state changes), `debug` (dev only).
11
+ - Every log entry includes `correlationId` and `userId` (if available).
12
+ - Never log secrets, PII, tokens, passwords, or sensitive content.
13
+ - Instrument key operations with timing metrics. Serverless functions log execution time and outcome.
14
+ - Client-side: log errors to a sink (e.g., error reporting service), not just `console.error`.
15
+ - Prefer event-based metrics over polling. Trace user flows end-to-end with `correlationId`.
16
+ - Respect performance budgets: logging must not add > 10ms latency to hot paths.
17
+ - Include `service`, `environment`, and `version` fields in every log entry for filtering.
18
+ - Use log sampling for high-volume debug logs in production (e.g., 1% sample rate).
19
+
20
+ ## Distributed Tracing
21
+
22
+ - Use OpenTelemetry SDK for all tracing instrumentation. Initialize the TracerProvider once at application startup before any instrumented libraries load.
23
+ - Propagate trace context via W3C Trace Context headers (`traceparent`, `tracestate`) across all service boundaries, queues, and async workflows.
24
+ - Span naming conventions:
25
+
26
+ | Span Type | Pattern | Example |
27
+ | ----------- | ------------------------------ | --------------------------- |
28
+ | HTTP server | `HTTP {method} {route}` | `HTTP GET /api/users/:id` |
29
+ | HTTP client | `HTTP {method} {host}{path}` | `HTTP POST api.stripe.com/` |
30
+ | DB query | `{db.system} {operation}` | `firestore getDoc` |
31
+ | Queue | `{queue} {operation}` | `tasks-queue publish` |
32
+ | Internal | `{module}.{function}` | `auth.verifyToken` |
33
+
34
+ - Required span attributes: `service.name`, `service.version`, `deployment.environment`. Add domain-specific attributes (e.g., `user.id`, `tenant.id`) where relevant.
35
+ - Parent-child span relationships: every outbound call (HTTP, DB, queue) creates a child span of the current context. Never create orphan spans.
36
+ - Sampling strategies: use `ParentBased(TraceIdRatioBased(0.1))` in production (10% sample rate). Always sample errors and slow requests (> p95 latency) at 100%.
37
+ - Use the OpenTelemetry Collector as a gateway between applications and backends to enable batching, retrying, and vendor-neutral export.
38
+ - Keep span event count low (< 32 per span). For high-volume events, use correlated logs or `SpanLink` instead.
39
+
40
+ ## Metrics
41
+
42
+ - Use OpenTelemetry Metrics SDK. Expose Prometheus-compatible `/metrics` endpoint for scraping where applicable.
43
+ - Metric naming: `{service}.{domain}.{metric}_{unit}` in snake_case. Example: `api.auth.login_duration_ms`.
44
+ - Instrument types and when to use:
45
+
46
+ | Instrument | Use Case | Example |
47
+ | ----------- | ---------------------------------- | -------------------------------- |
48
+ | Counter | Monotonically increasing totals | `http.requests_total` |
49
+ | Histogram | Distributions (latency, size) | `http.request_duration_ms` |
50
+ | Gauge | Point-in-time values | `db.connection_pool_active` |
51
+ | UpDownCounter | Values that increase and decrease | `queue.messages_pending` |
52
+
53
+ - Histogram buckets for latency: `[5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]` ms.
54
+ - Cardinality management: never use unbounded values (user IDs, request paths with params) as metric labels. Cap label cardinality to < 100 unique values per metric.
55
+ - Custom business metrics: track domain-significant events (sign-ups, purchases, feature usage) as counters with relevant dimensions.
56
+
57
+ ## SLO / SLI Definitions
58
+
59
+ - Define SLIs as ratios of good events to total events, measured from the user's perspective.
60
+ - Standard SLIs:
61
+
62
+ | SLI | Definition | Measurement Source |
63
+ | ---------------- | --------------------------------------------- | ------------------------ |
64
+ | Availability | Requests returning non-5xx / total requests | Load balancer logs |
65
+ | Latency | Requests completing < threshold / total | Tracing p99 |
66
+ | Error rate | Failed operations / total operations | Application metrics |
67
+ | Freshness | Data updated within SLA / total records | Background job metrics |
68
+
69
+ - SLO targets: set per-service. Typical starting points: 99.9% availability (43 min/month budget), p99 latency < 500ms.
70
+ - Error budgets: `budget = 1 - SLO_target`. Track remaining budget on a rolling 30-day window.
71
+ - Burn rate alerts: use multi-window approach (short + long window). Fast-burn alert: 2% budget consumed in 1 hour. Slow-burn alert: 5% consumed in 6 hours. Alert only when both windows confirm.
72
+
73
+ ## Alerting
74
+
75
+ | Severity | Criteria | Response Time | Notification |
76
+ | -------- | ----------------------------------- | ------------- | ------------------- |
77
+ | P1 | Service down, data loss risk | 15 min | Page on-call + Slack |
78
+ | P2 | Degraded performance, SLO at risk | 1 hour | Page on-call |
79
+ | P3 | Non-critical issue, workaround exists | Next business day | Slack channel |
80
+ | P4 | Cosmetic / low-impact | Sprint backlog | Ticket only |
81
+
82
+ - Every alert must link to a runbook with: symptoms, likely causes, diagnostic steps, remediation actions.
83
+ - Alert fatigue prevention: tune thresholds to < 5 actionable alerts per on-call shift. Suppress duplicate alerts within a 10-minute dedup window.
84
+ - Route alerts by service ownership. Use escalation policies: if P1/P2 unacknowledged in 15 min, escalate to secondary.
85
+ - Review alert quality monthly: snooze/delete alerts with < 20% action rate.
86
+
87
+ ## Structured Error Reporting
88
+
89
+ - Integrate Sentry (or equivalent) for automated error capture in both server and client environments.
90
+ - Configure release tracking: tag errors with `release` (git SHA or semver) and upload source maps for readable stack traces.
91
+ - Enable breadcrumbs: capture the last 50 user actions, network requests, and console messages leading to an error.
92
+ - Error grouping: use custom fingerprints for domain-specific errors to prevent over-grouping. Default fingerprinting is acceptable for unhandled exceptions.
93
+ - Enrich error context with `correlationId`, `userId`, environment, and relevant business state. Never attach PII or secrets.
94
+ - Set sample rates: 100% for errors, 10% for transactions in production. Adjust based on volume and budget.
95
+
96
+ ## Dashboard Standards
97
+
98
+ - Required dashboards per service:
99
+
100
+ | Dashboard | Contents |
101
+ | ---------------- | ----------------------------------------------------------- |
102
+ | Service Health | Request rate, error rate, latency p50/p95/p99, saturation |
103
+ | Business Metrics | Key domain counters, conversion funnels, feature adoption |
104
+ | Dependencies | Upstream/downstream latency, error rates, circuit breaker state |
105
+ | Infrastructure | CPU, memory, disk, connection pools, queue depth |
106
+
107
+ - Dashboard-as-code: define dashboards in version-controlled JSON/YAML (Grafana provisioning, Terraform, or equivalent). No manual dashboard creation in production.
108
+ - Every dashboard panel includes: descriptive title, unit labels, threshold lines for SLO targets, and a link to the relevant runbook or alert.
109
+ - Review dashboards quarterly: remove unused panels, update thresholds, verify data source accuracy.
110
+
111
+ ## OpenTelemetry Semantic Conventions
112
+
113
+ Follow the [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/) (v1.29+) for consistent attribute naming across all telemetry signals. Semantic conventions ensure interoperability between instrumentation libraries, collectors, and observability backends.
114
+
115
+ ### Standard Attribute Namespaces
116
+
117
+ | Namespace | Scope | Key Attributes |
118
+ |-----------|-------|----------------|
119
+ | `http.*` | HTTP client and server spans | `http.request.method`, `http.response.status_code`, `http.route`, `url.full`, `url.scheme` |
120
+ | `db.*` | Database client spans | `db.system` (e.g., `postgresql`, `mongodb`), `db.operation.name`, `db.collection.name`, `db.query.text` (sanitized) |
121
+ | `rpc.*` | RPC client and server spans | `rpc.system` (e.g., `grpc`, `jsonrpc`), `rpc.service`, `rpc.method`, `rpc.grpc.status_code` |
122
+ | `messaging.*` | Message queue spans | `messaging.system` (e.g., `kafka`, `rabbitmq`), `messaging.operation.type` (`publish`, `receive`, `process`), `messaging.destination.name` |
123
+ | `faas.*` | Serverless/FaaS invocations | `faas.trigger` (`http`, `pubsub`, `timer`), `faas.invoked_name`, `faas.coldstart` |
124
+ | `cloud.*` | Cloud provider context | `cloud.provider`, `cloud.region`, `cloud.availability_zone`, `cloud.account.id` |
125
+ | `k8s.*` | Kubernetes context | `k8s.namespace.name`, `k8s.pod.name`, `k8s.deployment.name`, `k8s.container.name` |
126
+
127
+ - Use the semantic convention attribute names exactly as specified. Do not invent custom alternatives for concepts already covered by the conventions.
128
+ - When semantic conventions are marked "Experimental," prefer them over project-specific names to ease future migration to stable conventions.
129
+
130
+ ### Resource Semantic Conventions
131
+
132
+ Every telemetry-producing service must declare resource attributes at startup:
133
+
134
+ | Attribute | Stability | Requirement | Description |
135
+ |-----------|-----------|-------------|-------------|
136
+ | `service.name` | Stable | Required | Logical name of the service (e.g., `api-gateway`, `auth-service`) |
137
+ | `service.version` | Stable | Recommended | Semantic version of the service (e.g., `1.4.2`) |
138
+ | `deployment.environment.name` | Stable | Recommended | Deployment environment (e.g., `production`, `staging`, `development`) |
139
+ | `service.instance.id` | Experimental | Recommended | Unique instance identifier (e.g., pod name, container ID) |
140
+ | `service.namespace` | Experimental | Optional | Namespace for grouping related services |
141
+ | `telemetry.sdk.name` | Stable | Auto | Set by the SDK (e.g., `opentelemetry`) |
142
+ | `telemetry.sdk.language` | Stable | Auto | Set by the SDK (e.g., `nodejs`, `python`) |
143
+ | `telemetry.sdk.version` | Stable | Auto | Set by the SDK |
144
+
145
+ - Configure `service.name` and `service.version` via environment variables (`OTEL_SERVICE_NAME`, `OTEL_RESOURCE_ATTRIBUTES`) or programmatically at SDK initialization.
146
+ - Do not use the default `unknown_service` value in any deployed environment. Every service must have an explicit name.
147
+
148
+ ### Span Status Codes
149
+
150
+ | Code | When to Set |
151
+ |------|-------------|
152
+ | `UNSET` | Default. The span completed without the instrumentation indicating an error. |
153
+ | `OK` | Explicitly set only when the application considers the operation successful and wants to override any lower-level error signal. Use sparingly. |
154
+ | `ERROR` | The operation failed. Set when an exception is caught, an HTTP response is 5xx, or a business-logic error occurs that should be visible in error rate metrics. |
155
+
156
+ - Set span status to `ERROR` for server-side errors (5xx) and unhandled exceptions. Do not set `ERROR` for client errors (4xx) on the server span — those are valid responses, not server failures.
157
+ - Attach the exception to the span as a span event (`exception.type`, `exception.message`, `exception.stacktrace`) when setting status to `ERROR`.
158
+ - Use `OK` only when you want to suppress error signals from child spans. In most cases, leaving status as `UNSET` is correct.
159
+
160
+ ### Attribute Naming Guidelines
161
+
162
+ - Use dot-separated namespaces: `http.request.method`, not `httpRequestMethod` or `http_request_method`.
163
+ - Attribute values should be low-cardinality. Never use unbounded values (full URLs with query params, raw SQL, user-generated content) as attribute values.
164
+ - For high-cardinality identifiers (user IDs, request IDs), use span attributes sparingly and rely on correlated logs for detail.
165
+ - Prefer semantic convention attributes over custom attributes. When custom attributes are necessary, prefix them with your organization or project namespace (e.g., `myapp.feature.flag_key`).
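Two of the guidelines above lend themselves to small helpers. This is an illustrative sketch: the `myapp.` prefix is a hypothetical project namespace, not a real convention.

```typescript
// Normalize a URL into a low-cardinality attribute value by dropping the
// query string and fragment, keeping only origin + path.
function lowCardinalityUrl(url: string): string {
  const u = new URL(url);
  return `${u.origin}${u.pathname}`;
}

// Ensure custom attributes carry the project namespace prefix.
function customAttr(name: string): string {
  return name.startsWith("myapp.") ? name : `myapp.${name}`;
}
```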
@@ -0,0 +1,109 @@
1
+ ---
2
+ description: Performance budgets and targets for the project
3
+ alwaysApply: false
4
+ ---
5
+ # Performance Budgets
6
+
7
+ ## Application Budgets
8
+
9
+ | Metric | Budget |
10
+ | --------------------------------- | -------------------- |
11
+ | UI render | 60fps (16ms/frame) |
12
+ | Cold start to interactive | 1.5 seconds |
13
+ | Idle CPU usage | 1% |
14
+ | Memory footprint | 30 MB |
15
+ | Event processing latency | 10ms per event |
16
+ | Bundle size (initial, gzipped) | 500 KB |
17
+ | Backend reads per session start | ≤ 5 documents |
18
+ | Serverless warm execution | 500ms |
19
+ | Serverless cold start | 3 seconds |
20
+ | Main thread long tasks | 0 tasks > 50ms |
21
+
22
+ ## Core Web Vitals
23
+
24
+ Targets align with Google's "Good" thresholds (measured at p75 from real user data):
25
+
26
+ | Metric | Good | Needs Improvement | Poor | What It Measures |
27
+ | ------ | ----------- | ----------------- | ---------- | --------------------------- |
28
+ | LCP | ≤ 2.5s | 2.5–4.0s | > 4.0s | Loading — largest visible element render time |
29
+ | INP | ≤ 200ms | 200–500ms | > 500ms | Responsiveness — worst interaction latency across full session |
30
+ | CLS | ≤ 0.1 | 0.1–0.25 | > 0.25 | Visual stability — cumulative unexpected layout shifts |
31
+
32
+ - **LCP optimization**: preload hero images/fonts, inline critical CSS, use `fetchpriority="high"` on LCP elements, server-side render above-the-fold content.
33
+ - **INP optimization**: break long tasks with `scheduler.yield()`, defer non-critical event handlers, avoid layout thrashing in input handlers, keep interaction callbacks < 100ms.
34
+ - **CLS optimization**: set explicit `width`/`height` on images and embeds, reserve space for dynamic content, avoid inserting content above the viewport after load.
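The thresholds in the table can be applied mechanically to p75 field data. A minimal sketch (LCP/INP in milliseconds, CLS unitless):

```typescript
// Classify a p75 field value against the Core Web Vitals thresholds above.
type Rating = "good" | "needs-improvement" | "poor";

const THRESHOLDS: Record<string, [number, number]> = {
  LCP: [2500, 4000], // ms
  INP: [200, 500],   // ms
  CLS: [0.1, 0.25],  // unitless score
};

function rate(metric: keyof typeof THRESHOLDS, p75: number): Rating {
  const [good, poor] = THRESHOLDS[metric];
  if (p75 <= good) return "good";
  if (p75 <= poor) return "needs-improvement";
  return "poor";
}
```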
35
+
36
+ ## API Response Time Budgets
37
+
38
+ | Endpoint Type | p50 | p95 | p99 | Timeout |
39
+ | -------------- | ------- | ------- | ------- | ------- |
40
+ | Read (single) | 50ms | 150ms | 300ms | 5s |
41
+ | Read (list) | 100ms | 300ms | 500ms | 10s |
42
+ | Write (create) | 100ms | 250ms | 500ms | 10s |
43
+ | Write (update) | 75ms | 200ms | 400ms | 10s |
44
+ | Search | 150ms | 500ms | 1000ms | 15s |
45
+ | Aggregation | 200ms | 800ms | 2000ms | 30s |
46
+ | Auth flow | 100ms | 300ms | 500ms | 10s |
47
+
48
+ - All endpoints must return within their timeout or abort with a structured error.
49
+ - Client-side: enforce request timeouts with `AbortController`/`AbortSignal.timeout()` matching the table above. Show loading indicators after 300ms.
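One way to keep client timeouts in sync with the budget table is a single lookup table. A sketch assuming `AbortSignal.timeout()` is available (Node 17.3+ or modern browsers); the endpoint-kind keys are illustrative:

```typescript
// Timeout budget per endpoint type, mirroring the table above (ms).
const TIMEOUT_MS = {
  readSingle: 5_000,
  readList: 10_000,
  create: 10_000,
  update: 10_000,
  search: 15_000,
  aggregation: 30_000,
  auth: 10_000,
} as const;

// Fetch with the budgeted timeout: the request aborts when the budget is hit.
function fetchWithBudget(url: string, kind: keyof typeof TIMEOUT_MS): Promise<Response> {
  return fetch(url, { signal: AbortSignal.timeout(TIMEOUT_MS[kind]) });
}
```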
50
+
51
+ ## Database Query Budgets
52
+
53
+ | Constraint | Budget / Rule |
54
+ | --------------------------------- | ------------------------------------ |
55
+ | Single query execution | < 100ms |
56
+ | N+1 detection | 0 N+1 patterns (enforced via lint/review) |
57
+ | Queries per request | ≤ 10 |
58
+ | Connection pool size | 10–20 per service instance |
59
+ | Index coverage | Every query in hot paths must hit an index |
60
+ | Full collection scans | Forbidden in production code |
61
+ | Batch read size | ≤ 100 documents per batch |
62
+
63
+ - Use query explain plans during code review for new or modified queries.
64
+ - Log slow queries (> 200ms) with the query pattern and execution plan.
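The "≤ 10 queries per request" and slow-query rules can be enforced with a small per-request wrapper. A hypothetical sketch, not a real ORM hook:

```typescript
// Per-request query budget: counts queries, throws when the budget is
// exceeded, and warns on queries slower than the slow-query threshold.
class QueryBudget {
  private count = 0;
  constructor(private readonly maxQueries = 10, private readonly slowMs = 200) {}

  /** Count one query; throws once the per-request budget is exhausted. */
  charge(queryName: string): void {
    if (++this.count > this.maxQueries) {
      throw new Error(`query budget exceeded: ${queryName} is query #${this.count}, budget is ${this.maxQueries}`);
    }
  }

  /** Wrap a query execution: enforce the count budget and log slow queries. */
  async run<T>(queryName: string, exec: () => Promise<T>): Promise<T> {
    this.charge(queryName);
    const start = Date.now();
    try {
      return await exec();
    } finally {
      const elapsed = Date.now() - start;
      // In real code, log the query pattern and execution plan here.
      if (elapsed > this.slowMs) console.warn(`slow query ${queryName}: ${elapsed}ms`);
    }
  }
}
```

A test-level counter like this is also how the "DB query count" CI gate in the enforcement table can be implemented.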
65
+
66
+ ## Network Budgets
67
+
68
+ | Constraint | Budget |
69
+ | --------------------------------- | ------------------------------------ |
70
+ | Requests per page load | ≤ 30 |
71
+ | Total transfer size (initial) | ≤ 1.5 MB |
72
+ | Image payload per page | ≤ 500 KB |
73
+ | Individual API response | ≤ 100 KB (gzipped) |
74
+ | WebSocket message size | ≤ 10 KB |
75
+ | Font files loaded | ≤ 2 families, WOFF2 only |
76
+ | Third-party script budget | ≤ 100 KB (gzipped) |
77
+
78
+ - Images: serve WebP/AVIF with responsive `srcset`. Lazy-load below-the-fold images.
79
+ - Fonts: use `font-display: swap`, preload critical fonts, subset to used character ranges.
80
+ - Enable Brotli or gzip compression on all text responses. Set `Cache-Control` headers per resource type.
81
+
82
+ ## Enforcement Mechanisms
83
+
84
+ | Gate | Tool / Method | Trigger | Threshold |
85
+ | ---------------------- | -------------------------- | ---------------- | ---------------------------- |
86
+ | Bundle size | `size-limit` in CI | Every PR | Fail if > 500 KB gzipped |
87
+ | Lighthouse score | Lighthouse CI assertions | Every PR | Performance ≥ 90 |
88
+ | Core Web Vitals | Lighthouse CI + CrUX | PR + weekly | All metrics in "Good" range |
89
+ | API latency | Integration test assertions | Every PR | p95 within budget table |
90
+ | DB query count | Test-level query counter | Every PR | ≤ 10 per request |
91
+ | Runtime regression | RUM dashboards + alerts | Continuous | Alert on p75 regression > 10% |
92
+ | Synthetic monitoring | Scheduled Lighthouse runs | Hourly/daily | Slack alert on score drop |
93
+
94
+ - CI must block merge if any enforcement gate fails. No manual overrides without tech-lead approval.
95
+ - Track budget trends over time: plot bundle size, LCP, and INP per release on the performance dashboard.
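For the bundle-size gate, a `.size-limit.json` along these lines would enforce the 500 KB budget (the `dist/index.js` path is illustrative; `size-limit` measures gzipped size by default):

```json
[
  { "path": "dist/index.js", "limit": "500 KB" }
]
```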
96
+
97
+ ## Measurement Tooling
98
+
99
+ | Metric Category | Lab / CI Tool | Field / Production Tool |
100
+ | ------------------ | ---------------------------------- | -------------------------------- |
101
+ | Core Web Vitals | Lighthouse CI, WebPageTest | CrUX, `web-vitals` library (RUM) |
102
+ | Bundle size | `size-limit`, `webpack-bundle-analyzer` | N/A |
103
+ | API latency | Integration tests, k6 | OpenTelemetry histograms |
104
+ | DB performance | Query explain, test query counters | Slow query logs, APM traces |
105
+ | Memory / CPU | Chrome DevTools, profiler | Infrastructure metrics (Prometheus) |
106
+ | Visual regression | Playwright screenshot diffing | RUM CLS tracking |
107
+
108
+ - Automated regression detection: compare each PR's metrics against the `main` branch baseline. Flag regressions > 5% in any budget metric.
109
+ - Review performance budgets quarterly and tighten thresholds as the application matures.
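The regression check described above is a simple baseline comparison. A sketch, assuming the CI job emits metrics as name → number maps:

```typescript
// Flag any budget metric that regressed more than thresholdPct versus the
// main-branch baseline. Higher values are assumed to be worse (latency,
// size); metrics where lower is worse would need an inverted comparison.
interface Regression { metric: string; baseline: number; current: number; pct: number; }

function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  thresholdPct = 5,
): Regression[] {
  const out: Regression[] = [];
  for (const [metric, base] of Object.entries(baseline)) {
    const cur = current[metric];
    if (cur === undefined || base <= 0) continue; // metric missing or unusable baseline
    const pct = ((cur - base) / base) * 100;
    if (pct > thresholdPct) out.push({ metric, baseline: base, current: cur, pct });
  }
  return out;
}
```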
@@ -0,0 +1,76 @@
1
+ ---
2
+ id: hatch3r-secrets-management
3
+ type: rule
4
+ description: Secret management, rotation, and secure handling patterns for the project
5
+ scope: always
6
+ ---
7
+ # Secrets Management
8
+
9
+ ## Env Var Management
10
+
11
+ - Store configuration and secrets in environment variables. Never hard-code secrets (API keys, tokens, passwords, connection strings) in source code, comments, or commit messages.
12
+ - Maintain a `.env.example` file in the repository root listing every required environment variable with placeholder values and brief descriptions. Update it whenever a new env var is introduced.
13
+ - Actual `.env` files must be in `.gitignore`. Verify `.gitignore` includes `.env`, `.env.local`, `.env.*.local`, and any other secret-bearing files before every commit.
14
+ - Use a typed env loader (e.g., `@t3-oss/env-core`, `envalid`, `zod` schema) to validate and parse environment variables at application startup. Fail fast with a clear error message listing missing or invalid variables.
15
+ - Separate secrets by environment: `.env.development`, `.env.staging`, `.env.production`. Never share production secrets with development environments.
16
+ - Document the source of each secret (which service/provider generates it) in `.env.example` or an internal secrets inventory document.
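The fail-fast validation rule can be sketched without a library (in a real project, prefer `zod` or `envalid` as the bullet above says). This version collects every missing variable before throwing, so the error lists them all at once; pass `process.env` as the source at startup:

```typescript
// Validate that all required env vars are present; throw one error naming
// every missing variable rather than failing on the first.
function loadEnv<K extends string>(
  required: readonly K[],
  source: Record<string, string | undefined>,
): Record<K, string> {
  const missing = required.filter((name) => !source[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(
    required.map((name) => [name, source[name] as string]),
  ) as Record<K, string>;
}
```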
17
+
18
+ ## Secret Rotation Policies
19
+
20
+ | Secret Type | Rotation Frequency | Trigger for Immediate Rotation |
21
+ |-------------|-------------------|-------------------------------|
22
+ | API keys (third-party) | Every 90 days | Suspected compromise, employee departure |
23
+ | Database credentials | Every 90 days | Credential exposure, personnel change |
24
+ | JWT signing keys | Every 180 days | Algorithm upgrade, key compromise |
25
+ | Webhook secrets | Every 180 days | Partner-side breach, integration change |
26
+ | Service account tokens | Every 90 days | Scope change, compromise |
27
+ | Encryption keys | Every 365 days (or per compliance) | Algorithm vulnerability, compliance audit |
28
+ | OAuth client secrets | Every 180 days | Provider breach, app re-registration |
29
+
30
+ - Implement dual-key (overlap) rotation: deploy the new secret, update all consumers, verify, then revoke the old secret. Never revoke before all consumers have migrated.
31
+ - Automate rotation where the provider supports it (AWS Secrets Manager auto-rotation, GCP Secret Manager rotation schedules).
32
+ - Log every rotation event: who initiated, when, which secret (by name, never by value), success/failure.
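The rotation schedule in the table can be checked mechanically. A sketch with illustrative secret-kind keys mapped to the policy days above:

```typescript
// Rotation policy in days per secret kind, mirroring the table above.
const ROTATION_DAYS: Record<string, number> = {
  "api-key": 90,
  "db-credential": 90,
  "jwt-signing-key": 180,
  "webhook-secret": 180,
  "encryption-key": 365,
};

// True when a secret has exceeded its rotation window.
function rotationDue(kind: string, lastRotated: Date, now: Date = new Date()): boolean {
  const maxDays = ROTATION_DAYS[kind];
  if (maxDays === undefined) throw new Error(`unknown secret kind: ${kind}`);
  const ageDays = (now.getTime() - lastRotated.getTime()) / 86_400_000;
  return ageDays > maxDays;
}
```

A scheduled job running this check against the secrets inventory is one way to surface overdue rotations before they become audit findings.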
33
+
34
+ ## Cloud Secret Managers
35
+
36
+ - **AWS Secrets Manager:** Store secrets as JSON key-value pairs. Use IAM policies to scope access per service/role. Enable automatic rotation with Lambda rotation functions. Use `aws-sdk` `GetSecretValue` at runtime — never bake secrets into container images or deployment artifacts.
37
+ - **GCP Secret Manager:** Store each secret as a versioned resource. Grant `secretmanager.secretAccessor` role per service account with resource-level IAM. Access secrets via `@google-cloud/secret-manager` client library at startup. Use secret versions (not `latest` in production) for auditability.
38
+ - **Azure Key Vault:** Store secrets, keys, and certificates. Use Managed Identity for authentication — no credentials needed to access the vault. Apply access policies per application identity. Enable soft-delete and purge protection.
39
+ - **HashiCorp Vault:** Use dynamic secrets where possible (database credentials, cloud IAM). Implement AppRole or Kubernetes auth for machine identity. Set TTLs on all leases. Audit log every access.
40
+ - **General:** Abstract the secret provider behind an interface so the application code is not coupled to a specific cloud provider. Inject the provider at startup via configuration.
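The provider abstraction in the last bullet can be sketched as a narrow interface; application code depends only on it, and an AWS/GCP/Azure/Vault adapter (or the env-var fallback for local development shown here) is injected at startup. Names are illustrative:

```typescript
// The only surface application code sees: fetch a secret by name.
interface SecretProvider {
  get(name: string): Promise<string>;
}

// Local-development fallback that reads from an env-style map.
class EnvSecretProvider implements SecretProvider {
  constructor(private readonly source: Record<string, string | undefined>) {}
  async get(name: string): Promise<string> {
    const value = this.source[name];
    if (value === undefined) throw new Error(`secret not found: ${name}`);
    return value;
  }
}
```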
41
+
42
+ ## CI/CD Secret Injection
43
+
44
+ - **GitHub Actions:** Store secrets in repository or organization settings. Reference via `${{ secrets.SECRET_NAME }}`. Never echo secrets in workflow logs — use masking (`::add-mask::`). Use environment-scoped secrets for production deployments with required reviewers.
45
+ - **GitLab CI:** Store in project or group CI/CD variables. Mark as "Protected" (only available on protected branches) and "Masked" (redacted from logs). Use file-type variables for multi-line secrets (certificates, keys).
46
+ - **General CI principles:** Secrets must not appear in build logs, artifacts, or cached layers. Pin CI action versions by SHA (not tag) to prevent supply chain attacks on secret-accessing workflows. Rotate CI secrets on the same schedule as application secrets.
47
+ - **Ephemeral secrets:** For CI jobs that need temporary cloud access, use OIDC federation (e.g., GitHub Actions `aws-actions/configure-aws-credentials` with OIDC) instead of long-lived credentials.
48
+
49
+ ## Application-Level Secret Handling
50
+
51
+ - **Never log secrets.** Sanitize log output to redact any field matching known secret patterns (tokens, keys, passwords). Use structured logging with an explicit allowlist of loggable fields.
52
+ - **Never serialize secrets** into JSON responses, error messages, stack traces, or client-side state. Treat secrets as write-only values: they go into the system but never come back out in any observable output.
53
+ - **Memory safety:** Clear secret values from memory after use where the language/runtime permits (overwrite buffers, null references). Avoid storing secrets in global/static variables that persist for the application lifetime.
54
+ - **Transport:** Transmit secrets only over TLS 1.2+. Never send secrets in URL query parameters (they appear in access logs and browser history). Use request headers or POST body.
55
+ - **Principle of least privilege:** Each service/function should have access only to the secrets it needs. Avoid a single "master" secret store accessible to all services.
56
+ - **Secret-bearing config objects:** Wrap secrets in a `Secret<T>` type that overrides `toString()` and `toJSON()` to return `"[REDACTED]"`. This prevents accidental exposure via logging or serialization.
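The `Secret<T>` wrapper described above can be sketched as follows; the raw value is only reachable through an explicit accessor (greppable in code review), and the implicit serialization paths return the redaction marker:

```typescript
// Write-only secret wrapper: string conversion and JSON serialization are
// redacted; only an explicit expose() call yields the raw value.
class Secret<T> {
  constructor(private readonly value: T) {}
  /** Explicit, intentional access to the raw value. */
  expose(): T { return this.value; }
  toString(): string { return "[REDACTED]"; }
  toJSON(): string { return "[REDACTED]"; }
}
```

Runtimes may have further exposure paths (e.g., Node's `util.inspect`) that a production implementation should also override.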
57
+
58
+ ## Secret Scanning in CI
59
+
60
+ - **gitleaks:** Run `gitleaks detect` in CI on every push and PR. Configure `.gitleaks.toml` with project-specific rules and allowlists for false positives (test fixtures, documentation examples).
61
+ - **TruffleHog:** Use `trufflehog git` for historical scanning of the full repository. Run quarterly or on suspected compromise. Focus on high-entropy strings and known secret patterns.
62
+ - **GitHub Secret Scanning:** Enable GitHub's built-in secret scanning and push protection. Configure custom patterns for project-specific secret formats.
63
+ - **Pre-commit hooks:** Install a local pre-commit hook (e.g., `gitleaks protect --staged`) to catch secrets before they reach the remote. This is defense-in-depth — CI scanning is still required.
64
+ - **Remediation SLA:** Secrets detected in CI must be rotated immediately (within 1 hour for production secrets). Assume any secret that reached a commit is compromised, even if the commit was force-pushed away — git history is recoverable.
65
+
66
+ ## Emergency Rotation Procedures
67
+
68
+ When a secret is confirmed or suspected compromised:
69
+
70
+ 1. **Assess scope:** Identify which secret, which environments, and which services are affected. Check audit logs for unauthorized access.
71
+ 2. **Generate new secret:** Create a replacement via the appropriate provider (cloud console, API, or CLI). Do not reuse or derive from the compromised value.
72
+ 3. **Deploy new secret:** Update the secret in the secret manager and deploy affected services. Use blue-green or rolling deployment to avoid downtime.
73
+ 4. **Revoke old secret:** After confirming all services are using the new secret, revoke/delete the old one. Verify revocation by testing that the old value is rejected.
74
+ 5. **Audit impact:** Review access logs for the compromised secret's lifetime. Identify any unauthorized actions taken with the compromised credential.
75
+ 6. **Incident report:** Document the timeline, root cause (how the secret was exposed), blast radius, remediation steps, and preventive measures. File as a security incident per the project's incident response process.
76
+ 7. **Prevent recurrence:** Add scanning rules for the pattern that was missed, review access controls, and update rotation policies if the secret exceeded its rotation window.