hatch3r 1.4.0 → 1.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +10 -6
- package/agents/hatch3r-a11y-auditor.md +13 -2
- package/agents/hatch3r-architect.md +20 -1
- package/agents/hatch3r-ci-watcher.md +25 -1
- package/agents/hatch3r-context-rules.md +15 -3
- package/agents/hatch3r-dependency-auditor.md +23 -2
- package/agents/hatch3r-devops.md +11 -0
- package/agents/hatch3r-docs-writer.md +27 -2
- package/agents/hatch3r-fixer.md +46 -3
- package/agents/hatch3r-implementer.md +19 -1
- package/agents/hatch3r-learnings-loader.md +19 -0
- package/agents/hatch3r-lint-fixer.md +11 -0
- package/agents/hatch3r-perf-profiler.md +21 -1
- package/agents/hatch3r-researcher.md +51 -911
- package/agents/hatch3r-reviewer.md +24 -2
- package/agents/hatch3r-security-auditor.md +20 -0
- package/agents/hatch3r-test-writer.md +24 -0
- package/agents/modes/architecture.md +1 -0
- package/agents/modes/boundary-analysis.md +2 -1
- package/agents/modes/codebase-impact.md +1 -0
- package/agents/modes/complexity-risk.md +1 -0
- package/agents/modes/coverage-analysis.md +1 -0
- package/agents/modes/current-state.md +1 -0
- package/agents/modes/feature-design.md +1 -0
- package/agents/modes/impact-analysis.md +1 -0
- package/agents/modes/library-docs.md +2 -1
- package/agents/modes/migration-path.md +1 -0
- package/agents/modes/prior-art.md +1 -0
- package/agents/modes/refactoring-strategy.md +1 -0
- package/agents/modes/regression.md +1 -0
- package/agents/modes/requirements-elicitation.md +1 -0
- package/agents/modes/risk-assessment.md +1 -0
- package/agents/modes/risk-prioritization.md +1 -0
- package/agents/modes/root-cause.md +1 -0
- package/agents/modes/similar-implementation.md +2 -1
- package/agents/modes/symptom-trace.md +1 -0
- package/agents/modes/test-pattern.md +2 -1
- package/agents/shared/external-knowledge.md +10 -0
- package/agents/shared/quality-charter.md +18 -0
- package/checks/README.md +1 -0
- package/checks/accessibility.md +55 -0
- package/commands/board/pickup-azure-devops.md +1 -0
- package/commands/board/pickup-delegation-multi.md +6 -1
- package/commands/board/pickup-delegation.md +1 -0
- package/commands/board/pickup-github.md +1 -0
- package/commands/board/pickup-gitlab.md +1 -0
- package/commands/board/pickup-modes.md +1 -0
- package/commands/board/pickup-post-impl.md +2 -1
- package/commands/board/shared-azure-devops.md +1 -0
- package/commands/board/shared-board-overview.md +1 -0
- package/commands/board/shared-github.md +1 -0
- package/commands/board/shared-gitlab.md +1 -0
- package/commands/hatch3r-agent-customize.md +1 -0
- package/commands/hatch3r-api-spec.md +1 -0
- package/commands/hatch3r-benchmark.md +4 -3
- package/commands/hatch3r-board-fill.md +52 -9
- package/commands/hatch3r-board-groom.md +69 -5
- package/commands/hatch3r-board-init.md +2 -1
- package/commands/hatch3r-board-pickup.md +1 -0
- package/commands/hatch3r-board-refresh.md +1 -0
- package/commands/hatch3r-board-shared.md +34 -3
- package/commands/hatch3r-bug-plan.md +2 -1
- package/commands/hatch3r-codebase-map.md +4 -3
- package/commands/hatch3r-command-customize.md +2 -1
- package/commands/hatch3r-context-health.md +1 -0
- package/commands/hatch3r-cost-tracking.md +1 -0
- package/commands/hatch3r-debug.md +4 -3
- package/commands/hatch3r-dep-audit.md +3 -0
- package/commands/hatch3r-feature-plan.md +3 -2
- package/commands/hatch3r-healthcheck.md +1 -0
- package/commands/hatch3r-hooks.md +5 -0
- package/commands/hatch3r-learn.md +1 -0
- package/commands/hatch3r-migration-plan.md +3 -2
- package/commands/hatch3r-onboard.md +2 -1
- package/commands/hatch3r-project-spec.md +4 -3
- package/commands/hatch3r-quick-change.md +2 -0
- package/commands/hatch3r-recipe.md +1 -0
- package/commands/hatch3r-refactor-plan.md +2 -1
- package/commands/hatch3r-release.md +4 -1
- package/commands/hatch3r-revision.md +2 -1
- package/commands/hatch3r-roadmap.md +5 -4
- package/commands/hatch3r-rule-customize.md +1 -0
- package/commands/hatch3r-security-audit.md +1 -0
- package/commands/hatch3r-skill-customize.md +1 -0
- package/commands/hatch3r-test-plan.md +3 -2
- package/commands/hatch3r-workflow.md +5 -0
- package/dist/cli/index.js +7467 -4582
- package/dist/cli/index.js.map +1 -1
- package/hooks/hatch3r-ci-failure.md +1 -0
- package/hooks/hatch3r-file-save.md +1 -0
- package/hooks/hatch3r-post-merge.md +1 -0
- package/hooks/hatch3r-pre-commit.md +1 -0
- package/hooks/hatch3r-pre-push.md +1 -0
- package/hooks/hatch3r-session-start.md +1 -0
- package/package.json +19 -4
- package/rules/hatch3r-accessibility-standards.md +2 -1
- package/rules/hatch3r-accessibility-standards.mdc +1 -1
- package/rules/hatch3r-agent-orchestration-detail.md +49 -1
- package/rules/hatch3r-agent-orchestration-detail.mdc +47 -1
- package/rules/hatch3r-agent-orchestration.md +87 -5
- package/rules/hatch3r-agent-orchestration.mdc +85 -5
- package/rules/hatch3r-api-design.md +2 -1
- package/rules/hatch3r-api-design.mdc +1 -1
- package/rules/hatch3r-browser-verification.md +4 -2
- package/rules/hatch3r-browser-verification.mdc +1 -0
- package/rules/hatch3r-ci-cd.md +2 -1
- package/rules/hatch3r-ci-cd.mdc +1 -1
- package/rules/hatch3r-code-standards.md +15 -2
- package/rules/hatch3r-code-standards.mdc +12 -0
- package/rules/hatch3r-component-conventions.md +2 -1
- package/rules/hatch3r-component-conventions.mdc +1 -0
- package/rules/hatch3r-data-classification.md +2 -1
- package/rules/hatch3r-data-classification.mdc +1 -1
- package/rules/hatch3r-deep-context.md +26 -1
- package/rules/hatch3r-deep-context.mdc +25 -1
- package/rules/hatch3r-dependency-management.md +2 -1
- package/rules/hatch3r-dependency-management.mdc +1 -1
- package/rules/hatch3r-feature-flags.md +2 -0
- package/rules/hatch3r-feature-flags.mdc +1 -0
- package/rules/hatch3r-git-conventions.md +2 -1
- package/rules/hatch3r-git-conventions.mdc +2 -1
- package/rules/hatch3r-i18n.md +2 -1
- package/rules/hatch3r-i18n.mdc +1 -0
- package/rules/hatch3r-learning-consult.md +11 -1
- package/rules/hatch3r-learning-consult.mdc +11 -1
- package/rules/hatch3r-migrations.md +2 -1
- package/rules/hatch3r-migrations.mdc +1 -1
- package/rules/hatch3r-observability-logging.md +34 -0
- package/rules/hatch3r-observability-logging.mdc +30 -0
- package/rules/hatch3r-observability-metrics.md +74 -0
- package/rules/hatch3r-observability-metrics.mdc +70 -0
- package/rules/hatch3r-observability-tracing-detail.md +160 -0
- package/rules/hatch3r-observability-tracing-detail.mdc +63 -0
- package/rules/hatch3r-observability-tracing.md +86 -0
- package/rules/hatch3r-observability-tracing.mdc +77 -0
- package/rules/hatch3r-observability.md +9 -448
- package/rules/hatch3r-observability.mdc +7 -448
- package/rules/hatch3r-performance-budgets.md +2 -0
- package/rules/hatch3r-performance-budgets.mdc +1 -0
- package/rules/hatch3r-secrets-management.md +2 -1
- package/rules/hatch3r-secrets-management.mdc +1 -1
- package/rules/hatch3r-security-patterns.md +3 -2
- package/rules/hatch3r-security-patterns.mdc +1 -1
- package/rules/hatch3r-testing.md +12 -2
- package/rules/hatch3r-testing.mdc +10 -1
- package/rules/hatch3r-theming.md +3 -2
- package/rules/hatch3r-theming.mdc +1 -0
- package/rules/hatch3r-tooling-hierarchy.md +3 -2
- package/rules/hatch3r-tooling-hierarchy.mdc +1 -1
- package/skills/hatch3r-a11y-audit/SKILL.md +11 -4
- package/skills/hatch3r-agent-customize/SKILL.md +1 -0
- package/skills/hatch3r-api-spec/SKILL.md +9 -2
- package/skills/hatch3r-architecture-review/SKILL.md +7 -0
- package/skills/hatch3r-bug-fix/SKILL.md +16 -7
- package/skills/hatch3r-ci-pipeline/SKILL.md +8 -1
- package/skills/hatch3r-command-customize/SKILL.md +1 -0
- package/skills/hatch3r-context-health/SKILL.md +23 -2
- package/skills/hatch3r-cost-tracking/SKILL.md +16 -6
- package/skills/hatch3r-customize/SKILL.md +8 -1
- package/skills/hatch3r-dep-audit/SKILL.md +9 -2
- package/skills/hatch3r-feature/SKILL.md +12 -4
- package/skills/hatch3r-gh-agentic-workflows/SKILL.md +7 -0
- package/skills/hatch3r-incident-response/SKILL.md +7 -0
- package/skills/hatch3r-issue-workflow/SKILL.md +8 -1
- package/skills/hatch3r-logical-refactor/SKILL.md +8 -1
- package/skills/hatch3r-migration/SKILL.md +7 -0
- package/skills/hatch3r-perf-audit/SKILL.md +9 -2
- package/skills/hatch3r-pr-creation/SKILL.md +8 -1
- package/skills/hatch3r-qa-validation/SKILL.md +8 -1
- package/skills/hatch3r-recipe/SKILL.md +8 -1
- package/skills/hatch3r-refactor/SKILL.md +10 -2
- package/skills/hatch3r-release/SKILL.md +8 -1
- package/skills/hatch3r-rule-customize/SKILL.md +1 -0
- package/skills/hatch3r-skill-customize/SKILL.md +1 -0
- package/skills/hatch3r-visual-refactor/SKILL.md +12 -5
|
@@ -3,7 +3,9 @@ id: hatch3r-browser-verification
|
|
|
3
3
|
type: rule
|
|
4
4
|
description: Browser-based verification for UI and user-facing changes
|
|
5
5
|
scope: conditional
|
|
6
|
+
globs: "**/*.vue,**/*.jsx,**/*.tsx,**/*.svelte,**/components/**,**/*.html,**/*.css,**/*.scss"
|
|
6
7
|
tags: [review]
|
|
8
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
9
|
---
|
|
8
10
|
# Browser Verification
|
|
9
11
|
|
|
@@ -60,7 +62,7 @@ Commands in the "Does NOT Support" column are documentation-only, planning-only,
|
|
|
60
62
|
|
|
61
63
|
## Verification Protocol
|
|
62
64
|
|
|
63
|
-
### 1.
|
|
65
|
+
### 1. Confirm Dev Server is Running
|
|
64
66
|
|
|
65
67
|
- Check if the project's dev server is already running (check terminal output or process list).
|
|
66
68
|
- If not running, start it in the background using the project's dev command (e.g., `npm run dev`).
|
|
@@ -77,7 +79,7 @@ Commands in the "Does NOT Support" column are documentation-only, planning-only,
|
|
|
77
79
|
|
|
78
80
|
### 3. Capture Evidence
|
|
79
81
|
|
|
80
|
-
- Take screenshots of the affected surfaces showing the change
|
|
82
|
+
- Take screenshots of the affected surfaces showing the change produces the expected visual and interactive result.
|
|
81
83
|
- For before/after changes (visual refactors, bug fixes), capture both states when possible.
|
|
82
84
|
- Note any browser console errors or warnings in the verification summary.
|
|
83
85
|
- Include screenshots in the PR description or attach to the issue.
|
package/rules/hatch3r-ci-cd.md
CHANGED
|
@@ -2,8 +2,9 @@
|
|
|
2
2
|
id: hatch3r-ci-cd
|
|
3
3
|
type: rule
|
|
4
4
|
description: CI/CD pipeline standards covering stage gates, deployment strategies, and rollback procedures
|
|
5
|
-
scope:
|
|
5
|
+
scope: "**/.github/workflows/**,**/Dockerfile*,**/docker-compose*,**/.gitlab-ci*,**/Jenkinsfile,**/azure-pipelines*,**/.circleci/**,**/deploy/**,**/*pipeline*"
|
|
6
6
|
tags: [devops]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
8
|
---
|
|
8
9
|
# CI/CD Standards
|
|
9
10
|
|
package/rules/hatch3r-ci-cd.mdc
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: CI/CD pipeline standards covering stage gates, deployment strategies, and rollback procedures
|
|
3
|
-
|
|
3
|
+
globs: ["**/.github/workflows/**", "**/Dockerfile*", "**/docker-compose*", "**/.gitlab-ci*", "**/Jenkinsfile", "**/azure-pipelines*", "**/.circleci/**", "**/deploy/**", "**/*pipeline*"]
|
|
4
4
|
---
|
|
5
5
|
# CI/CD Standards
|
|
6
6
|
|
|
@@ -3,7 +3,8 @@ id: hatch3r-code-standards
|
|
|
3
3
|
type: rule
|
|
4
4
|
description: Code quality and file naming conventions for the project
|
|
5
5
|
scope: always
|
|
6
|
-
tags: [core]
|
|
6
|
+
tags: [core, lang:typescript]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
8
|
---
|
|
8
9
|
# Code Standards
|
|
9
10
|
|
|
@@ -29,7 +30,7 @@ tags: [core]
|
|
|
29
30
|
### Discriminated Unions
|
|
30
31
|
|
|
31
32
|
- Model domain variants with discriminated unions over polymorphic classes or `type` string checks. Every variant must share a common literal discriminant field (e.g., `kind`, `type`, `status`).
|
|
32
|
-
- Use exhaustive `switch` with a `never` default case
|
|
33
|
+
- Use exhaustive `switch` with a `never` default case so the compiler errors when a new variant is added but not handled.
|
|
33
34
|
|
|
34
35
|
### Branded Types
|
|
35
36
|
|
|
@@ -91,6 +92,18 @@ tags: [core]
|
|
|
91
92
|
- Include `correlationId` in all error logs for tracing across client and server.
|
|
92
93
|
- No secrets, tokens, or PII in error messages or logs.
|
|
93
94
|
|
|
95
|
+
### Error Handling Anti-Patterns (Prohibited)
|
|
96
|
+
|
|
97
|
+
The following patterns are always wrong and must be flagged in review:
|
|
98
|
+
|
|
99
|
+
| Anti-Pattern | Why It Is Wrong | Correct Alternative |
|
|
100
|
+
|-------------|-----------------|---------------------|
|
|
101
|
+
| `catch (e) {}` (empty catch) | Silently swallows errors; failures become invisible | `catch (e) { logger.error('context', e); throw e; }` or handle with Result type |
|
|
102
|
+
| `catch (e) { return null; }` in auth paths | Fail-open: returns "no user" instead of "auth failed" | `catch (e) { throw new AuthError('auth_failed', e); }` |
|
|
103
|
+
| `as any` to fix type errors | Bypasses type safety; hides real type mismatches | Fix the actual type or use a proper type guard |
|
|
104
|
+
| `// @ts-ignore` without linked issue | Permanent type-safety hole | Fix the type error or add `// @ts-expect-error` with issue link |
|
|
105
|
+
| `try { ... } catch { return defaultValue; }` for all errors | Treats transient errors (network) same as permanent ones (validation) | Discriminate error types: retry transient, fail permanent |
|
|
106
|
+
|
|
94
107
|
## Import Ordering
|
|
95
108
|
|
|
96
109
|
Enforce consistent import ordering via linter rules (e.g., `eslint-plugin-import`). The canonical order:
|
|
@@ -88,6 +88,18 @@ alwaysApply: true
|
|
|
88
88
|
- Include `correlationId` in all error logs for tracing across client and server.
|
|
89
89
|
- No secrets, tokens, or PII in error messages or logs.
|
|
90
90
|
|
|
91
|
+
### Error Handling Anti-Patterns (Prohibited)
|
|
92
|
+
|
|
93
|
+
The following patterns are always wrong and must be flagged in review:
|
|
94
|
+
|
|
95
|
+
| Anti-Pattern | Why It Is Wrong | Correct Alternative |
|
|
96
|
+
|-------------|-----------------|---------------------|
|
|
97
|
+
| `catch (e) {}` (empty catch) | Silently swallows errors; failures become invisible | `catch (e) { logger.error('context', e); throw e; }` or handle with Result type |
|
|
98
|
+
| `catch (e) { return null; }` in auth paths | Fail-open: returns "no user" instead of "auth failed" | `catch (e) { throw new AuthError('auth_failed', e); }` |
|
|
99
|
+
| `as any` to fix type errors | Bypasses type safety; hides real type mismatches | Fix the actual type or use a proper type guard |
|
|
100
|
+
| `// @ts-ignore` without linked issue | Permanent type-safety hole | Fix the type error or add `// @ts-expect-error` with issue link |
|
|
101
|
+
| `try { ... } catch { return defaultValue; }` for all errors | Treats transient errors (network) same as permanent ones (validation) | Discriminate error types: retry transient, fail permanent |
|
|
102
|
+
|
|
91
103
|
## Import Ordering
|
|
92
104
|
|
|
93
105
|
Enforce consistent import ordering via linter rules (e.g., `eslint-plugin-import`). The canonical order:
|
|
@@ -5,6 +5,7 @@ description: Rules for component development in web applications
|
|
|
5
5
|
scope: conditional
|
|
6
6
|
globs: src/**/*.vue, src/**/*.tsx, src/**/*.jsx
|
|
7
7
|
tags: [implementation]
|
|
8
|
+
quality_charter: agents/shared/quality-charter.md
|
|
8
9
|
---
|
|
9
10
|
# Component Conventions
|
|
10
11
|
|
|
@@ -78,7 +79,7 @@ tags: [implementation]
|
|
|
78
79
|
- Group related fields with `<fieldset>` and `<legend>` (e.g., "Shipping Address", "Payment Details").
|
|
79
80
|
- Use progressive disclosure for complex forms: show advanced options behind an expandable section or a follow-up step.
|
|
80
81
|
- Autofocus the first input field on form mount.
|
|
81
|
-
-
|
|
82
|
+
- Verify tab order follows the visual layout order by tabbing through the form — never use positive `tabindex` values.
|
|
82
83
|
|
|
83
84
|
### Submit Behavior
|
|
84
85
|
- Disable the submit button when the form has known validation errors (but keep it focusable for screen readers).
|
|
@@ -2,8 +2,9 @@
|
|
|
2
2
|
id: hatch3r-data-classification
|
|
3
3
|
type: rule
|
|
4
4
|
description: Data classification standards covering PII handling, encryption, retention policies, and regulatory compliance
|
|
5
|
-
scope:
|
|
5
|
+
scope: "**/models/**,**/schemas/**,**/schema*,**/database/**,**/db/**,**/*model*,**/*entity*,**/prisma/**,**/drizzle/**,**/*migration*"
|
|
6
6
|
tags: [security]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
8
|
---
|
|
8
9
|
# Data Classification Standards
|
|
9
10
|
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Data classification standards covering PII handling, encryption, retention policies, and regulatory compliance
|
|
3
|
-
|
|
3
|
+
globs: ["**/models/**", "**/schemas/**", "**/schema*", "**/database/**", "**/db/**", "**/*model*", "**/*entity*", "**/prisma/**", "**/drizzle/**", "**/*migration*"]
|
|
4
4
|
---
|
|
5
5
|
# Data Classification Standards
|
|
6
6
|
|
|
@@ -4,6 +4,7 @@ type: rule
|
|
|
4
4
|
description: Adaptive pre-implementation analysis — complexity scoring, requirements elicitation, similar implementation discovery, and transitive dependency tracing before coding
|
|
5
5
|
scope: always
|
|
6
6
|
tags: [core]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
8
|
---
|
|
8
9
|
# Deep Context Analysis
|
|
9
10
|
|
|
@@ -87,8 +88,32 @@ This rule augments — not replaces — the existing Universal Sub-Agent Pipelin
|
|
|
87
88
|
- **`type:bug`**: existing modes `symptom-trace`, `root-cause`, `codebase-impact` + `requirements-elicitation` per tier (bugs often have underspecified reproduction steps)
|
|
88
89
|
- **`type:refactor`**: existing modes `current-state`, `refactoring-strategy`, `migration-path` + `similar-implementation` per tier (refactors benefit most from convention alignment)
|
|
89
90
|
|
|
91
|
+
## Scoring Examples
|
|
92
|
+
|
|
93
|
+
To reduce ambiguity in tier assignment, here are worked examples:
|
|
94
|
+
|
|
95
|
+
**Example 1: "Fix typo in error message" -- Tier 1 (score 0)**
|
|
96
|
+
No signals triggered. Single file, no cross-module impact, no ambiguity.
|
|
97
|
+
|
|
98
|
+
**Example 2: "Add email validation to signup form" -- Tier 2 (score 4)**
|
|
99
|
+
- Multiple layers touched (API + UI): +3
|
|
100
|
+
- Estimated 2-3 files: +0
|
|
101
|
+
- Input validation is security-adjacent but not in a security-sensitive area: +0
|
|
102
|
+
- Clear requirements ("validate email format"): +0
|
|
103
|
+
- May trigger cross-cutting i18n for error messages: +1 (partial cross-cutting)
|
|
104
|
+
|
|
105
|
+
**Example 3: "Migrate auth from session-based to JWT" -- Tier 3 (score 12)**
|
|
106
|
+
- Multiple layers (auth middleware + API + UI + storage): +3
|
|
107
|
+
- Vague term "migrate" (scope unclear): +2
|
|
108
|
+
- Cross-cutting auth concern: +2
|
|
109
|
+
- Security-sensitive area: +2
|
|
110
|
+
- Behavioral contract change (session API to JWT API): +2
|
|
111
|
+
- Estimated >5 files: +1 (partial -- easily >5)
|
|
112
|
+
|
|
113
|
+
When a signal partially applies (e.g., "maybe 5 files, maybe 4"), round down. Tier upgrades from adaptation (see `hatch3r-agent-orchestration-detail`) compensate for underestimates.
|
|
114
|
+
|
|
90
115
|
## Exceptions
|
|
91
116
|
|
|
92
117
|
- **`hatch3r-quick-change` command**: Tier 1 items proceed without research. Tier 2 items get lightweight `similar-implementation` at `quick` depth. Tier 3 items must be routed to `hatch3r-workflow` (hard block).
|
|
93
|
-
- **Trivial single-line edits**: Always Tier 1 regardless of scoring signals. This is the only valid basis for skipping research
|
|
118
|
+
- **Trivial single-line edits**: Always Tier 1 regardless of scoring signals. This is the only valid basis for skipping research -- label-based shortcuts (e.g., `risk:low AND priority:p3`) are not sufficient alone.
|
|
94
119
|
- **`hatch3r-revision` command**: Operates on already-implemented code. Deep context analysis applies to the original implementation, not the revision pass.
|
|
@@ -84,8 +84,32 @@ This rule augments — not replaces — the existing Universal Sub-Agent Pipelin
|
|
|
84
84
|
- **`type:bug`**: existing modes `symptom-trace`, `root-cause`, `codebase-impact` + `requirements-elicitation` per tier (bugs often have underspecified reproduction steps)
|
|
85
85
|
- **`type:refactor`**: existing modes `current-state`, `refactoring-strategy`, `migration-path` + `similar-implementation` per tier (refactors benefit most from convention alignment)
|
|
86
86
|
|
|
87
|
+
## Scoring Examples
|
|
88
|
+
|
|
89
|
+
To reduce ambiguity in tier assignment, here are worked examples:
|
|
90
|
+
|
|
91
|
+
**Example 1: "Fix typo in error message" -- Tier 1 (score 0)**
|
|
92
|
+
No signals triggered. Single file, no cross-module impact, no ambiguity.
|
|
93
|
+
|
|
94
|
+
**Example 2: "Add email validation to signup form" -- Tier 2 (score 4)**
|
|
95
|
+
- Multiple layers touched (API + UI): +3
|
|
96
|
+
- Estimated 2-3 files: +0
|
|
97
|
+
- Input validation is security-adjacent but not in a security-sensitive area: +0
|
|
98
|
+
- Clear requirements ("validate email format"): +0
|
|
99
|
+
- May trigger cross-cutting i18n for error messages: +1 (partial cross-cutting)
|
|
100
|
+
|
|
101
|
+
**Example 3: "Migrate auth from session-based to JWT" -- Tier 3 (score 12)**
|
|
102
|
+
- Multiple layers (auth middleware + API + UI + storage): +3
|
|
103
|
+
- Vague term "migrate" (scope unclear): +2
|
|
104
|
+
- Cross-cutting auth concern: +2
|
|
105
|
+
- Security-sensitive area: +2
|
|
106
|
+
- Behavioral contract change (session API to JWT API): +2
|
|
107
|
+
- Estimated >5 files: +1 (partial -- easily >5)
|
|
108
|
+
|
|
109
|
+
When a signal partially applies (e.g., "maybe 5 files, maybe 4"), round down. Tier upgrades from adaptation (see `hatch3r-agent-orchestration-detail`) compensate for underestimates.
|
|
110
|
+
|
|
87
111
|
## Exceptions
|
|
88
112
|
|
|
89
113
|
- **`hatch3r-quick-change` command**: Tier 1 items proceed without research. Tier 2 items get lightweight `similar-implementation` at `quick` depth. Tier 3 items must be routed to `hatch3r-workflow` (hard block).
|
|
90
|
-
- **Trivial single-line edits**: Always Tier 1 regardless of scoring signals. This is the only valid basis for skipping research
|
|
114
|
+
- **Trivial single-line edits**: Always Tier 1 regardless of scoring signals. This is the only valid basis for skipping research -- label-based shortcuts (e.g., `risk:low AND priority:p3`) are not sufficient alone.
|
|
91
115
|
- **`hatch3r-revision` command**: Operates on already-implemented code. Deep context analysis applies to the original implementation, not the revision pass.
|
|
@@ -2,8 +2,9 @@
|
|
|
2
2
|
id: hatch3r-dependency-management
|
|
3
3
|
type: rule
|
|
4
4
|
description: Rules for managing project dependencies
|
|
5
|
-
scope:
|
|
5
|
+
scope: "**/package.json,**/package-lock.json,**/yarn.lock,**/pnpm-lock.yaml,**/Cargo.toml,**/Cargo.lock,**/requirements*.txt,**/pyproject.toml,**/go.mod,**/go.sum,**/Gemfile*"
|
|
6
6
|
tags: [maintenance]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
8
|
---
|
|
8
9
|
# Dependency Management
|
|
9
10
|
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Rules for managing project dependencies
|
|
3
|
-
|
|
3
|
+
globs: ["**/package.json", "**/package-lock.json", "**/yarn.lock", "**/pnpm-lock.yaml", "**/Cargo.toml", "**/Cargo.lock", "**/requirements*.txt", "**/pyproject.toml", "**/go.mod", "**/go.sum", "**/Gemfile*"]
|
|
4
4
|
---
|
|
5
5
|
# Dependency Management
|
|
6
6
|
|
|
@@ -3,7 +3,9 @@ id: hatch3r-feature-flags
|
|
|
3
3
|
type: rule
|
|
4
4
|
description: Feature flag patterns and lifecycle for the project
|
|
5
5
|
scope: conditional
|
|
6
|
+
globs: "**/*feature-flag*,**/*featureFlag*,**/*feature_flag*,**/config/**"
|
|
6
7
|
tags: [implementation]
|
|
8
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
9
|
---
|
|
8
10
|
# Feature Flags
|
|
9
11
|
|
|
@@ -2,8 +2,9 @@
|
|
|
2
2
|
id: hatch3r-git-conventions
|
|
3
3
|
type: rule
|
|
4
4
|
description: Git commit message and branching conventions
|
|
5
|
-
scope:
|
|
5
|
+
scope: "**/.git/**,**/.gitignore,**/.gitattributes,**/.gitmodules,**/COMMIT_EDITMSG"
|
|
6
6
|
tags: [core]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
8
|
---
|
|
8
9
|
# Git Conventions
|
|
9
10
|
|
package/rules/hatch3r-i18n.md
CHANGED
|
@@ -5,6 +5,7 @@ description: Internationalization, localization, and RTL support conventions for
|
|
|
5
5
|
scope: conditional
|
|
6
6
|
globs: src/**/*.vue, src/**/*.tsx, src/**/*.jsx, src/**/*.ts
|
|
7
7
|
tags: [implementation]
|
|
8
|
+
quality_charter: agents/shared/quality-charter.md
|
|
8
9
|
---
|
|
9
10
|
# Internationalization & RTL
|
|
10
11
|
|
|
@@ -43,7 +44,7 @@ tags: [implementation]
|
|
|
43
44
|
|
|
44
45
|
- Allow 30–40% text expansion for German/Finnish translations relative to English; test UI with pseudo-localization that pads strings.
|
|
45
46
|
- Use `min-height` instead of fixed `height` on text containers to accommodate CJK line-height requirements (1.6–1.8 recommended).
|
|
46
|
-
- Define font stacks per script family: Latin, CJK, Arabic, Devanagari —
|
|
47
|
+
- Define font stacks per script family: Latin, CJK, Arabic, Devanagari — each stack must include a web-safe fallback and `system-ui`.
|
|
47
48
|
- Truncate overflowing text with `text-overflow: ellipsis` plus `overflow: hidden` and `white-space: nowrap` only when semantically safe; provide title/tooltip with full text.
|
|
48
49
|
- Avoid fixed-width containers for translatable text; prefer `min-width` / `max-width` with flex/grid layout.
|
|
49
50
|
|
package/rules/hatch3r-i18n.mdc
CHANGED
|
@@ -1,5 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Internationalization, localization, and RTL support conventions for the project
|
|
3
|
+
globs: ["src/**/*.vue", "src/**/*.tsx", "src/**/*.jsx", "src/**/*.ts", "**/locales/**", "**/i18n/**", "**/*i18n*", "**/*locale*"]
|
|
3
4
|
alwaysApply: false
|
|
4
5
|
---
|
|
5
6
|
# Internationalization & RTL
|
|
@@ -2,8 +2,9 @@
|
|
|
2
2
|
id: hatch3r-learning-consult
|
|
3
3
|
type: rule
|
|
4
4
|
description: Auto-consult project learnings before implementation
|
|
5
|
-
scope:
|
|
5
|
+
scope: "**/.agents/learnings/**,**/learnings/**"
|
|
6
6
|
tags: [core]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
8
|
---
|
|
8
9
|
# Learning Consultation
|
|
9
10
|
|
|
@@ -29,3 +30,12 @@ Before implementing any task, check `.agents/learnings/` for relevant past learn
|
|
|
29
30
|
- **Pitfalls** are highest priority -- always surface these.
|
|
30
31
|
- **Patterns** are surfaced when the current task matches their trigger conditions.
|
|
31
32
|
- **Decisions** are surfaced when the current task might conflict with a past decision.
|
|
33
|
+
|
|
34
|
+
## Consultation Efficiency
|
|
35
|
+
|
|
36
|
+
To avoid excessive token usage during consultation:
|
|
37
|
+
|
|
38
|
+
1. **Scan frontmatter first.** Read only the YAML frontmatter (`tags`, `area`, `category`) of each learning file to determine relevance. Only read the full body of learnings that match the current task.
|
|
39
|
+
2. **Limit surfaced learnings.** Present at most 5 relevant learnings per consultation. If more are relevant, prioritize by confidence level (high > medium > low) and recency.
|
|
40
|
+
3. **Cache consultation results.** If consultation was already performed for this task (e.g., during board-pickup), do not re-consult during skill execution. The orchestrator passes relevant learnings to subagents as part of prompt enrichment.
|
|
41
|
+
4. **Skip when empty.** If `.agents/learnings/` has fewer than 3 files, consultation overhead exceeds value. Skip silently.
|
|
@@ -1,6 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Auto-consult project learnings before implementation
|
|
3
|
-
|
|
3
|
+
globs: "**/.agents/learnings/**,**/learnings/**"
|
|
4
|
+
alwaysApply: false
|
|
4
5
|
---
|
|
5
6
|
# Learning Consultation
|
|
6
7
|
|
|
@@ -26,3 +27,12 @@ Before implementing any task, check `.agents/learnings/` for relevant past learn
|
|
|
26
27
|
- **Pitfalls** are highest priority -- always surface these.
|
|
27
28
|
- **Patterns** are surfaced when the current task matches their trigger conditions.
|
|
28
29
|
- **Decisions** are surfaced when the current task might conflict with a past decision.
|
|
30
|
+
|
|
31
|
+
## Consultation Efficiency
|
|
32
|
+
|
|
33
|
+
To avoid excessive token usage during consultation:
|
|
34
|
+
|
|
35
|
+
1. **Scan frontmatter first.** Read only the YAML frontmatter (`tags`, `area`, `category`) of each learning file to determine relevance. Only read the full body of learnings that match the current task.
|
|
36
|
+
2. **Limit surfaced learnings.** Present at most 5 relevant learnings per consultation. If more are relevant, prioritize by confidence level (high > medium > low) and recency.
|
|
37
|
+
3. **Cache consultation results.** If consultation was already performed for this task (e.g., during board-pickup), do not re-consult during skill execution. The orchestrator passes relevant learnings to subagents as part of prompt enrichment.
|
|
38
|
+
4. **Skip when empty.** If `.agents/learnings/` has fewer than 3 files, consultation overhead exceeds value. Skip silently.
|
|
@@ -2,8 +2,9 @@
|
|
|
2
2
|
id: hatch3r-migrations
|
|
3
3
|
type: rule
|
|
4
4
|
description: Database migration and schema change patterns for the project
|
|
5
|
-
scope:
|
|
5
|
+
scope: "**/migrations/**,**/*migration*,**/migrate/**,**/seeds/**,**/seeders/**,**/prisma/migrations/**,**/drizzle/**,**/knex/**"
|
|
6
6
|
tags: [implementation, brownfield]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
7
8
|
---
|
|
8
9
|
# Migrations
|
|
9
10
|
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Database migration and schema change patterns for the project
|
|
3
|
-
|
|
3
|
+
globs: ["**/migrations/**", "**/*migration*", "**/migrate/**", "**/seeds/**", "**/seeders/**", "**/prisma/migrations/**", "**/drizzle/**", "**/knex/**"]
|
|
4
4
|
---
|
|
5
5
|
# Migrations
|
|
6
6
|
|
|
@@ -0,0 +1,34 @@
|
|
|
1
|
+
---
|
|
2
|
+
id: hatch3r-observability-logging
|
|
3
|
+
type: rule
|
|
4
|
+
description: Structured logging and error reporting conventions for the project
|
|
5
|
+
scope: conditional
|
|
6
|
+
globs: "**/*log*,**/*logger*,**/*logging*,**/*error*,**/observability/**"
|
|
7
|
+
tags: [devops]
|
|
8
|
+
quality_charter: agents/shared/quality-charter.md
|
|
9
|
+
---
|
|
10
|
+
# Observability -- Logging & Error Reporting
|
|
11
|
+
|
|
12
|
+
Logging and error reporting conventions. For metrics, SLOs, alerting, and dashboards see `hatch3r-observability-metrics`. For distributed tracing and OpenTelemetry conventions see `hatch3r-observability-tracing`.
|
|
13
|
+
|
|
14
|
+
## Structured Logging
|
|
15
|
+
|
|
16
|
+
- Use structured JSON logging. No `console.log` in production code.
|
|
17
|
+
- Log levels: `error` (failures), `warn` (degraded), `info` (state changes), `debug` (dev only).
|
|
18
|
+
- Every log entry includes `correlationId` and `userId` (if available).
|
|
19
|
+
- Never log secrets, PII, tokens, passwords, or sensitive content.
|
|
20
|
+
- Instrument key operations with timing metrics. Serverless functions log execution time and outcome.
|
|
21
|
+
- Client-side: log errors to a sink (e.g., error reporting service), not just `console.error`.
|
|
22
|
+
- Prefer event-based metrics over polling. Trace user flows end-to-end with `correlationId`.
|
|
23
|
+
- Respect performance budgets: logging must not add > 10ms latency to hot paths.
|
|
24
|
+
- Include `service`, `environment`, and `version` fields in every log entry for filtering.
|
|
25
|
+
- Use log sampling for high-volume debug logs in production (e.g., 1% sample rate).
|
|
26
|
+
|
|
27
|
+
## Structured Error Reporting
|
|
28
|
+
|
|
29
|
+
- Integrate Sentry (or equivalent) for automated error capture in both server and client environments.
|
|
30
|
+
- Configure release tracking: tag errors with `release` (git SHA or semver) and upload source maps for readable stack traces.
|
|
31
|
+
- Enable breadcrumbs: capture the last 50 user actions, network requests, and console messages leading to an error.
|
|
32
|
+
- Error grouping: use custom fingerprints for domain-specific errors to prevent over-grouping. Default fingerprinting is acceptable for unhandled exceptions.
|
|
33
|
+
- Enrich error context with `correlationId`, `userId`, environment, and relevant business state. Never attach PII or secrets.
|
|
34
|
+
- Set sample rates: 100% for errors, 10% for transactions in production. Adjust based on volume and budget.
|
|
@@ -0,0 +1,30 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Structured logging and error reporting conventions for the project
|
|
3
|
+
globs: ["**/*log*", "**/*logger*", "**/*logging*", "**/*error*", "**/observability/**"]
|
|
4
|
+
alwaysApply: false
|
|
5
|
+
---
|
|
6
|
+
# Observability -- Logging & Error Reporting
|
|
7
|
+
|
|
8
|
+
Logging and error reporting conventions. For metrics, SLOs, alerting, and dashboards see `hatch3r-observability-metrics`. For distributed tracing and OpenTelemetry conventions see `hatch3r-observability-tracing`.
|
|
9
|
+
|
|
10
|
+
## Structured Logging
|
|
11
|
+
|
|
12
|
+
- Use structured JSON logging. No `console.log` in production code.
|
|
13
|
+
- Log levels: `error` (failures), `warn` (degraded), `info` (state changes), `debug` (dev only).
|
|
14
|
+
- Every log entry includes `correlationId` and `userId` (if available).
|
|
15
|
+
- Never log secrets, PII, tokens, passwords, or sensitive content.
|
|
16
|
+
- Instrument key operations with timing metrics. Serverless functions log execution time and outcome.
|
|
17
|
+
- Client-side: log errors to a sink (e.g., error reporting service), not just `console.error`.
|
|
18
|
+
- Prefer event-based metrics over polling. Trace user flows end-to-end with `correlationId`.
|
|
19
|
+
- Respect performance budgets: logging must not add > 10ms latency to hot paths.
|
|
20
|
+
- Include `service`, `environment`, and `version` fields in every log entry for filtering.
|
|
21
|
+
- Use log sampling for high-volume debug logs in production (e.g., 1% sample rate).
|
|
22
|
+
|
|
23
|
+
## Structured Error Reporting
|
|
24
|
+
|
|
25
|
+
- Integrate Sentry (or equivalent) for automated error capture in both server and client environments.
|
|
26
|
+
- Configure release tracking: tag errors with `release` (git SHA or semver) and upload source maps for readable stack traces.
|
|
27
|
+
- Enable breadcrumbs: capture the last 50 user actions, network requests, and console messages leading to an error.
|
|
28
|
+
- Error grouping: use custom fingerprints for domain-specific errors to prevent over-grouping. Default fingerprinting is acceptable for unhandled exceptions.
|
|
29
|
+
- Enrich error context with `correlationId`, `userId`, environment, and relevant business state. Never attach PII or secrets.
|
|
30
|
+
- Set sample rates: 100% for errors, 10% for transactions in production. Adjust based on volume and budget.
|
|
@@ -0,0 +1,74 @@
|
|
|
1
|
+
---
|
|
2
|
+
id: hatch3r-observability-metrics
|
|
3
|
+
type: rule
|
|
4
|
+
description: Metrics, SLO/SLI definitions, alerting, and dashboard conventions for the project
|
|
5
|
+
scope: conditional
|
|
6
|
+
globs: "**/*metric*,**/*slo*,**/*sli*,**/*alert*,**/*dashboard*,**/observability/**"
|
|
7
|
+
tags: [devops]
|
|
8
|
+
quality_charter: agents/shared/quality-charter.md
|
|
9
|
+
---
|
|
10
|
+
# Observability -- Metrics, SLOs & Alerting
|
|
11
|
+
|
|
12
|
+
Metrics, SLO/SLI, alerting, and dashboard conventions. For structured logging see `hatch3r-observability-logging`. For distributed tracing and OpenTelemetry conventions see `hatch3r-observability-tracing`.
|
|
13
|
+
|
|
14
|
+
## Metrics
|
|
15
|
+
|
|
16
|
+
- Use OpenTelemetry Metrics SDK. Expose Prometheus-compatible `/metrics` endpoint for scraping where applicable.
|
|
17
|
+
- Metric naming: `{service}.{domain}.{metric}_{unit}` in snake_case. Example: `api.auth.login_duration_ms`.
|
|
18
|
+
- Instrument types and when to use:
|
|
19
|
+
|
|
20
|
+
| Instrument | Use Case | Example |
|
|
21
|
+
| ----------- | ---------------------------------- | -------------------------------- |
|
|
22
|
+
| Counter | Monotonically increasing totals | `http.requests_total` |
|
|
23
|
+
| Histogram | Distributions (latency, size) | `http.request_duration_ms` |
|
|
24
|
+
| Gauge | Point-in-time values | `db.connection_pool_active` |
|
|
25
|
+
| UpDownCounter | Values that increase and decrease | `queue.messages_pending` |
|
|
26
|
+
|
|
27
|
+
- Histogram buckets for latency: `[5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]` ms.
|
|
28
|
+
- Cardinality management: never use unbounded values (user IDs, request paths with params) as metric labels. Cap label cardinality to < 100 unique values per metric.
|
|
29
|
+
- Custom business metrics: track domain-significant events (sign-ups, purchases, feature usage) as counters with relevant dimensions.
|
|
30
|
+
|
|
31
|
+
## SLO / SLI Definitions
|
|
32
|
+
|
|
33
|
+
- Define SLIs as ratios of good events to total events, measured from the user's perspective.
|
|
34
|
+
- Standard SLIs:
|
|
35
|
+
|
|
36
|
+
| SLI | Definition | Measurement Source |
|
|
37
|
+
| ---------------- | --------------------------------------------- | ------------------------ |
|
|
38
|
+
| Availability | Requests returning non-5xx / total requests | Load balancer logs |
|
|
39
|
+
| Latency | Requests completing < threshold / total | Tracing p99 |
|
|
40
|
+
| Error rate | Failed operations / total operations | Application metrics |
|
|
41
|
+
| Freshness | Data updated within SLA / total records | Background job metrics |
|
|
42
|
+
|
|
43
|
+
- SLO targets: set per-service. Typical starting points: 99.9% availability (43 min/month budget), p99 latency < 500ms.
|
|
44
|
+
- Error budgets: `budget = 1 - SLO_target`. Track remaining budget on a rolling 30-day window.
|
|
45
|
+
- Burn rate alerts: use multi-window approach (short + long window). Fast-burn alert: 2% budget consumed in 1 hour. Slow-burn alert: 5% consumed in 6 hours. Alert only when both windows confirm.
|
|
46
|
+
|
|
47
|
+
## Alerting
|
|
48
|
+
|
|
49
|
+
| Severity | Criteria | Response Time | Notification |
|
|
50
|
+
| -------- | ----------------------------------- | ------------- | ------------------- |
|
|
51
|
+
| P1 | Service down, data loss risk | 15 min | Page on-call + Slack |
|
|
52
|
+
| P2 | Degraded performance, SLO at risk | 1 hour | Page on-call |
|
|
53
|
+
| P3 | Non-critical issue, workaround exists | Next business day | Slack channel |
|
|
54
|
+
| P4 | Cosmetic / low-impact | Sprint backlog | Ticket only |
|
|
55
|
+
|
|
56
|
+
- Every alert must link to a runbook with: symptoms, likely causes, diagnostic steps, remediation actions.
|
|
57
|
+
- Alert fatigue prevention: tune thresholds to < 5 actionable alerts per on-call shift. Suppress duplicate alerts within a 10-minute dedup window.
|
|
58
|
+
- Route alerts by service ownership. Use escalation policies: if P1/P2 unacknowledged in 15 min, escalate to secondary.
|
|
59
|
+
- Review alert quality monthly: snooze/delete alerts with < 20% action rate.
|
|
60
|
+
|
|
61
|
+
## Dashboard Standards
|
|
62
|
+
|
|
63
|
+
- Required dashboards per service:
|
|
64
|
+
|
|
65
|
+
| Dashboard | Contents |
|
|
66
|
+
| ---------------- | ----------------------------------------------------------- |
|
|
67
|
+
| Service Health | Request rate, error rate, latency p50/p95/p99, saturation |
|
|
68
|
+
| Business Metrics | Key domain counters, conversion funnels, feature adoption |
|
|
69
|
+
| Dependencies | Upstream/downstream latency, error rates, circuit breaker state |
|
|
70
|
+
| Infrastructure | CPU, memory, disk, connection pools, queue depth |
|
|
71
|
+
|
|
72
|
+
- Dashboard-as-code: define dashboards in version-controlled JSON/YAML (Grafana provisioning, Terraform, or equivalent). No manual dashboard creation in production.
|
|
73
|
+
- Every dashboard panel includes: descriptive title, unit labels, threshold lines for SLO targets, and a link to the relevant runbook or alert.
|
|
74
|
+
- Review dashboards quarterly: remove unused panels, update thresholds, verify data source accuracy.
|
|
@@ -0,0 +1,70 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Metrics, SLO/SLI definitions, alerting, and dashboard conventions for the project
|
|
3
|
+
globs: ["**/*metric*", "**/*slo*", "**/*sli*", "**/*alert*", "**/*dashboard*", "**/observability/**"]
|
|
4
|
+
alwaysApply: false
|
|
5
|
+
---
|
|
6
|
+
# Observability -- Metrics, SLOs & Alerting
|
|
7
|
+
|
|
8
|
+
Metrics, SLO/SLI, alerting, and dashboard conventions. For structured logging see `hatch3r-observability-logging`. For distributed tracing and OpenTelemetry conventions see `hatch3r-observability-tracing`.
|
|
9
|
+
|
|
10
|
+
## Metrics
|
|
11
|
+
|
|
12
|
+
- Use OpenTelemetry Metrics SDK. Expose Prometheus-compatible `/metrics` endpoint for scraping where applicable.
|
|
13
|
+
- Metric naming: `{service}.{domain}.{metric}_{unit}` in snake_case. Example: `api.auth.login_duration_ms`.
|
|
14
|
+
- Instrument types and when to use:
|
|
15
|
+
|
|
16
|
+
| Instrument | Use Case | Example |
|
|
17
|
+
| ----------- | ---------------------------------- | -------------------------------- |
|
|
18
|
+
| Counter | Monotonically increasing totals | `http.requests_total` |
|
|
19
|
+
| Histogram | Distributions (latency, size) | `http.request_duration_ms` |
|
|
20
|
+
| Gauge | Point-in-time values | `db.connection_pool_active` |
|
|
21
|
+
| UpDownCounter | Values that increase and decrease | `queue.messages_pending` |
|
|
22
|
+
|
|
23
|
+
- Histogram buckets for latency: `[5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]` ms.
|
|
24
|
+
- Cardinality management: never use unbounded values (user IDs, request paths with params) as metric labels. Cap label cardinality to < 100 unique values per metric.
|
|
25
|
+
- Custom business metrics: track domain-significant events (sign-ups, purchases, feature usage) as counters with relevant dimensions.
|
|
26
|
+
|
|
27
|
+
## SLO / SLI Definitions
|
|
28
|
+
|
|
29
|
+
- Define SLIs as ratios of good events to total events, measured from the user's perspective.
|
|
30
|
+
- Standard SLIs:
|
|
31
|
+
|
|
32
|
+
| SLI | Definition | Measurement Source |
|
|
33
|
+
| ---------------- | --------------------------------------------- | ------------------------ |
|
|
34
|
+
| Availability | Requests returning non-5xx / total requests | Load balancer logs |
|
|
35
|
+
| Latency | Requests completing < threshold / total | Tracing p99 |
|
|
36
|
+
| Error rate | Failed operations / total operations | Application metrics |
|
|
37
|
+
| Freshness | Data updated within SLA / total records | Background job metrics |
|
|
38
|
+
|
|
39
|
+
- SLO targets: set per-service. Typical starting points: 99.9% availability (43 min/month budget), p99 latency < 500ms.
|
|
40
|
+
- Error budgets: `budget = 1 - SLO_target`. Track remaining budget on a rolling 30-day window.
|
|
41
|
+
- Burn rate alerts: use multi-window approach (short + long window). Fast-burn alert: 2% budget consumed in 1 hour. Slow-burn alert: 5% consumed in 6 hours. Alert only when both windows confirm.
|
|
42
|
+
|
|
43
|
+
## Alerting
|
|
44
|
+
|
|
45
|
+
| Severity | Criteria | Response Time | Notification |
|
|
46
|
+
| -------- | ----------------------------------- | ------------- | ------------------- |
|
|
47
|
+
| P1 | Service down, data loss risk | 15 min | Page on-call + Slack |
|
|
48
|
+
| P2 | Degraded performance, SLO at risk | 1 hour | Page on-call |
|
|
49
|
+
| P3 | Non-critical issue, workaround exists | Next business day | Slack channel |
|
|
50
|
+
| P4 | Cosmetic / low-impact | Sprint backlog | Ticket only |
|
|
51
|
+
|
|
52
|
+
- Every alert must link to a runbook with: symptoms, likely causes, diagnostic steps, remediation actions.
|
|
53
|
+
- Alert fatigue prevention: tune thresholds to < 5 actionable alerts per on-call shift. Suppress duplicate alerts within a 10-minute dedup window.
|
|
54
|
+
- Route alerts by service ownership. Use escalation policies: if P1/P2 unacknowledged in 15 min, escalate to secondary.
|
|
55
|
+
- Review alert quality monthly: snooze/delete alerts with < 20% action rate.
|
|
56
|
+
|
|
57
|
+
## Dashboard Standards
|
|
58
|
+
|
|
59
|
+
- Required dashboards per service:
|
|
60
|
+
|
|
61
|
+
| Dashboard | Contents |
|
|
62
|
+
| ---------------- | ----------------------------------------------------------- |
|
|
63
|
+
| Service Health | Request rate, error rate, latency p50/p95/p99, saturation |
|
|
64
|
+
| Business Metrics | Key domain counters, conversion funnels, feature adoption |
|
|
65
|
+
| Dependencies | Upstream/downstream latency, error rates, circuit breaker state |
|
|
66
|
+
| Infrastructure | CPU, memory, disk, connection pools, queue depth |
|
|
67
|
+
|
|
68
|
+
- Dashboard-as-code: define dashboards in version-controlled JSON/YAML (Grafana provisioning, Terraform, or equivalent). No manual dashboard creation in production.
|
|
69
|
+
- Every dashboard panel includes: descriptive title, unit labels, threshold lines for SLO targets, and a link to the relevant runbook or alert.
|
|
70
|
+
- Review dashboards quarterly: remove unused panels, update thresholds, verify data source accuracy.
|