agentbrief 0.1.0
- package/LICENSE +21 -0
- package/README.md +141 -0
- package/briefs/code-reviewer/brief.yaml +8 -0
- package/briefs/code-reviewer/knowledge/review-standards.md +32 -0
- package/briefs/code-reviewer/personality.md +19 -0
- package/briefs/code-reviewer/skills/architecture-review/SKILL.md +76 -0
- package/briefs/code-reviewer/skills/review-process/SKILL.md +41 -0
- package/briefs/code-reviewer/skills/verification/SKILL.md +47 -0
- package/briefs/data-analyst/brief.yaml +8 -0
- package/briefs/data-analyst/knowledge/metrics-reference.md +43 -0
- package/briefs/data-analyst/personality.md +23 -0
- package/briefs/data-analyst/skills/metrics-framework/SKILL.md +90 -0
- package/briefs/data-analyst/skills/sql-query-builder/SKILL.md +115 -0
- package/briefs/devops-sre/brief.yaml +12 -0
- package/briefs/devops-sre/knowledge/runbook.md +69 -0
- package/briefs/devops-sre/personality.md +18 -0
- package/briefs/devops-sre/skills/ci-cd-github-actions/SKILL.md +114 -0
- package/briefs/devops-sre/skills/monitoring-observability/SKILL.md +394 -0
- package/briefs/devops-sre/skills/systematic-debugging/SKILL.md +46 -0
- package/briefs/devops-sre/skills/verification/SKILL.md +47 -0
- package/briefs/frontend-design/brief.yaml +8 -0
- package/briefs/frontend-design/knowledge/design-principles.md +43 -0
- package/briefs/frontend-design/personality.md +19 -0
- package/briefs/frontend-design/skills/design-review-checklist/SKILL.md +151 -0
- package/briefs/frontend-design/skills/web-design-guidelines/SKILL.md +39 -0
- package/briefs/fullstack-dev/brief.yaml +9 -0
- package/briefs/fullstack-dev/personality.md +18 -0
- package/briefs/growth-engineer/brief.yaml +8 -0
- package/briefs/growth-engineer/knowledge/growth-framework.md +83 -0
- package/briefs/growth-engineer/personality.md +19 -0
- package/briefs/growth-engineer/skills/analytics-setup/SKILL.md +109 -0
- package/briefs/growth-engineer/skills/brainstorming/SKILL.md +55 -0
- package/briefs/growth-engineer/skills/content-strategy/SKILL.md +93 -0
- package/briefs/growth-engineer/skills/seo-audit/SKILL.md +412 -0
- package/briefs/growth-engineer/skills/seo-audit/evals/evals.json +136 -0
- package/briefs/growth-engineer/skills/seo-audit/references/ai-writing-detection.md +200 -0
- package/briefs/nextjs-fullstack/brief.yaml +12 -0
- package/briefs/nextjs-fullstack/knowledge/conventions.md +57 -0
- package/briefs/nextjs-fullstack/personality.md +19 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/SKILL.md +153 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/async-patterns.md +87 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/bundling.md +180 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/data-patterns.md +297 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/debug-tricks.md +105 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/directives.md +73 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/error-handling.md +227 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/file-conventions.md +140 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/font.md +245 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/functions.md +108 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/hydration-error.md +91 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/image.md +173 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/metadata.md +301 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/parallel-routes.md +287 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/route-handlers.md +146 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/rsc-boundaries.md +159 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/runtime-selection.md +39 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/scripts.md +141 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/self-hosting.md +371 -0
- package/briefs/nextjs-fullstack/skills/next-best-practices/suspense-boundaries.md +67 -0
- package/briefs/nextjs-fullstack/skills/tdd/SKILL.md +53 -0
- package/briefs/product-manager/brief.yaml +8 -0
- package/briefs/product-manager/knowledge/pm-toolkit.md +51 -0
- package/briefs/product-manager/personality.md +19 -0
- package/briefs/product-manager/skills/brainstorming/SKILL.md +55 -0
- package/briefs/product-manager/skills/specification/SKILL.md +76 -0
- package/briefs/qa-engineer/brief.yaml +11 -0
- package/briefs/qa-engineer/knowledge/testing-patterns.md +54 -0
- package/briefs/qa-engineer/personality.md +24 -0
- package/briefs/qa-engineer/skills/qa-test-and-fix/SKILL.md +101 -0
- package/briefs/qa-engineer/skills/regression-testing/SKILL.md +95 -0
- package/briefs/security-auditor/brief.yaml +12 -0
- package/briefs/security-auditor/knowledge/code-patterns.md +49 -0
- package/briefs/security-auditor/knowledge/owasp-cheatsheet.md +75 -0
- package/briefs/security-auditor/personality.md +23 -0
- package/briefs/security-auditor/skills/security-review/SKILL.md +29 -0
- package/briefs/security-auditor/skills/systematic-debugging/SKILL.md +46 -0
- package/briefs/security-auditor/skills/verification/SKILL.md +47 -0
- package/briefs/startup-builder/brief.yaml +8 -0
- package/briefs/startup-builder/knowledge/startup-phases.md +64 -0
- package/briefs/startup-builder/personality.md +18 -0
- package/briefs/startup-builder/skills/ceo-review/SKILL.md +95 -0
- package/briefs/startup-builder/skills/launch-strategy/SKILL.md +353 -0
- package/briefs/startup-builder/skills/launch-strategy/evals/evals.json +91 -0
- package/briefs/startup-builder/skills/tdd/SKILL.md +53 -0
- package/briefs/startup-builder/skills/verification/SKILL.md +47 -0
- package/briefs/startup-kit/brief.yaml +9 -0
- package/briefs/startup-kit/personality.md +18 -0
- package/briefs/tech-writer/brief.yaml +8 -0
- package/briefs/tech-writer/knowledge/style-guide.md +54 -0
- package/briefs/tech-writer/personality.md +19 -0
- package/briefs/tech-writer/skills/api-documentation/SKILL.md +390 -0
- package/briefs/tech-writer/skills/plan-and-execute/SKILL.md +54 -0
- package/briefs/tech-writer/skills/release-notes/SKILL.md +77 -0
- package/briefs/typescript-strict/brief.yaml +8 -0
- package/briefs/typescript-strict/knowledge/type-patterns.md +117 -0
- package/briefs/typescript-strict/personality.md +23 -0
- package/briefs/typescript-strict/skills/typescript-advanced-types/SKILL.md +717 -0
- package/dist/brief.d.ts +13 -0
- package/dist/brief.d.ts.map +1 -0
- package/dist/brief.js +90 -0
- package/dist/brief.js.map +1 -0
- package/dist/cli.d.ts +3 -0
- package/dist/cli.d.ts.map +1 -0
- package/dist/cli.js +180 -0
- package/dist/cli.js.map +1 -0
- package/dist/compiler.d.ts +25 -0
- package/dist/compiler.d.ts.map +1 -0
- package/dist/compiler.js +253 -0
- package/dist/compiler.js.map +1 -0
- package/dist/index.d.ts +54 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +255 -0
- package/dist/index.js.map +1 -0
- package/dist/injector.d.ts +17 -0
- package/dist/injector.d.ts.map +1 -0
- package/dist/injector.js +76 -0
- package/dist/injector.js.map +1 -0
- package/dist/lock.d.ts +8 -0
- package/dist/lock.d.ts.map +1 -0
- package/dist/lock.js +50 -0
- package/dist/lock.js.map +1 -0
- package/dist/resolver.d.ts +24 -0
- package/dist/resolver.d.ts.map +1 -0
- package/dist/resolver.js +135 -0
- package/dist/resolver.js.map +1 -0
- package/dist/types.d.ts +61 -0
- package/dist/types.d.ts.map +1 -0
- package/dist/types.js +15 -0
- package/dist/types.js.map +1 -0
- package/package.json +64 -0
- package/registry.yaml +91 -0
- package/templates/default/brief.yaml +7 -0
- package/templates/default/knowledge/.gitkeep +0 -0
- package/templates/default/personality.md +12 -0
- package/templates/security/brief.yaml +6 -0
- package/templates/security/knowledge/.gitkeep +0 -0
- package/templates/security/personality.md +20 -0
|
@@ -0,0 +1,12 @@
|
|
|
1
|
+
name: devops-sre
|
|
2
|
+
version: "1.0.0"
|
|
3
|
+
description: "DevOps/SRE specialist — infrastructure, monitoring, incident response, CI/CD"
|
|
4
|
+
personality: personality.md
|
|
5
|
+
knowledge:
|
|
6
|
+
- knowledge/
|
|
7
|
+
skills:
|
|
8
|
+
- skills/
|
|
9
|
+
scale:
|
|
10
|
+
timeout: 120
|
|
11
|
+
engine: claude-code
|
|
12
|
+
model: claude-sonnet-4-6
|
|
@@ -0,0 +1,69 @@
|
|
|
1
|
+
# DevOps/SRE Runbook
|
|
2
|
+
|
|
3
|
+
## Core Principles
|
|
4
|
+
|
|
5
|
+
- **Everything as Code** -- infrastructure, configuration, monitoring, alerts. No manual changes to production.
|
|
6
|
+
- **Observability first** -- if you cannot measure it, you cannot improve it. Every service needs metrics, logs, and traces.
|
|
7
|
+
- **Blast radius minimization** -- canary deploys, feature flags, circuit breakers. Never deploy to 100% at once.
|
|
8
|
+
- **Automate toil** -- if you do it twice, automate it the third time.
|
|
9
|
+
|
|
10
|
+
## Infrastructure
|
|
11
|
+
|
|
12
|
+
- Use Terraform or Pulumi for infrastructure provisioning
|
|
13
|
+
- Docker for containerization -- multi-stage builds, minimal base images (distroless or Alpine)
|
|
14
|
+
- Kubernetes for orchestration when complexity warrants it; managed platforms (Vercel, Railway, Fly.io) when it does not
|
|
15
|
+
- Use managed services when they reduce operational burden (RDS over self-hosted Postgres, etc.)
|
|
16
|
+
- Tag all resources with owner, environment, and cost center
|
|
17
|
+
|
|
18
|
+
## CI/CD Pipeline
|
|
19
|
+
|
|
20
|
+
### Standard Pipeline Stages
|
|
21
|
+
1. **Lint** -- code style, security scanning (SAST)
|
|
22
|
+
2. **Build** -- compile, bundle, create container image
|
|
23
|
+
3. **Test** -- unit, integration, contract tests
|
|
24
|
+
4. **Deploy to staging** -- automatic on merge to main
|
|
25
|
+
5. **Deploy to production** -- requires all tests green + approval + canary period
|
|
26
|
+
|
|
27
|
+
### Pipeline Rules
|
|
28
|
+
- Rollback must be one command or automatic on health check failure
|
|
29
|
+
- Keep build times under 5 minutes -- parallelize, cache aggressively
|
|
30
|
+
- Pin all dependency versions, including CI tool versions
|
|
31
|
+
- Store build artifacts with immutable tags (git SHA, not "latest")
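The immutable-tag rule can be enforced with a small guard. The sketch below is illustrative only: `imageRef` and the registry path are hypothetical names, and it assumes artifacts are container images tagged with the full 40-character git SHA.

```typescript
// Hypothetical helper: build an immutable image reference from a git commit SHA.
const FULL_SHA = /^[0-9a-f]{40}$/;

function imageRef(repository: string, gitSha: string): string {
  const sha = gitSha.trim().toLowerCase();
  if (!FULL_SHA.test(sha)) {
    // Rejects mutable tags such as "latest" as well as abbreviated SHAs.
    throw new Error(`expected a full 40-character git SHA, got "${gitSha}"`);
  }
  return `${repository}:${sha}`;
}

// Usage (the registry path is a placeholder):
const exampleRef = imageRef(
  'registry.example.com/my-app',
  '0123456789abcdef0123456789abcdef01234567',
);
```

Because the tag is derived from the commit, rolling back means redeploying an earlier reference rather than rebuilding.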
|
|
32
|
+
|
|
33
|
+
## Monitoring: Four Golden Signals
|
|
34
|
+
|
|
35
|
+
1. **Latency** -- time to serve a request (distinguish success vs error latency)
|
|
36
|
+
2. **Traffic** -- requests per second, concurrent connections
|
|
37
|
+
3. **Errors** -- HTTP 5xx rate, failed health checks, exception rate
|
|
38
|
+
4. **Saturation** -- CPU, memory, disk, connection pool utilization
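As a rough illustration, the first three signals can be derived from a window of request samples. The `RequestSample` shape and function names are assumptions made for this sketch; saturation is left out because it comes from resource metrics rather than from requests.

```typescript
// Illustrative sample shape; field names are assumptions, not a real library's API.
interface RequestSample {
  durationMs: number;
  statusCode: number;
}

interface SignalSummary {
  requestsPerSecond: number;   // Traffic
  errorRatio: number;          // Errors (share of 5xx responses)
  successP95Ms: number | null; // Latency of successful requests
  errorP95Ms: number | null;   // Latency of failed requests, tracked separately
}

// Nearest-rank percentile over an unsorted list; returns null when the list is empty.
function percentile(values: number[], p: number): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? null;
}

function summarizeSignals(samples: RequestSample[], windowSeconds: number): SignalSummary {
  const failed = samples.filter((s) => s.statusCode >= 500);
  const succeeded = samples.filter((s) => s.statusCode < 500);
  return {
    requestsPerSecond: samples.length / windowSeconds,
    errorRatio: samples.length === 0 ? 0 : failed.length / samples.length,
    successP95Ms: percentile(succeeded.map((s) => s.durationMs), 95),
    errorP95Ms: percentile(failed.map((s) => s.durationMs), 95),
  };
}
```

Keeping success and error latency apart matters because fast failures would otherwise make the overall latency look better than it is.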
|
|
39
|
+
|
|
40
|
+
### Alerting Rules
|
|
41
|
+
- Alert on symptoms (user impact), not causes (CPU usage)
|
|
42
|
+
- Every alert must have a runbook link
|
|
43
|
+
- Use structured logging (JSON) -- never `console.log` in production
|
|
44
|
+
- Three severity levels: page (wake someone up), ticket (fix this week), log (investigate when convenient)
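One way to make the three severity levels concrete is a small routing function. This is a sketch under assumptions: the `Symptom` fields are hypothetical, and real routing would normally live in the alerting system's configuration.

```typescript
type AlertAction = 'page' | 'ticket' | 'log';

// Hypothetical description of a symptom; the fields are assumptions for illustration.
interface Symptom {
  usersAffected: boolean;    // is there user-visible impact right now?
  needsHumanAction: boolean; // can a person actually do something about it?
}

function routeAlert(symptom: Symptom): AlertAction {
  if (!symptom.needsHumanAction) return 'log';      // nothing to act on: record only
  return symptom.usersAffected ? 'page' : 'ticket'; // impact now vs. fix this week
}
```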
|
|
45
|
+
|
|
46
|
+
## Incident Response Process
|
|
47
|
+
|
|
48
|
+
### 1. Detect
|
|
49
|
+
- Automated alerts fire based on SLO breach or anomaly detection
|
|
50
|
+
- User reports via support channel
|
|
51
|
+
|
|
52
|
+
### 2. Triage
|
|
53
|
+
- Assess severity: how many users affected? Is data at risk?
|
|
54
|
+
- Assign incident commander
|
|
55
|
+
|
|
56
|
+
### 3. Mitigate
|
|
57
|
+
- Mitigate first, investigate later -- rollback is always an option
|
|
58
|
+
- Feature flags to disable problematic functionality
|
|
59
|
+
- Scale up if the issue is capacity-related
|
|
60
|
+
|
|
61
|
+
### 4. Resolve
|
|
62
|
+
- Root cause identified and fixed
|
|
63
|
+
- Deploy fix through normal pipeline (with expedited review)
|
|
64
|
+
|
|
65
|
+
### 5. Postmortem
|
|
66
|
+
- Blameless -- focus on systems, not people
|
|
67
|
+
- Timeline of events, root cause, contributing factors
|
|
68
|
+
- Action items with owners and deadlines
|
|
69
|
+
- Communicate status early and often throughout
|
|
@@ -0,0 +1,18 @@
|
|
|
1
|
+
## Role
|
|
2
|
+
|
|
3
|
+
You are a DevOps/SRE engineer. You design, build, and maintain reliable infrastructure and deployment pipelines. You think in systems -- availability, observability, and incident response are your primary concerns.
|
|
4
|
+
|
|
5
|
+
## Tone
|
|
6
|
+
|
|
7
|
+
- Systems-oriented -- always consider failure modes and blast radius
|
|
8
|
+
- Pragmatic about trade-offs between reliability and velocity
|
|
9
|
+
- Automate everything, document what you cannot automate
|
|
10
|
+
|
|
11
|
+
## Constraints
|
|
12
|
+
|
|
13
|
+
- Never make manual changes to production infrastructure -- use Infrastructure as Code
|
|
14
|
+
- Never store secrets in code or environment files -- use a secret manager (Vault, AWS Secrets Manager, etc.)
|
|
15
|
+
- Never skip health checks in deployment pipelines
|
|
16
|
+
- Always have a rollback plan before deploying
|
|
17
|
+
- Never alert on metrics that do not require human action
|
|
18
|
+
- Never deploy to 100% at once -- use canary deploys, feature flags, or rolling updates
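A staged rollout can be expressed as a tiny state function. The stage percentages and the name `nextTrafficPercent` are assumptions chosen for illustration, not values prescribed by this brief.

```typescript
// Traffic percentages for each rollout stage; the values are illustrative assumptions.
const ROLLOUT_STAGES = [1, 10, 50, 100];

// Returns the next traffic percentage, or 0 to signal a rollback.
function nextTrafficPercent(current: number, healthChecksPassing: boolean): number {
  if (!healthChecksPassing) return 0; // roll back instead of pushing forward
  const next = ROLLOUT_STAGES.find((stage) => stage > current);
  return next ?? 100; // already at full rollout
}
```

Each call advances one stage only while health checks pass, so a failing canary never reaches the full fleet.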
|
|
@@ -0,0 +1,114 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ci-cd-github-actions
|
|
3
|
+
description: "When setting up, debugging, or optimizing CI/CD pipelines. Use when the user mentions 'GitHub Actions,' 'CI/CD,' 'workflow,' 'pipeline,' 'deploy,' 'release automation,' 'build failing,' 'tests not running in CI,' or needs to automate testing, building, or deployment processes."
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# CI/CD with GitHub Actions
|
|
7
|
+
|
|
8
|
+
You are a DevOps engineer specializing in CI/CD pipeline design. Your goal is to create reliable, fast, and secure pipelines that catch issues early and deploy with confidence.
|
|
9
|
+
|
|
10
|
+
## Pipeline Design Principles
|
|
11
|
+
|
|
12
|
+
1. **Fail fast** — Run cheapest checks first (lint → type-check → unit tests → integration → e2e)
|
|
13
|
+
2. **Cache aggressively** — Dependencies, build artifacts, Docker layers
|
|
14
|
+
3. **Parallelize** — Independent jobs run concurrently
|
|
15
|
+
4. **Minimize secrets exposure** — Use OIDC over long-lived tokens where possible
|
|
16
|
+
5. **Make it reproducible** — Pin action versions, lock dependencies
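The fail-fast principle can be sketched as a runner that orders checks by estimated cost and stops at the first failure. The `Check` shape and `runFailFast` are hypothetical names; in a real workflow the ordering is expressed through job dependencies rather than code.

```typescript
interface Check {
  name: string;
  estimatedSeconds: number; // rough cost, used only for ordering
  run: () => boolean;       // true when the check passes
}

interface PipelineResult {
  passed: boolean;
  executed: string[]; // names of the checks that actually ran
}

function runFailFast(checks: Check[]): PipelineResult {
  // Cheapest checks first, so failures surface before expensive work starts.
  const ordered = [...checks].sort((a, b) => a.estimatedSeconds - b.estimatedSeconds);
  const executed: string[] = [];
  for (const check of ordered) {
    executed.push(check.name);
    if (!check.run()) return { passed: false, executed }; // stop at the first failure
  }
  return { passed: true, executed };
}
```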
|
|
17
|
+
|
|
18
|
+
## Standard Workflow Templates
|
|
19
|
+
|
|
20
|
+
### PR Check Pipeline
|
|
21
|
+
|
|
22
|
+
```yaml
name: CI
on:
  pull_request:
    branches: [main]

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # pnpm must be installed before setup-node can use `cache: 'pnpm'`;
      # assumes the pnpm version is declared in package.json (`packageManager`)
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: '.node-version'
          cache: 'pnpm'
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
      - run: pnpm type-check

  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        shard: [1, 2, 3]
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: '.node-version'
          cache: 'pnpm'
      - run: pnpm install --frozen-lockfile
      - run: pnpm test --shard=${{ matrix.shard }}/3
```
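To show what `--shard=<index>/<total>` asks of a test runner, here is a minimal round-robin partition. It is an illustration only: `testsForShard` is a hypothetical function, and real runners use their own splitting strategies.

```typescript
// Round-robin partition: shard numbers are 1-based, as in `--shard=1/3`.
function testsForShard(testFiles: string[], shard: number, totalShards: number): string[] {
  if (!Number.isInteger(shard) || shard < 1 || shard > totalShards) {
    throw new Error(`shard must be an integer between 1 and ${totalShards}`);
  }
  // Sort first so every job computes the same split from the same file list.
  const sorted = [...testFiles].sort();
  return sorted.filter((_, index) => index % totalShards === shard - 1);
}
```

Every file lands in exactly one shard, so the three matrix jobs together cover the whole suite without overlap.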
|
|
60
|
+
|
|
61
|
+
### Deploy Pipeline
|
|
62
|
+
|
|
63
|
+
```yaml
name: Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    permissions:
      id-token: write # OIDC
      contents: read  # unlisted scopes drop to none, and checkout needs read access
    steps:
      - uses: actions/checkout@v4
      # Assumes the pnpm version is declared in package.json (`packageManager`)
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: '.node-version'
          cache: 'pnpm'
      - run: pnpm install --frozen-lockfile
      - run: pnpm build
      - run: pnpm test
      # Deploy step depends on your platform
```
|
|
82
|
+
|
|
83
|
+
## Common Issues & Fixes
|
|
84
|
+
|
|
85
|
+
### Slow Pipelines
|
|
86
|
+
- Enable dependency caching (`actions/cache` or built-in cache in setup-node)
|
|
87
|
+
- Use `concurrency` to cancel stale runs
|
|
88
|
+
- Shard large test suites with `matrix`
|
|
89
|
+
- Use `paths` filter to skip irrelevant workflows
|
|
90
|
+
|
|
91
|
+
### Flaky Tests
|
|
92
|
+
- Retry known flaky tests through the test runner's own retry setting (for example Playwright's `--retries`), but fix the root cause
|
|
93
|
+
- Use `--bail` to fail fast on first broken test
|
|
94
|
+
- Separate deterministic tests from integration tests
|
|
95
|
+
|
|
96
|
+
### Security
|
|
97
|
+
- Pin actions to a full-length commit SHA, not a tag: `uses: actions/checkout@<full-commit-sha>`
|
|
98
|
+
- Use `permissions` to restrict token scope
|
|
99
|
+
- Never echo secrets in logs
|
|
100
|
+
- Use environment protection rules for production deploys
|
|
101
|
+
- Scan code with `github/codeql-action` and dependencies with `actions/dependency-review-action` or `snyk`
|
|
102
|
+
|
|
103
|
+
### Monorepo
|
|
104
|
+
- Use `paths` filter per package
|
|
105
|
+
- Use `dorny/paths-filter` for conditional jobs
|
|
106
|
+
- Share reusable workflows in `.github/workflows/`
|
|
107
|
+
|
|
108
|
+
## Debugging Workflow Failures
|
|
109
|
+
|
|
110
|
+
1. Read the full error log, not just the last line
|
|
111
|
+
2. Check: is it a code issue or a CI environment issue?
|
|
112
|
+
3. Common CI-only failures: missing env vars, different OS behavior, network timeouts
|
|
113
|
+
4. Use `act` for local workflow testing
|
|
114
|
+
5. Add `--verbose` or debug logging as needed
|
|
@@ -0,0 +1,394 @@
|
|
|
1
|
+
---
name: monitoring-observability
description: Set up monitoring, logging, and observability for applications and infrastructure. Use when implementing health checks, metrics collection, log aggregation, or alerting systems. Handles Prometheus, Grafana, ELK Stack, Datadog, and monitoring best practices.
metadata:
  tags: monitoring, observability, logging, metrics, Prometheus, Grafana, alerts
  platforms: Claude, ChatGPT, Gemini
---
|
|
8
|
+
|
|
9
|
+
|
|
10
|
+
# Monitoring & Observability
|
|
11
|
+
|
|
12
|
+
|
|
13
|
+
## When to use this skill
|
|
14
|
+
|
|
15
|
+
- **Before Production Deployment**: Essential monitoring system setup
|
|
16
|
+
- **Performance Issues**: Identify bottlenecks
|
|
17
|
+
- **Incident Response**: Quick root cause identification
|
|
18
|
+
- **SLA Compliance**: Track availability/response times
|
|
19
|
+
|
|
20
|
+
## Instructions
|
|
21
|
+
|
|
22
|
+
### Step 1: Metrics Collection (Prometheus)
|
|
23
|
+
|
|
24
|
+
**Application Instrumentation** (Node.js):
|
|
25
|
+
```typescript
import express from 'express';
import promClient from 'prom-client';

const app = express();

// Default metrics (CPU, Memory, etc.)
promClient.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Middleware to track requests
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      // Use the matched route pattern; raw paths would create unbounded label values
      route: req.route?.path ?? 'unmatched',
      status_code: res.statusCode
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

app.listen(3000);
```
|
|
74
|
+
|
|
75
|
+
**prometheus.yml**:
|
|
76
|
+
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/metrics'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alert_rules.yml'
```
|
|
99
|
+
|
|
100
|
+
### Step 2: Alert Rules
|
|
101
|
+
|
|
102
|
+
**alert_rules.yml**:
|
|
103
|
+
```yaml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          # $value is a ratio (0.05 = 5%), so format it as a percentage
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      # Slow response time
      - alert: SlowResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow response time"
          description: "95th percentile is {{ $value }}s"

      # Pod down
      - alert: PodDown
        expr: up{job="my-app"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod is down"
          description: "{{ $labels.instance }} has been down for more than 2 minutes"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }}"
```
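The `for:` clause in each rule delays firing until the expression has stayed true for the whole duration. The sketch below models that behaviour in a simplified way: it counts evaluation ticks instead of wall-clock timestamps, and `firstFiringEvaluation` is a hypothetical name, not part of Prometheus.

```typescript
// Returns the index of the first evaluation at which the alert would fire, or -1.
// conditionResults[i] is whether the alert expression was true at evaluation i.
function firstFiringEvaluation(
  conditionResults: boolean[],
  evaluationIntervalSeconds: number,
  forSeconds: number,
): number {
  let pendingSince = -1; // index where the current unbroken run of `true` began
  for (let i = 0; i < conditionResults.length; i++) {
    if (!conditionResults[i]) {
      pendingSince = -1; // condition cleared: the pending timer resets
      continue;
    }
    if (pendingSince === -1) pendingSince = i;
    const heldForSeconds = (i - pendingSince) * evaluationIntervalSeconds;
    if (heldForSeconds >= forSeconds) return i;
  }
  return -1;
}
```

With the group's 30s interval and `for: 5m`, a brief spike that clears within a few evaluations never fires, which is what keeps short blips from paging anyone.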
|
|
159
|
+
|
|
160
|
+
### Step 3: Log Aggregation (Structured Logging)
|
|
161
|
+
|
|
162
|
+
**Winston (Node.js)**:
|
|
163
|
+
```typescript
import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'my-app',
    environment: process.env.NODE_ENV
  },
  transports: [
    // Human-readable console output, intended for local development
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    }),
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/combined.log'
    })
  ]
});

// Usage
logger.info('User logged in', { userId: '123', ip: '1.2.3.4' });
// `err` is the Error caught by the surrounding handler
logger.error('Database connection failed', { error: err.message, stack: err.stack });

// Express middleware (`app` is the Express instance from Step 1)
app.use((req, res, next) => {
  logger.info('HTTP Request', {
    method: req.method,
    path: req.path,
    ip: req.ip,
    userAgent: req.get('user-agent')
  });
  next();
});
```
|
|
209
|
+
|
|
210
|
+
### Step 4: Grafana Dashboard
|
|
211
|
+
|
|
212
|
+
**dashboard.json** (example):
|
|
213
|
+
```json
{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status_code=~\"5..\"}[5m])",
            "legendFormat": "Errors"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          }
        ]
      },
      {
        "title": "CPU Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total[5m]) * 100"
          }
        ]
      }
    ]
  }
}
```
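The p95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not Prometheus's actual implementation; `Bucket` and `estimateQuantile` are names made up for this example, and edge cases such as negative bucket bounds are ignored.

```typescript
// One cumulative bucket: `count` observations were less than or equal to `le`.
interface Bucket {
  le: number; // upper bound in seconds; Infinity stands for the "+Inf" bucket
  count: number;
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (q < 0 || q > 1 || sorted.length < 2 || last === undefined || last.le !== Infinity) {
    return NaN; // needs a valid quantile and a +Inf bucket holding the total count
  }
  if (last.count === 0) return NaN;

  const rank = q * last.count; // how many observations lie at or below the quantile
  const idx = sorted.findIndex((b) => b.count >= rank);
  // A rank inside the +Inf bucket cannot be interpolated: report the highest finite bound.
  if (idx === sorted.length - 1) return sorted[idx - 1]!.le;

  const upper = sorted[idx]!;
  const lowerBound = idx === 0 ? 0 : sorted[idx - 1]!.le;
  const countBelow = idx === 0 ? 0 : sorted[idx - 1]!.count;
  const inBucket = upper.count - countBelow;
  if (inBucket === 0) return upper.le;
  // Linear interpolation assumes observations are spread evenly within the bucket.
  return lowerBound + (upper.le - lowerBound) * ((rank - countBelow) / inBucket);
}
```

Because the result is interpolated, its accuracy depends on bucket boundaries: a p95 alert threshold of 1s is only meaningful if a bucket edge sits near 1s.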
|
|
260
|
+
|
|
261
|
+
### Step 5: Health Checks
|
|
262
|
+
|
|
263
|
+
**Advanced Health Check**:
|
|
264
|
+
```typescript
// Assumes `app` (Express), `db` (a client exposing raw()), and `redis` (a client
// exposing ping()) are already initialized elsewhere.
interface HealthStatus {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  uptime: number;
  checks: {
    database: { status: string; latency?: number; error?: string };
    redis: { status: string; latency?: number };
    externalApi: { status: string; latency?: number };
  };
}

app.get('/health', async (req, res) => {
  const health: HealthStatus = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: { status: 'unknown' },
      redis: { status: 'unknown' },
      externalApi: { status: 'unknown' } // no check implemented in this example
    }
  };

  // Database check
  try {
    const dbStart = Date.now();
    await db.raw('SELECT 1');
    health.checks.database = {
      status: 'healthy',
      latency: Date.now() - dbStart
    };
  } catch (error) {
    health.status = 'unhealthy';
    health.checks.database = {
      status: 'unhealthy',
      error: error instanceof Error ? error.message : String(error)
    };
  }

  // Redis check
  try {
    const redisStart = Date.now();
    await redis.ping();
    health.checks.redis = {
      status: 'healthy',
      latency: Date.now() - redisStart
    };
  } catch (error) {
    // Redis is treated as non-critical: degrade, but never mask a database failure
    if (health.status === 'healthy') health.status = 'degraded';
    health.checks.redis = { status: 'unhealthy' };
  }

  // A degraded instance still serves traffic, so only 'unhealthy' returns 503
  const statusCode = health.status === 'unhealthy' ? 503 : 200;
  res.status(statusCode).json(health);
});
```
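The roll-up from individual checks to one overall status can be isolated as a pure function, which makes it easy to test without a database or Redis. This is a sketch: the `critical` list and both function names are assumptions introduced for illustration.

```typescript
type DependencyState = 'healthy' | 'unhealthy' | 'unknown';
type OverallState = 'healthy' | 'degraded' | 'unhealthy';

// `critical` lists the dependencies the service cannot run without (an assumption of this sketch).
function overallState(
  checks: Record<string, DependencyState>,
  critical: string[],
): OverallState {
  const failing = Object.keys(checks).filter((dep) => checks[dep] === 'unhealthy');
  if (failing.some((dep) => critical.includes(dep))) return 'unhealthy';
  return failing.length > 0 ? 'degraded' : 'healthy';
}

// Only 'unhealthy' should take the instance out of rotation.
function httpStatusFor(state: OverallState): number {
  return state === 'unhealthy' ? 503 : 200;
}
```

Computing the status from the full set of results, rather than mutating it check by check, means a later non-critical failure cannot downgrade an earlier critical one.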
|
|
322
|
+
|
|
323
|
+
## Output format
|
|
324
|
+
|
|
325
|
+
### Monitoring Dashboard Configuration
|
|
326
|
+
|
|
327
|
+
```
Golden Signals:
1. Latency (Response Time)
   - P50, P95, P99 percentiles
   - Per API endpoint

2. Traffic (Request Volume)
   - Requests per second
   - Per endpoint, per status code

3. Errors (Error Rate)
   - 5xx error rate
   - 4xx error rate
   - Per error type

4. Saturation (Resource Utilization)
   - CPU usage
   - Memory usage
   - Disk I/O
   - Network bandwidth
```
|
|
348
|
+
|
|
349
|
+
## Constraints
|
|
350
|
+
|
|
351
|
+
### Required Rules (MUST)
|
|
352
|
+
|
|
353
|
+
1. **Structured Logging**: JSON format logs
|
|
354
|
+
2. **Metric Labels**: Keep label values bounded; avoid high-cardinality labels such as user IDs or raw URLs
|
|
355
|
+
3. **Prevent Alert Fatigue**: Alert only on conditions that require human action
|
|
356
|
+
|
|
357
|
+
### Prohibited (MUST NOT)
|
|
358
|
+
|
|
359
|
+
1. **Do Not Log Sensitive Data**: Never log passwords, API keys
|
|
360
|
+
2. **Do Not Collect Excessive Metrics**: Unnecessary metrics waste resources
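One way to keep sensitive values out of structured logs is to mask them before serialization. The sketch below is library-independent; the key list and the names `redact` and `toLogLine` are assumptions for illustration and would need extending for real payloads.

```typescript
// Key names treated as sensitive; extend this list for your own fields (assumption).
const SENSITIVE_KEYS = ['password', 'apikey', 'api_key', 'token', 'authorization', 'secret'];

// Recursively copies a value, masking any property whose name is in SENSITIVE_KEYS.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === 'object') {
    const output: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      output[key] = SENSITIVE_KEYS.includes(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
    }
    return output;
  }
  return value;
}

// One JSON object per line, with sensitive fields masked before serialization.
function toLogLine(message: string, meta: Record<string, unknown>): string {
  return JSON.stringify({ message, ...(redact(meta) as Record<string, unknown>) });
}
```

Matching on key names only catches fields that are labelled as secrets; values embedded in free-text messages still need care at the call site.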
|
|
361
|
+
|
|
362
|
+
## Best practices
|
|
363
|
+
|
|
364
|
+
1. **Define SLO**: Clearly define Service Level Objectives
|
|
365
|
+
2. **Write Runbooks**: Document response procedures per alert
|
|
366
|
+
3. **Dashboards**: Customize dashboards as needed per team
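An SLO becomes actionable once it is turned into an error budget. As a worked example under simple assumptions (an availability SLO measured in minutes over a rolling window; function names are hypothetical), a 99.9% target over 30 days allows roughly 43.2 minutes of downtime.

```typescript
// Allowed downtime, in minutes, for an availability SLO over a rolling window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new Error('sloTarget must be a fraction between 0 and 1, e.g. 0.999');
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// Share of the budget already consumed by observed downtime (1 means fully spent).
function budgetConsumed(downtimeMinutes: number, sloTarget: number, windowDays: number): number {
  return downtimeMinutes / errorBudgetMinutes(sloTarget, windowDays);
}
```

Tracking the consumed share gives a concrete signal for trading velocity against reliability: when most of the budget is gone, risky deploys can wait.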
|
|
367
|
+
|
|
368
|
+
## References
|
|
369
|
+
|
|
370
|
+
- [Prometheus](https://prometheus.io/)
|
|
371
|
+
- [Grafana](https://grafana.com/)
|
|
372
|
+
- [Google SRE Book](https://sre.google/books/)
|
|
373
|
+
|
|
374
|
+
## Metadata
|
|
375
|
+
|
|
376
|
+
### Version
|
|
377
|
+
- **Current Version**: 1.0.0
|
|
378
|
+
- **Last Updated**: 2025-01-01
|
|
379
|
+
- **Compatible Platforms**: Claude, ChatGPT, Gemini
|
|
380
|
+
|
|
381
|
+
### Related Skills
|
|
382
|
+
- [deployment](../deployment/SKILL.md): Monitoring alongside deployment
|
|
383
|
+
- [security](../security/SKILL.md): Security event monitoring
|
|
384
|
+
|
|
385
|
+
### Tags
|
|
386
|
+
`#monitoring` `#observability` `#Prometheus` `#Grafana` `#logging` `#metrics` `#infrastructure`
|
|
387
|
+
|
|
388
|
+
## Examples
|
|
389
|
+
|
|
390
|
+
### Example 1: Basic usage
|
|
391
|
+
<!-- Add example content here -->
|
|
392
|
+
|
|
393
|
+
### Example 2: Advanced usage
|
|
394
|
+
<!-- Add advanced example content here -->
|
|
@@ -0,0 +1,46 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: systematic-debugging
|
|
3
|
+
description: Structured methodology for finding root causes before writing fixes
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
> Methodology from [obra/superpowers](https://github.com/obra/superpowers) (MIT)
|
|
7
|
+
|
|
8
|
+
# Systematic Debugging
|
|
9
|
+
|
|
10
|
+
Core rule: **find the root cause before writing any fix.**
|
|
11
|
+
|
|
12
|
+
## Phase 1 -- Root Cause Investigation
|
|
13
|
+
|
|
14
|
+
1. Reproduce the bug with the simplest possible input.
|
|
15
|
+
2. Read the actual error message / stack trace. Do NOT guess.
|
|
16
|
+
3. Trace the data flow backwards from the failure site to the origin.
|
|
17
|
+
4. Identify the earliest point where observed behavior diverges from expected.
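When the candidates are ordered (commits, pipeline stages, config versions), finding the earliest point of divergence can be done by binary search, the same idea `git bisect` applies to commits. The sketch assumes the sequence flips from good to bad exactly once; `firstBadIndex` is a hypothetical helper.

```typescript
// Returns the index of the first item for which isBad is true, or -1 if none is.
// Assumes the sequence is all good up to some point and all bad after it.
function firstBadIndex<T>(ordered: T[], isBad: (item: T) => boolean): number {
  let low = 0;
  let high = ordered.length; // the search window is [low, high)
  while (low < high) {
    const mid = Math.floor((low + high) / 2);
    if (isBad(ordered[mid] as T)) {
      high = mid;    // the first bad item is at mid or earlier
    } else {
      low = mid + 1; // everything up to and including mid is good
    }
  }
  return low < ordered.length ? low : -1;
}
```

Each probe halves the remaining range, so even a long history needs only a handful of reproduction runs. For intermittent bugs the single-flip assumption does not hold, and a probe should be repeated before it is trusted.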
|
|
18
|
+
|
|
19
|
+
## Phase 2 -- Pattern Analysis
|
|
20
|
+
|
|
21
|
+
1. Search the codebase for similar patterns (same API, same data path).
|
|
22
|
+
2. Check recent changes (git log, git blame) near the failure site.
|
|
23
|
+
3. Look for related open issues or past fixes for the same component.
|
|
24
|
+
4. Note if the bug is deterministic or intermittent -- intermittent implies concurrency, timing, or external state.
|
|
25
|
+
|
|
26
|
+
## Phase 3 -- Hypothesis Testing
|
|
27
|
+
|
|
28
|
+
1. Form exactly one hypothesis at a time.
|
|
29
|
+
2. Design a minimal experiment that can confirm or refute it.
|
|
30
|
+
3. Run the experiment. Read the output fully.
|
|
31
|
+
4. If refuted, discard the hypothesis and return to Phase 1 or 2. Do NOT patch and hope.
|
|
32
|
+
|
|
33
|
+
## Phase 4 -- Implementation
|
|
34
|
+
|
|
35
|
+
1. Write a failing test that demonstrates the root cause.
|
|
36
|
+
2. Apply the smallest change that makes the test pass.
|
|
37
|
+
3. Run the full test suite to check for regressions.
|
|
38
|
+
4. Verify the original reproduction case is resolved.
|
|
39
|
+
5. Document *why* the bug happened, not just *what* you changed.
|
|
40
|
+
|
|
41
|
+
## Anti-patterns to Avoid
|
|
42
|
+
|
|
43
|
+
- Shotgun debugging: making multiple changes at once.
|
|
44
|
+
- Fixing symptoms instead of root causes.
|
|
45
|
+
- Claiming "fixed" without re-running the reproduction case.
|
|
46
|
+
- Skipping the hypothesis step and jumping straight to code changes.
|
|
@@ -0,0 +1,47 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: verification
|
|
3
|
+
description: Enforce evidence-based verification before claiming any task is complete
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
> Methodology from [obra/superpowers](https://github.com/obra/superpowers) (MIT)
|
|
7
|
+
|
|
8
|
+
# Verification
|
|
9
|
+
|
|
10
|
+
Iron law: **no claims without fresh evidence.**
|
|
11
|
+
|
|
12
|
+
## The Verification Gate
|
|
13
|
+
|
|
14
|
+
Before you say "done", "works", "fixed", or "verified", you MUST:
|
|
15
|
+
|
|
16
|
+
1. **Run** the relevant command (test, build, lint, curl, etc.).
|
|
17
|
+
2. **Read** the full output -- not just the exit code.
|
|
18
|
+
3. **Confirm** the output matches the expected result.
|
|
19
|
+
4. **Only then** claim completion.
|
|
20
|
+
|
|
21
|
+
If you cannot run a verification command, say so explicitly. Never assume.
|
|
22
|
+
|
|
23
|
+
## What Counts as Verification
|
|
24
|
+
|
|
25
|
+
| Claim | Minimum Evidence |
|-------|-----------------|
| "Tests pass" | Paste or reference the test runner output showing green. |
| "Build succeeds" | Show the build command output with zero errors. |
| "Bug is fixed" | Show the reproduction case now producing correct output. |
| "File updated" | Read the file back and confirm the expected content. |
| "Service is running" | Hit the health endpoint and show the response. |
|
|
32
|
+
|
|
33
|
+
## Workflow
|
|
34
|
+
|
|
35
|
+
1. Finish your change.
|
|
36
|
+
2. Decide which claims you are about to make.
|
|
37
|
+
3. For each claim, run the matching verification step.
|
|
38
|
+
4. If any step fails, fix and re-verify. Do NOT skip ahead.
|
|
39
|
+
5. Report results with evidence (command + output).
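The gate can be sketched as a check that looks at both the exit code and the captured output. Everything here is illustrative: `CommandResult`, `checkEvidence`, and the patterns a caller passes in are assumptions, and the function only judges evidence that has already been collected by actually running the command.

```typescript
interface CommandResult {
  command: string;
  exitCode: number;
  output: string; // full captured stdout and stderr
}

interface Verdict {
  verified: boolean;
  reason: string;
}

// `expected` must appear in the output; `forbidden` (e.g. warnings) must not.
function checkEvidence(result: CommandResult, expected: RegExp, forbidden: RegExp): Verdict {
  if (result.exitCode !== 0) {
    return { verified: false, reason: `${result.command} exited with ${result.exitCode}` };
  }
  if (result.output.search(expected) === -1) {
    return { verified: false, reason: 'output does not show the expected result' };
  }
  if (result.output.search(forbidden) !== -1) {
    return { verified: false, reason: 'clean exit code, but the output reports a problem' };
  }
  return { verified: true, reason: 'exit code and output both match' };
}
```

The third branch is the one that matters most: a zero exit code alone never produces `verified: true`.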
|
|
40
|
+
|
|
41
|
+
## Anti-patterns to Avoid
|
|
42
|
+
|
|
43
|
+
- Saying "should work" without running anything.
|
|
44
|
+
- Running a command but not reading its output.
|
|
45
|
+
- Verifying one thing and claiming another.
|
|
46
|
+
- Treating a clean exit code as proof when the output contains warnings or partial failures.
|
|
47
|
+
- Re-using stale evidence from a previous run after making further changes.
|