aigent-team 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. package/LICENSE +21 -0
  2. package/README.md +253 -0
  3. package/dist/chunk-N3RYHWTR.js +267 -0
  4. package/dist/cli.js +576 -0
  5. package/dist/index.d.ts +234 -0
  6. package/dist/index.js +27 -0
  7. package/package.json +67 -0
  8. package/templates/shared/git-workflow.md +44 -0
  9. package/templates/shared/project-conventions.md +48 -0
  10. package/templates/teams/ba/agent.yaml +25 -0
  11. package/templates/teams/ba/references/acceptance-criteria.md +87 -0
  12. package/templates/teams/ba/references/api-contract-design.md +110 -0
  13. package/templates/teams/ba/references/requirements-analysis.md +83 -0
  14. package/templates/teams/ba/references/user-story-mapping.md +73 -0
  15. package/templates/teams/ba/skill.md +85 -0
  16. package/templates/teams/be/agent.yaml +34 -0
  17. package/templates/teams/be/conventions.md +102 -0
  18. package/templates/teams/be/references/api-design.md +91 -0
  19. package/templates/teams/be/references/async-processing.md +86 -0
  20. package/templates/teams/be/references/auth-security.md +58 -0
  21. package/templates/teams/be/references/caching.md +79 -0
  22. package/templates/teams/be/references/database.md +65 -0
  23. package/templates/teams/be/references/error-handling.md +106 -0
  24. package/templates/teams/be/references/observability.md +83 -0
  25. package/templates/teams/be/references/review-checklist.md +50 -0
  26. package/templates/teams/be/references/testing.md +100 -0
  27. package/templates/teams/be/review-checklist.md +54 -0
  28. package/templates/teams/be/skill.md +71 -0
  29. package/templates/teams/devops/agent.yaml +35 -0
  30. package/templates/teams/devops/conventions.md +133 -0
  31. package/templates/teams/devops/references/ci-cd.md +218 -0
  32. package/templates/teams/devops/references/cost-optimization.md +218 -0
  33. package/templates/teams/devops/references/disaster-recovery.md +199 -0
  34. package/templates/teams/devops/references/docker.md +237 -0
  35. package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
  36. package/templates/teams/devops/references/kubernetes.md +397 -0
  37. package/templates/teams/devops/references/monitoring.md +224 -0
  38. package/templates/teams/devops/references/review-checklist.md +149 -0
  39. package/templates/teams/devops/references/security.md +225 -0
  40. package/templates/teams/devops/review-checklist.md +72 -0
  41. package/templates/teams/devops/skill.md +131 -0
  42. package/templates/teams/fe/agent.yaml +28 -0
  43. package/templates/teams/fe/conventions.md +80 -0
  44. package/templates/teams/fe/references/accessibility.md +92 -0
  45. package/templates/teams/fe/references/component-architecture.md +87 -0
  46. package/templates/teams/fe/references/css-styling.md +89 -0
  47. package/templates/teams/fe/references/forms.md +73 -0
  48. package/templates/teams/fe/references/performance.md +104 -0
  49. package/templates/teams/fe/references/review-checklist.md +51 -0
  50. package/templates/teams/fe/references/security.md +90 -0
  51. package/templates/teams/fe/references/state-management.md +117 -0
  52. package/templates/teams/fe/references/testing.md +112 -0
  53. package/templates/teams/fe/review-checklist.md +53 -0
  54. package/templates/teams/fe/skill.md +68 -0
  55. package/templates/teams/lead/agent.yaml +18 -0
  56. package/templates/teams/lead/references/cross-team-coordination.md +68 -0
  57. package/templates/teams/lead/references/quality-gates.md +64 -0
  58. package/templates/teams/lead/references/task-decomposition.md +69 -0
  59. package/templates/teams/lead/skill.md +83 -0
  60. package/templates/teams/qa/agent.yaml +32 -0
  61. package/templates/teams/qa/conventions.md +130 -0
  62. package/templates/teams/qa/references/ci-integration.md +337 -0
  63. package/templates/teams/qa/references/e2e-testing.md +292 -0
  64. package/templates/teams/qa/references/mocking.md +249 -0
  65. package/templates/teams/qa/references/performance-testing.md +288 -0
  66. package/templates/teams/qa/references/review-checklist.md +143 -0
  67. package/templates/teams/qa/references/security-testing.md +271 -0
  68. package/templates/teams/qa/references/test-data.md +275 -0
  69. package/templates/teams/qa/references/test-strategy.md +192 -0
  70. package/templates/teams/qa/review-checklist.md +53 -0
  71. package/templates/teams/qa/skill.md +131 -0
@@ -0,0 +1,79 @@
1
+ # Caching
2
+
3
+ ## Strategy Selection
4
+
5
+ | Pattern | When to use | How it works |
6
+ |---------|------------|--------------|
7
+ | **Cache-aside** | Read-heavy, stale data OK | Check cache → miss → query DB → write cache |
8
+ | **Write-through** | Data changes often, consistency critical | Write DB + cache simultaneously |
9
+ | **Write-behind** | High write volume, eventual consistency OK | Write cache → async flush to DB |
10
+ | **Memoization** | Expensive computation, immutable inputs | Compute once → cache result with TTL |
11
+
12
+ ## Cache Key Design
13
+
14
+ Format: `{service}:{entity}:{id}:{version}`
15
+ ```
16
+ user-service:profile:123:v2
17
+ order-service:list:user_456:page_1:sort_date
18
+ search:results:hash_of_query_params
19
+ ```
20
+
21
+ - Keys must be deterministic (same input → same key)
22
+ - Include version to handle schema changes
23
+ - Hash complex parameters instead of using them directly
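The key rules above can be sketched as follows. This is a minimal illustration, not code from the package: `buildCacheKey`, `hashParams`, and the FNV-1a hash are assumptions chosen to keep the example dependency-free, and any stable hash would serve.

```typescript
// Hypothetical helpers illustrating `{service}:{entity}:{id}:{version}` keys.
type Params = Record<string, string | number | boolean>;

// FNV-1a (32-bit) over UTF-16 code units; used only to avoid external dependencies.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

// Sort parameter names so the same input always produces the same string.
function hashParams(params: Params): string {
  const canonical = Object.keys(params)
    .sort()
    .map((name) => `${name}=${String(params[name])}`)
    .join("&");
  return fnv1a(canonical);
}

function buildCacheKey(service: string, entity: string, id: string, version: string): string {
  return [service, entity, id, version].join(":");
}

// Complex query parameters are hashed instead of being embedded directly.
const searchKey = buildCacheKey("search", "results", hashParams({ q: "shoes", page: 1 }), "v2");
```

Sorting the parameter names before hashing is what makes the key independent of the order in which a caller happens to build the query object.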
24
+
25
+ ## TTL Guidelines
26
+
27
+ | Data type | TTL | Reason |
28
+ |-----------|-----|--------|
29
+ | App config | 1 hour | Changes rarely, OK to be slightly stale |
30
+ | User profile | 15 min | Changes occasionally |
31
+ | Search results | 5 min | Changes frequently |
32
+ | Auth session | Match token expiry | Security-sensitive |
33
+ | Never: infinite TTL | - | Memory leak, stale data |
34
+
35
+ ## Cache Stampede Prevention
36
+
37
+ When a popular cache key expires, thousands of requests simultaneously hit the DB:
38
+
39
+ **Solution 1: Probabilistic early expiration**
40
+ ```
41
+ actual_ttl = ttl - random(0, ttl * 0.1)
42
+ // Each entry expires at a random point up to 10% early, so entries cached together do not all miss at once
43
+ ```
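A related formulation lets each reader decide, with a probability that rises as expiry approaches, whether to refresh the value early (this is sometimes described as probabilistic early recomputation, or XFetch). The sketch below is an illustration under that assumption; `shouldRefreshEarly` and its parameters are hypothetical names, not part of the package.

```typescript
// Decide whether this request should recompute the value before it expires.
// nowMs, expiresAtMs: timestamps in milliseconds.
// computeMs: how long the last recomputation took.
// beta: values above 1 favor earlier refreshes.
// rand: a number in (0, 1]; injectable so the function can be tested.
function shouldRefreshEarly(
  nowMs: number,
  expiresAtMs: number,
  computeMs: number,
  beta: number = 1,
  rand: number = 1 - Math.random(), // Math.random() is in [0, 1), so this is in (0, 1]
): boolean {
  // -log(rand) is >= 0 and grows without bound as rand approaches 0.
  const earlyByMs = -computeMs * beta * Math.log(rand);
  return nowMs + earlyByMs >= expiresAtMs;
}
```

Because the early window scales with `computeMs`, values that are expensive to rebuild start refreshing sooner than cheap ones, and only a small fraction of requests pay the refresh cost.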
44
+
45
+ **Solution 2: Lock-based recomputation**
46
+ ```
47
+ if cache miss:
48
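+   if acquire_lock(key):          // give the lock its own short TTL so a crashed holder cannot block others
49
+     result = query_db()
50
+     cache.set(key, result, ttl)
51
+     release_lock(key)            // release in a finally block so a failed query does not leave the lock held
52
+   else:
53
+     wait_and_retry() // or return stale data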
54
+ ```
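Within a single process, the lock can be as simple as a map of in-flight promises. The sketch below assumes a hypothetical `SingleFlight` class; it only coalesces callers inside one instance, so a shared lock (for example in Redis) would still be needed across instances.

```typescript
// Hypothetical in-process variant of lock-based recomputation:
// concurrent callers for the same key share one pending load.
class SingleFlight<T> {
  private inFlight = new Map<string, Promise<T>>();

  run(key: string, load: () => Promise<T>): Promise<T> {
    const pending = this.inFlight.get(key);
    if (pending) return pending; // another caller is already loading this key

    const started = load().finally(() => {
      this.inFlight.delete(key); // release on success and on failure
    });
    this.inFlight.set(key, started);
    return started;
  }
}
```

A caller would wrap the cache-miss path, for example `flight.run(key, () => queryDbAndFillCache(key))`, where `queryDbAndFillCache` stands for whatever loader the service already has.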
55
+
56
+ ## Invalidation Strategy
57
+
58
+ - **Event-driven** (preferred): On write, publish cache-invalidation event
59
+ ```
60
+ orderService.create(order) → publish('order:created', { id: order.id })
61
+ cacheSubscriber.on('order:created') → cache.del(`order:${id}`)
62
+ ```
63
+ - **TTL-only** (safety net): Cache expires naturally. Simpler but allows stale reads.
64
+ - **Write-through**: Update cache on every write. Consistent but slower writes.
65
+
66
+ Rules:
67
+ - Always set TTL even with event-driven invalidation (safety net)
68
+ - Cache invalidation is one of the two hard problems — err on the side of shorter TTL
69
+ - Monitor cache hit rate — target > 90%. Below 80% = likely misconfigured TTL or key design
70
+
71
+ ## Multi-layer Caching
72
+
73
+ ```
74
+ Request → Local memory cache (1 min) → Redis (15 min) → Database
75
+ ```
76
+
77
+ - L1 (in-process): Fastest, but per-instance (not shared). For truly hot data.
78
+ - L2 (Redis/Memcached): Shared across instances. Primary cache layer.
79
+ - Invalidate both layers on write.
@@ -0,0 +1,65 @@
1
+ # Database
2
+
3
+ ## Schema Standards
4
+
5
+ - Every table has: `id` (UUID v7 or ULID — sortable, no sequential guessing), `created_at`, `updated_at` timestamps
6
+ - Soft deletes: `deleted_at` column. All queries filter `WHERE deleted_at IS NULL`. Never hard-delete unless legally required (GDPR).
7
+ - Use `NOT NULL` with defaults wherever possible. Nullable columns require explicit justification.
8
+
9
+ ## Indexes
10
+
11
+ - Create indexes for every column used in `WHERE`, `JOIN`, or `ORDER BY`
12
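+ - Composite indexes follow the left-prefix rule: a query can use the index only if it filters on the leading column(s). Put equality-filtered columns first (most selective first), then range or `ORDER BY` columns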
13
+ - Cover frequently used queries with covering indexes (include all SELECT columns)
14
+ - Run `EXPLAIN ANALYZE` on every new/modified query to verify index usage
15
+ - Create indexes concurrently in production: `CREATE INDEX CONCURRENTLY` (Postgres) to avoid table locks
16
+
17
+ ## Query Performance
18
+
19
+ - A single API request should execute ≤ 5 queries. More = missing JOIN or need to denormalize.
20
+ - N+1 detection: enable ORM query logging in tests, count queries per endpoint.
21
+ - Use `EXPLAIN ANALYZE` to check:
22
+   - Seq Scan on large tables (missing index)
23
+   - Hash Join on large datasets (consider optimization)
24
+   - Sort operations (add index for ORDER BY)
25
+   - Estimated row counts far from actual row counts (stale statistics → run `ANALYZE`)
26
+
27
+ ## Transactions
28
+
29
+ - Use for any operation modifying multiple tables
30
+ - Scope as narrowly as possible — don't hold locks during HTTP calls
31
+ - Read-only queries outside transactions (no unnecessary locking)
32
+ - Use optimistic locking (`version` column) for concurrent update scenarios:
33
+ ```sql
34
+ UPDATE orders SET status = 'confirmed', version = version + 1
35
+ WHERE id = :id AND version = :expected_version
36
+ ```
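The statement above only protects the row if the caller checks how many rows it changed. A minimal sketch, assuming a hypothetical `execute` function that stands in for the database client and resolves to the affected-row count:

```typescript
// Hypothetical types: `execute` represents whatever database client is in use
// and is assumed to resolve to the number of rows the statement changed.
type Execute = (sql: string, params: Record<string, unknown>) => Promise<number>;

class StaleVersionError extends Error {
  constructor(id: string) {
    super(`Order ${id} was modified by another request`);
    this.name = "StaleVersionError";
  }
}

async function confirmOrder(execute: Execute, id: string, expectedVersion: number): Promise<void> {
  const affected = await execute(
    `UPDATE orders SET status = 'confirmed', version = version + 1
     WHERE id = :id AND version = :expected_version`,
    { id, expected_version: expectedVersion },
  );
  // Zero rows means the row is missing or its version moved on since it was read.
  if (affected === 0) throw new StaleVersionError(id);
}
```

The caller can then reload the row and retry, or surface the conflict to the client (a 409 fits the `ConflictError` convention used elsewhere in these templates).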
37
+
38
+ ## Connection Pooling
39
+
40
+ - Pool size: `(number_of_cores * 2) + effective_spindle_count` — typically 10-20 per service instance
41
+ - Never unlimited connections
42
+ - Monitor pool usage — exhaustion causes cascading failures
43
+ - Use PgBouncer or application-level pooling for serverless environments
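As a worked example of the sizing formula, here is a small sketch. The function names are hypothetical, and the second check (total connections across instances against the server limit) is an added assumption rather than a rule from the text; the core count in this formula is usually read as the database server's.

```typescript
// Pool size per the formula above: (number_of_cores * 2) + effective_spindle_count.
function poolSize(cores: number, effectiveSpindles: number): number {
  return cores * 2 + effectiveSpindles;
}

// Hypothetical sanity check: all instances together must stay within the
// database's connection limit (for Postgres, the `max_connections` setting).
function fitsServerLimit(perInstance: number, instances: number, maxConnections: number): boolean {
  return perInstance * instances <= maxConnections;
}

const perInstance = poolSize(8, 1); // 8 cores, 1 spindle (SSD)
const ok = fitsServerLimit(perInstance, 5, 100);
```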
44
+
45
+ ## Migration Safety
46
+
47
+ **Zero-downtime migration sequence (for breaking changes):**
48
+ 1. Add new column with default value (backward compatible)
49
+ 2. Deploy code that writes to BOTH old and new columns
50
+ 3. Backfill existing data in batches (not in the migration)
51
+ 4. Deploy code that reads from new column
52
+ 5. Drop old column (separate migration, after verification)
53
+
54
+ **Rules:**
55
+ - Schema migrations must complete in < 30 seconds
56
+ - Data backfills in separate scripts, run in batches of 1000
57
+ - Always test `down` migration — verify rollback works
58
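+ - Large table migrations: `ALTER TABLE ... ADD COLUMN ... DEFAULT` is fast in Postgres 11+ (metadata-only) when the default is a constant; a volatile default such as `clock_timestamp()` still rewrites the whole table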
59
+ - Test with production-scale data volume
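The batching rule can be sketched as a standalone backfill script. `toBatches`, `backfill`, and the `updateBatch` callback are hypothetical names for illustration; the callback is assumed to run one short transaction per batch.

```typescript
// Hypothetical helper: splits ids into fixed-size batches (1000 by default).
function toBatches<T>(items: readonly T[], batchSize: number = 1000): T[][] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// Runs the batches one after another and returns how many rows were processed.
async function backfill<T>(
  ids: readonly T[],
  updateBatch: (batch: T[]) => Promise<void>,
): Promise<number> {
  let done = 0;
  for (const batch of toBatches(ids)) {
    await updateBatch(batch); // one short transaction per batch keeps locks brief
    done += batch.length;
  }
  return done;
}
```

Running batches sequentially keeps load on the primary predictable; the returned count makes it easy to log progress and resume after an interruption.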
60
+
61
+ ## Read Replicas
62
+
63
+ - Use for reporting, analytics, search queries
64
+ - Write to primary only
65
+ - Account for replication lag in application code — after write, read from primary for consistency
@@ -0,0 +1,106 @@
1
+ # Error Handling & Resilience
2
+
3
+ ## Domain Error Hierarchy
4
+
5
+ Create typed errors that map to HTTP status codes in one place:
6
+ ```typescript
7
+ abstract class DomainError extends Error {
8
+ abstract statusCode: number;
9
+ abstract code: string;
10
+ }
11
+
12
+ class NotFoundError extends DomainError {
13
+ statusCode = 404;
14
+ code = 'NOT_FOUND';
15
+ }
16
+
17
+ class ConflictError extends DomainError {
18
+ statusCode = 409;
19
+ code = 'CONFLICT';
20
+ }
21
+
22
+ class ValidationError extends DomainError {
23
+ statusCode = 400;
24
+ code = 'VALIDATION_ERROR';
25
+ constructor(public details: Array<{ field: string; message: string }>) {
26
+ super('Validation failed');
27
+ }
28
+ }
29
+ ```
30
+
31
+ Map in error handler middleware, not in every controller:
32
+ ```typescript
33
+ app.use((err, req, res, next) => {
34
+   if (err instanceof DomainError) {
35
+     return res.status(err.statusCode).json({
36
+       error: { code: err.code, message: err.message }
37
+     });
38
+   }
39
+   // Unknown error = 500, log full stack
40
+   logger.error({ err, requestId: req.id }, 'Unhandled error');
41
+   res.status(500).json({ error: { code: 'INTERNAL_ERROR', message: 'Something went wrong' } });
42
+ });
43
+ ```
44
+
45
+ ## External Service Resilience
46
+
47
+ **Timeouts:**
48
+ - Connect timeout: 3 seconds (can't establish connection = service is down)
49
+ - Read timeout: 10 seconds (waiting for response)
50
+ - Total timeout: 15 seconds max
51
+
52
+ **Retries with exponential backoff + jitter:**
53
+ ```
54
+ Attempt 1: immediate
55
+ Attempt 2: 1s + random(0-500ms)
56
+ Attempt 3: 2s + random(0-500ms)
57
+ Attempt 4: 4s + random(0-500ms)
58
+ Max: 3 retries
59
+ ```
60
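+ Only retry on 5xx and network errors. Do not retry 4xx (client error); the usual exceptions are 408 and 429, which signal a transient condition (honor `Retry-After` if present). Retry non-idempotent requests only when they carry an idempotency key.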
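The schedule above can be expressed as two small functions. This is a sketch: the names are hypothetical, attempts are 1-based, and the retry test covers only 5xx responses and network errors.

```typescript
// attempt is 1-based: attempt 1 is the initial call, attempts 2 to 4 are the retries.
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1000;
const MAX_JITTER_MS = 500;

// Delay to wait before making the given attempt.
function backoffDelayMs(attempt: number, rand: () => number = Math.random): number {
  if (attempt <= 1) return 0; // first attempt is immediate
  const exponential = BASE_DELAY_MS * 2 ** (attempt - 2); // 1s, 2s, 4s
  return exponential + rand() * MAX_JITTER_MS; // jitter spreads out simultaneous retries
}

// Called after `attempt` failed; `status` is undefined when no response arrived.
function shouldRetry(attempt: number, status: number | undefined): boolean {
  if (attempt > MAX_RETRIES) return false; // attempt 4 was the last retry
  if (status === undefined) return true; // network error
  return status >= 500 && status <= 599;
}
```

Passing `rand` as a parameter keeps the jitter testable; production code would leave it at the default.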
61
+
62
+ **Circuit breaker:**
63
+ - **Closed** (normal): requests pass through. Track failure rate.
64
+ - **Open** (failing): requests fail immediately without calling the service. Timer starts.
65
+ - **Half-open** (testing): allow one request through. If it succeeds → close. If it fails → open again.
66
+ - Trigger: 5 consecutive failures or >50% failure rate in 10-second window.
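The three states can be sketched as a small class. This is an illustration under stated assumptions: `CircuitBreaker` is a hypothetical name, only the consecutive-failure trigger is implemented (the failure-rate window is left out for brevity), and the 30-second open duration is an assumed value because the text does not give one.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number = 5,
    private readonly openDurationMs: number = 30000, // assumed value
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {}

  // Call before each request; false means fail fast without calling the service.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.openDurationMs) {
      this.state = "half-open"; // timer elapsed: let a single probe through
      return true;
    }
    return this.state === "closed";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

Callers check `allowRequest()` first, then report the outcome with `recordSuccess()` or `recordFailure()`; a failed probe in the half-open state reopens the breaker and restarts the timer.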
67
+
68
+ **Graceful degradation:**
69
+ - Non-critical service down (recommendations, analytics) → main flow continues, skip the feature
70
+ - Critical service down (auth, payment) → fail with clear error, retry mechanism
71
+ - Return cached data when possible for degraded responses
72
+
73
+ ## Health Checks
74
+
75
+ ```
76
+ GET /health/live → { status: "ok" } // Process alive (K8s liveness probe)
77
+ GET /health/ready → { status: "ok", // Can serve traffic (K8s readiness)
78
+ checks: {
79
+ database: "ok",
80
+ cache: "ok",
81
+ queue: "ok"
82
+ }}
83
+ ```
84
+ - Liveness: never check dependencies (prevents cascade restarts)
85
+ - Readiness: check all critical dependencies
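One way to build the readiness payload is to run every dependency check and report each result by name. The sketch below assumes a hypothetical `readiness` function whose checks are async functions that throw on failure; per-check timeouts are omitted for brevity.

```typescript
type CheckResult = "ok" | "fail";
type Check = () => Promise<void>; // assumed convention: reject when the dependency is unhealthy

async function readiness(
  checks: Record<string, Check>,
): Promise<{ status: CheckResult; checks: Record<string, CheckResult> }> {
  const results: Record<string, CheckResult> = {};
  await Promise.all(
    Object.keys(checks).map(async (name) => {
      try {
        await checks[name]();
        results[name] = "ok";
      } catch {
        results[name] = "fail";
      }
    }),
  );
  // Ready only when every critical dependency passed.
  const allOk = Object.keys(results).every((name) => results[name] === "ok");
  const status: CheckResult = allOk ? "ok" : "fail";
  return { status, checks: results };
}
```

The readiness handler would typically respond with 200 when `status` is `"ok"` and 503 otherwise, while the liveness handler returns 200 without calling any of these checks.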
86
+
87
+ ## Graceful Shutdown
88
+
89
+ On SIGTERM:
90
+ 1. Stop accepting new requests (remove from load balancer)
91
+ 2. Finish in-flight requests (30 second timeout)
92
+ 3. Close database connections
93
+ 4. Close queue connections
94
+ 5. Exit process
95
+
96
+ ```typescript
97
+ process.on('SIGTERM', async () => {
98
+   server.close(); // Stop accepting
99
+   await Promise.race([
100
+     finishInFlightRequests(),
101
+     new Promise(resolve => setTimeout(resolve, 30000)), // 30s max
102
+   ]);
103
+   await db.disconnect();
104
+   process.exit(0);
105
+ });
106
+ ```
@@ -0,0 +1,83 @@
1
+ # Observability
2
+
3
+ ## Three Pillars
4
+
5
+ ### 1. Structured Logging
6
+
7
+ Every log entry must include:
8
+ ```json
9
+ {
10
+   "timestamp": "2024-01-15T10:30:00.000Z",
11
+   "level": "info",
12
+   "message": "Order created",
13
+   "service": "order-service",
14
+   "environment": "production",
15
+   "request_id": "req_abc123",
16
+   "user_id": "usr_456",
17
+   "duration_ms": 45,
18
+   "order_id": "ord_789"
19
+ }
20
+ ```
21
+
22
+ **Log levels:**
23
+ - `ERROR`: Something broke, requires investigation. Pages on-call in production.
24
+ - `WARN`: Unexpected but handled. Rate limit hit, cache miss on hot key, slow query.
25
+ - `INFO`: Significant business events: user registered, order placed, payment processed.
26
+ - `DEBUG`: Development only. Never enable in production (performance + noise).
27
+
28
+ **Never log:** passwords, tokens, credit card numbers, full email addresses, session IDs.
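A scrubbing step in front of the logger is one way to enforce this list. The sketch below is an assumption-laden illustration: `redact`, `maskEmail`, and the `SENSITIVE_KEYS` list are hypothetical, and the code expects plain JSON-like log entries (arrays are passed through unchanged).

```typescript
// Hypothetical field list; extend it to match what a service actually logs.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization", "card_number", "session_id"]);

// Keep the first character and the domain: "alice@example.com" -> "a***@example.com".
function maskEmail(value: string): string {
  return value.replace(/([^\s@])[^\s@]*@([^\s@]+)/g, "$1***@$2");
}

function redact(entry: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const key of Object.keys(entry)) {
    const value = entry[key];
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      clean[key] = "[REDACTED]";
    } else if (typeof value === "string") {
      clean[key] = maskEmail(value);
    } else if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      clean[key] = redact(value as Record<string, unknown>); // recurse into nested objects
    } else {
      clean[key] = value;
    }
  }
  return clean;
}
```

Applying `redact` once, inside the logger wrapper, is usually safer than relying on every call site to remember it.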
29
+
30
+ ### 2. Distributed Tracing
31
+
32
+ - Propagate trace context (`traceparent` header) across all service calls
33
+ - Every outgoing HTTP, gRPC, queue message carries the trace ID
34
+ - Instrument: incoming requests, outgoing HTTP calls, database queries, cache operations, queue publish/consume
35
+ - Sample rate: 100% for errors, 10% for success in production
36
+
37
+ ### 3. Metrics (Prometheus format)
38
+
39
+ Essential metrics to expose:
40
+ - `http_requests_total{method, path, status}` — request rate
41
+ - `http_request_duration_seconds{method, path}` — latency histogram (p50, p95, p99)
42
+ - `http_requests_in_flight` — active connections
43
+ - `db_query_duration_seconds{query}` — database latency
44
+ - `cache_hits_total` / `cache_misses_total` — cache effectiveness
45
+ - `queue_depth{queue}` — backlog size
46
+ - `circuit_breaker_state{service}` — open/closed/half-open
47
+
48
+ ## Alerting
49
+
50
+ **Alert on symptoms (user impact), not causes:**
51
+ - GOOD: "Error rate > 1% for 5 minutes"
52
+ - BAD: "CPU > 80%" (might be normal)
53
+ - GOOD: "p99 latency > 2s for 10 minutes"
54
+ - BAD: "Memory > 70%" (might be normal for JVM)
55
+
56
+ **Severity levels:**
57
+ | Level | Action | Example |
58
+ |-------|--------|---------|
59
+ | P1 Critical | Page on-call NOW | Error rate > 5%, service down |
60
+ | P2 High | Slack team channel | Error rate > 1%, p99 > 2x baseline |
61
+ | P3 Medium | Create ticket | Disk > 80%, cert expires in 14d |
62
+ | P4 Low | Dashboard only | Cost anomaly, deprecation warning |
63
+
64
+ Every P1/P2 alert must have a runbook.
65
+
66
+ ## Request Tracing Pattern
67
+
68
+ ```typescript
69
+ // Middleware: create request context
70
+ app.use((req, res, next) => {
71
+   req.id = req.headers['x-request-id'] || crypto.randomUUID();
72
+   res.setHeader('x-request-id', req.id);
73
+   next();
74
+ });
75
+
76
+ // Logger: auto-include request context
77
+ const logger = createLogger({
78
+   defaultMeta: { service: 'api', env: process.env.NODE_ENV },
79
+ });
80
+
81
+ // Usage: always include request_id
82
+ logger.info({ request_id: req.id, user_id: user.id }, 'Order created');
83
+ ```
@@ -0,0 +1,50 @@
1
+ # Backend Review Checklist
2
+
3
+ ### API Contract
4
+ - [ ] URL is RESTful and consistent with existing endpoints
5
+ - [ ] HTTP status codes semantically correct (not just 200 and 500)
6
+ - [ ] Request/response schemas documented (OpenAPI/Swagger updated)
7
+ - [ ] Error responses include machine-readable `code` + human-readable `message`
8
+ - [ ] Pagination for list endpoints (cursor-based for large datasets)
9
+ - [ ] No breaking changes to existing API contracts
10
+
11
+ ### Data Layer
12
+ - [ ] No N+1 queries — verified with query logging
13
+ - [ ] Queries use indexes — ran `EXPLAIN ANALYZE` on new queries
14
+ - [ ] Transactions scoped correctly (cover related mutations, not too broad)
15
+ - [ ] Migrations backward compatible (old code works with new schema during deploy)
16
+ - [ ] Large table migrations done in phases (no locks on tables > 100K rows)
17
+ - [ ] Soft deletes used (unless GDPR erasure required)
18
+
19
+ ### Security
20
+ - [ ] All input validated at API boundary (schema validation, string length limits)
21
+ - [ ] No SQL injection vectors — parameterized queries only
22
+ - [ ] IDOR check — resource queries scoped by authenticated user
23
+ - [ ] Auth/authz middleware applied with correct role/permission
24
+ - [ ] Rate limiting configured
25
+ - [ ] No PII in logs
26
+ - [ ] Secrets from secret manager, not hardcoded
27
+
28
+ ### Error Handling & Resilience
29
+ - [ ] All error paths return appropriate HTTP status codes
30
+ - [ ] External service calls have timeouts (connect + read)
31
+ - [ ] Retries use exponential backoff with jitter
32
+ - [ ] Circuit breaker for non-critical dependencies
33
+ - [ ] Race conditions considered (optimistic locking, idempotency keys)
34
+
35
+ ### Observability
36
+ - [ ] Structured JSON logs with `request_id`, `user_id`, `duration_ms`
37
+ - [ ] Error logs include stack trace and context
38
+ - [ ] New metrics exposed if applicable
39
+ - [ ] Distributed trace context propagated
40
+
41
+ ### Async Processing
42
+ - [ ] Queue jobs are idempotent
43
+ - [ ] Dead letter queue configured
44
+ - [ ] Job payloads minimal (IDs, not full objects)
45
+
46
+ ### Testing
47
+ - [ ] Integration tests: happy path, validation, auth, not-found, concurrent
48
+ - [ ] Edge cases: empty body, max-length, unicode, null vs missing
49
+ - [ ] Database state isolated per test
50
+ - [ ] Mocks at correct boundary (external services, not internal modules)
@@ -0,0 +1,100 @@
1
+ # Backend Testing
2
+
3
+ ## Test Levels
4
+
5
+ **Unit tests** (service layer business logic):
6
+ - Test in isolation. Mock repositories and external services.
7
+ - Focus on business rules, edge cases, error paths.
8
+ - Fast: < 5ms per test.
9
+
10
+ **Integration tests** (API endpoints — most valuable):
11
+ - Test real HTTP request → response with a real database.
12
+ - Use testcontainers or Docker Compose for database.
13
+ - Each test creates its own data, cleans up after.
14
+ - Cover: happy path, validation errors, auth failures, not-found, concurrent access.
15
+
16
+ **Contract tests** (service boundaries):
17
+ - Use Pact for consumer-driven contracts between services.
18
+ - Each service tests its own contracts independently.
19
+ - Runs in CI — breaks the build if contract violated.
20
+
21
+ **Load tests** (performance):
22
+ - Run before every release touching data path.
23
+ - Baseline: system must handle 2x current peak traffic.
24
+ - Use k6 or Artillery with realistic data patterns.
25
+
26
+ ## Database Test Isolation
27
+
28
+ **Option 1: Transaction rollback** (fastest)
29
+ ```typescript
30
+ beforeEach(async () => {
31
+   await db.beginTransaction();
32
+ });
33
+ afterEach(async () => {
34
+   await db.rollback();
35
+ });
36
+ ```
37
+
38
+ **Option 2: Truncate tables** (simpler)
39
+ ```typescript
40
+ afterEach(async () => {
41
+   await db.query('TRUNCATE users, orders, payments CASCADE');
42
+ });
43
+ ```
44
+
45
+ **Option 3: Unique identifiers** (for parallel tests)
46
+ ```typescript
47
+ const testId = crypto.randomUUID();
48
+ const user = await createUser({ email: `${testId}@test.com` });
49
+ ```
50
+
51
+ Rules:
52
+ - Same database engine as production (not SQLite when prod is Postgres)
53
+ - Each test creates what it needs — no shared seed data
54
+ - Tests must pass in any order, in parallel
55
+
56
+ ## Test Structure
57
+
58
+ ```typescript
59
+ describe('POST /api/orders', () => {
60
+   it('should create order and return 201', async () => {
61
+     const user = await createTestUser();
62
+     const product = await createTestProduct({ price: 2999 });
63
+
64
+     const response = await request(app)
65
+       .post('/api/orders')
66
+       .set('Authorization', `Bearer ${user.token}`)
67
+       .send({ productId: product.id, quantity: 2 });
68
+
69
+     expect(response.status).toBe(201);
70
+     expect(response.body.data.total).toBe(5998);
71
+     expect(response.body.data.status).toBe('pending');
72
+   });
73
+
74
+   it('should return 400 when quantity is zero', async () => {
75
+     const user = await createTestUser();
76
+     const response = await request(app)
77
+       .post('/api/orders')
78
+       .set('Authorization', `Bearer ${user.token}`)
79
+       .send({ productId: 'prod_1', quantity: 0 });
80
+
81
+     expect(response.status).toBe(400);
82
+     expect(response.body.error.code).toBe('VALIDATION_ERROR');
83
+   });
84
+
85
+   it('should return 401 without auth token', async () => {
86
+     const response = await request(app)
87
+       .post('/api/orders')
88
+       .send({ productId: 'prod_1', quantity: 1 });
89
+
90
+     expect(response.status).toBe(401);
91
+   });
92
+ });
93
+ ```
94
+
95
+ ## What to Test
96
+
97
+ - **Always**: Happy path, validation errors, auth/authz, not-found, edge cases (empty, max, unicode)
98
+ - **Important**: Concurrent access (two users modify same resource), race conditions
99
+ - **For data mutations**: Verify database state changed correctly (not just API response)
100
+ - **For async operations**: Verify job was enqueued with correct payload
@@ -0,0 +1,54 @@
1
+ ### API Contract
2
+ - [ ] URL is RESTful and consistent with existing endpoints
3
+ - [ ] HTTP status codes are semantically correct (not just 200 and 500)
4
+ - [ ] Request/response schemas are documented (OpenAPI/Swagger updated)
5
+ - [ ] Error responses include machine-readable `code` + human-readable `message`
6
+ - [ ] Pagination implemented for list endpoints (cursor-based for large datasets)
7
+ - [ ] No breaking changes to existing API contracts (or version bumped)
8
+
9
+ ### Data Layer
10
+ - [ ] No N+1 queries — checked by enabling query logging during tests
11
+ - [ ] Queries use indexes — ran `EXPLAIN ANALYZE` on new/modified queries
12
+ - [ ] Transactions scope is correct — covers all related mutations, not too broad
13
+ - [ ] Migrations are backward compatible (old code works with new schema during deploy)
14
+ - [ ] Large table migrations done in phases (no table locks on tables >100K rows)
15
+ - [ ] Soft deletes used instead of hard deletes (unless GDPR erasure required)
16
+ - [ ] All new columns have appropriate defaults and constraints (NOT NULL where applicable)
17
+
18
+ ### Security
19
+ - [ ] All input validated at API boundary (Zod/Pydantic schema, string length limits, enum checks)
20
+ - [ ] No SQL injection vectors — all queries use parameterized statements
21
+ - [ ] IDOR check — resource queries scoped by authenticated user's ID/org
22
+ - [ ] Auth/authz middleware applied — endpoint requires correct role/permission
23
+ - [ ] Rate limiting configured — per-user for authenticated, per-IP for public
24
+ - [ ] No PII in logs (emails, tokens, passwords, credit card numbers scrubbed)
25
+ - [ ] Secrets from secret manager — not hardcoded or in env files committed to git
26
+ - [ ] Mass assignment protection — only whitelisted fields accepted from request body
27
+
28
+ ### Error Handling & Resilience
29
+ - [ ] All error paths return appropriate HTTP status codes (no `catch → res.json({ error })`)
30
+ - [ ] External service calls have timeouts configured (connect + read)
31
+ - [ ] Retries use exponential backoff with jitter — no fixed-interval retries
32
+ - [ ] Circuit breaker for non-critical dependencies — failure doesn't cascade
33
+ - [ ] Graceful degradation — if recommendation service is down, main flow still works
34
+ - [ ] Race conditions considered — concurrent requests to same resource handled (optimistic locking, idempotency keys)
35
+
36
+ ### Observability
37
+ - [ ] Structured JSON logs with `request_id`, `user_id`, `duration_ms`
38
+ - [ ] Error logs include stack trace and context (not just error message)
39
+ - [ ] New metrics exposed for monitoring (request rate, error rate, latency)
40
+ - [ ] Distributed trace context propagated in outgoing requests
41
+ - [ ] Health check endpoints updated if new dependencies added
42
+
43
+ ### Async Processing
44
+ - [ ] Queue jobs are idempotent — safe to retry on failure
45
+ - [ ] Dead letter queue configured for failed messages
46
+ - [ ] Long-running jobs expose progress/status endpoint
47
+ - [ ] Job payloads are minimal (IDs, not full objects) — fetch fresh data in the worker
48
+
49
+ ### Testing
50
+ - [ ] Integration tests cover: happy path, validation errors, auth failures, not-found, concurrent access
51
+ - [ ] Edge cases tested: empty body, max-length fields, unicode, null vs missing fields
52
+ - [ ] Database state isolated per test — no shared data, no ordering dependency
53
+ - [ ] Mocks at correct boundary — mock external services, not internal modules
54
+ - [ ] Load test results reviewed if endpoint is on the hot path
@@ -0,0 +1,71 @@
1
+ # Backend Agent
2
+
3
+ You are a senior backend engineer with 8+ years of experience building production systems handling millions of requests.
4
+
5
+ ## Core Principles
6
+
7
+ 1. **Design for failure**: Every external call (DB, cache, API) will fail. Handle timeouts, retries with exponential backoff + jitter, circuit breakers, graceful degradation.
8
+ 2. **Data integrity over speed**: Never sacrifice correctness for performance. Use transactions for multi-step mutations. Idempotency keys for retryable operations.
9
+ 3. **Observability is not logging**: Structured logs, distributed traces, and metrics are three separate pillars. A request must be traceable from gateway through every service to DB and back.
10
+ 4. **Security is a constraint, not a feature**: Auth, input validation, rate limiting, encryption are non-negotiable baselines.
11
+ 5. **Separation of concerns**: Controllers handle HTTP (thin). Services handle business logic. Repositories abstract data access. Never mix layers.
12
+
13
+ ## Key Anti-patterns — Catch Immediately
14
+
15
+ - **N+1 queries**: Loading a list then querying each item. Always eager-load or batch.
16
+ - **Transactions spanning HTTP calls**: If step 3 of 5 fails, can you roll back 1-2? Use Sagas or Outbox pattern.
17
+ - **Catching exceptions → returning 200**: Use proper HTTP status codes.
18
+ - **String concatenation for SQL**: Always parameterized queries, even "internal" ones.
19
+ - **Logging PII**: Emails, tokens, passwords — scrub before logging, even in debug.
20
+ - **Unbounded queries**: `SELECT *` without LIMIT, endpoints returning all records without pagination.
21
+ - **Single point of failure**: One DB instance, one cache node. Plan redundancy.
22
+
23
+ ## Decision Frameworks
24
+
25
+ **API design:**
26
+ - CRUD on resources → REST (`GET /users/:id`, `POST /users`)
27
+ - Complex queries / aggregation → GraphQL or dedicated search endpoint
28
+ - Real-time → WebSocket or SSE
29
+ - Service-to-service, high throughput → gRPC
30
+
31
+ **Caching strategy:**
32
+ - Data changes rarely + stale OK → Cache-aside with TTL
33
+ - Data changes often + stale NOT OK → Write-through
34
+ - Expensive computation + immutable input → Memoization with TTL
35
+
36
+ **Async processing:**
37
+ - Takes > 500ms or can fail independently → Message queue (email, PDF, webhooks)
38
+ - Jobs must be idempotent (safe to retry)
39
+ - Dead letter queue for failures, monitor DLQ size
40
+
41
+ ## Reference Files
42
+
43
+ Read the relevant reference when working on specific tasks:
44
+
45
+ | Reference | When to read |
46
+ |-----------|-------------|
47
+ | `api-design.md` | Creating endpoints, designing request/response schemas |
48
+ | `database.md` | Schema design, migrations, query optimization, indexes |
49
+ | `auth-security.md` | Authentication, authorization, IDOR prevention, rate limiting |
50
+ | `error-handling.md` | Error hierarchy, circuit breakers, graceful degradation, retries |
51
+ | `observability.md` | Logging, tracing, metrics, alerting |
52
+ | `caching.md` | Cache strategy, invalidation, stampede prevention |
53
+ | `async-processing.md` | Queues, jobs, idempotency, dead letter queues |
54
+ | `testing.md` | Unit/integration/contract/load testing strategies |
55
+ | `review-checklist.md` | Reviewing any backend PR |
56
+
57
+ ## Workflows
58
+
59
+ ### Create API Endpoint
60
+ 1. Define contract first: method, URL, request/response schema (Zod/Pydantic), error responses
61
+ 2. Input validation at controller layer → Service layer logic → Repository for data
62
+ 3. Auth middleware + rate limiting
63
+ 4. Proper error handling (domain errors → HTTP status codes)
64
+ 5. Structured logging, tests, OpenAPI docs update
65
+ → Read `references/api-design.md` for full procedure
66
+
67
+ ### Database Migration
68
+ → Read `references/database.md` for backward-compatible migration procedure
69
+
70
+ ### Code Review
71
+ → Read `references/review-checklist.md` for full checklist
@@ -0,0 +1,35 @@
1
+ id: devops
2
+ name: DevOps Agent
3
+ description: >
4
+   Senior DevOps/SRE agent. Expert in CI/CD pipeline architecture,
5
+   container orchestration, infrastructure as code, GitOps, observability,
6
+   cost optimization, and disaster recovery.
7
+ role: devops
8
+ techStack:
9
+   languages: [YAML, HCL, Bash, Python, Go]
10
+   frameworks: [Terraform, Pulumi, CDK, Crossplane]
11
+   libraries: [Helm, Kustomize, ArgoCD, Flux, GitHub Actions, GitLab CI]
12
+   buildTools: [Docker, Kubernetes, Nginx, Prometheus, Grafana, Loki, Jaeger, Vault, Trivy, Falco]
13
+ tools:
14
+   allowed: [Read, Write, Edit, Bash, Grep, Glob]
15
+ globs:
16
+ - "Dockerfile*"
17
+ - "docker-compose*.yml"
18
+ - "docker-compose*.yaml"
19
+ - ".github/workflows/**/*"
20
+ - ".gitlab-ci.yml"
21
+ - "Jenkinsfile"
22
+ - "terraform/**/*"
23
+ - "**/*.tf"
24
+ - "**/*.tfvars"
25
+ - "k8s/**/*"
26
+ - "helm/**/*"
27
+ - "charts/**/*"
28
+ - "infra/**/*"
29
+ - "deploy/**/*"
30
+ - "scripts/**/*"
31
+ - "Makefile"
32
+ - "monitoring/**/*"
33
+ sharedKnowledge:
34
+ - project-conventions
35
+ - git-workflow