aigent-team 0.1.0
- package/LICENSE +21 -0
- package/README.md +253 -0
- package/dist/chunk-N3RYHWTR.js +267 -0
- package/dist/cli.js +576 -0
- package/dist/index.d.ts +234 -0
- package/dist/index.js +27 -0
- package/package.json +67 -0
- package/templates/shared/git-workflow.md +44 -0
- package/templates/shared/project-conventions.md +48 -0
- package/templates/teams/ba/agent.yaml +25 -0
- package/templates/teams/ba/references/acceptance-criteria.md +87 -0
- package/templates/teams/ba/references/api-contract-design.md +110 -0
- package/templates/teams/ba/references/requirements-analysis.md +83 -0
- package/templates/teams/ba/references/user-story-mapping.md +73 -0
- package/templates/teams/ba/skill.md +85 -0
- package/templates/teams/be/agent.yaml +34 -0
- package/templates/teams/be/conventions.md +102 -0
- package/templates/teams/be/references/api-design.md +91 -0
- package/templates/teams/be/references/async-processing.md +86 -0
- package/templates/teams/be/references/auth-security.md +58 -0
- package/templates/teams/be/references/caching.md +79 -0
- package/templates/teams/be/references/database.md +65 -0
- package/templates/teams/be/references/error-handling.md +106 -0
- package/templates/teams/be/references/observability.md +83 -0
- package/templates/teams/be/references/review-checklist.md +50 -0
- package/templates/teams/be/references/testing.md +100 -0
- package/templates/teams/be/review-checklist.md +54 -0
- package/templates/teams/be/skill.md +71 -0
- package/templates/teams/devops/agent.yaml +35 -0
- package/templates/teams/devops/conventions.md +133 -0
- package/templates/teams/devops/references/ci-cd.md +218 -0
- package/templates/teams/devops/references/cost-optimization.md +218 -0
- package/templates/teams/devops/references/disaster-recovery.md +199 -0
- package/templates/teams/devops/references/docker.md +237 -0
- package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
- package/templates/teams/devops/references/kubernetes.md +397 -0
- package/templates/teams/devops/references/monitoring.md +224 -0
- package/templates/teams/devops/references/review-checklist.md +149 -0
- package/templates/teams/devops/references/security.md +225 -0
- package/templates/teams/devops/review-checklist.md +72 -0
- package/templates/teams/devops/skill.md +131 -0
- package/templates/teams/fe/agent.yaml +28 -0
- package/templates/teams/fe/conventions.md +80 -0
- package/templates/teams/fe/references/accessibility.md +92 -0
- package/templates/teams/fe/references/component-architecture.md +87 -0
- package/templates/teams/fe/references/css-styling.md +89 -0
- package/templates/teams/fe/references/forms.md +73 -0
- package/templates/teams/fe/references/performance.md +104 -0
- package/templates/teams/fe/references/review-checklist.md +51 -0
- package/templates/teams/fe/references/security.md +90 -0
- package/templates/teams/fe/references/state-management.md +117 -0
- package/templates/teams/fe/references/testing.md +112 -0
- package/templates/teams/fe/review-checklist.md +53 -0
- package/templates/teams/fe/skill.md +68 -0
- package/templates/teams/lead/agent.yaml +18 -0
- package/templates/teams/lead/references/cross-team-coordination.md +68 -0
- package/templates/teams/lead/references/quality-gates.md +64 -0
- package/templates/teams/lead/references/task-decomposition.md +69 -0
- package/templates/teams/lead/skill.md +83 -0
- package/templates/teams/qa/agent.yaml +32 -0
- package/templates/teams/qa/conventions.md +130 -0
- package/templates/teams/qa/references/ci-integration.md +337 -0
- package/templates/teams/qa/references/e2e-testing.md +292 -0
- package/templates/teams/qa/references/mocking.md +249 -0
- package/templates/teams/qa/references/performance-testing.md +288 -0
- package/templates/teams/qa/references/review-checklist.md +143 -0
- package/templates/teams/qa/references/security-testing.md +271 -0
- package/templates/teams/qa/references/test-data.md +275 -0
- package/templates/teams/qa/references/test-strategy.md +192 -0
- package/templates/teams/qa/review-checklist.md +53 -0
- package/templates/teams/qa/skill.md +131 -0
@@ -0,0 +1,79 @@
# Caching

## Strategy Selection

| Pattern | When to use | How it works |
|---------|------------|--------------|
| **Cache-aside** | Read-heavy, stale data OK | Check cache → miss → query DB → write cache |
| **Write-through** | Data changes often, consistency critical | Write DB + cache simultaneously |
| **Write-behind** | High write volume, eventual consistency OK | Write cache → async flush to DB |
| **Memoization** | Expensive computation, immutable inputs | Compute once → cache result with TTL |

## Cache Key Design

Format: `{service}:{entity}:{id}:{version}`
```
user-service:profile:123:v2
order-service:list:user_456:page_1:sort_date
search:results:hash_of_query_params
```

- Keys must be deterministic (same input → same key)
- Include version to handle schema changes
- Hash complex parameters instead of using them directly
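
One way to satisfy these three rules is sketched below. All function names are hypothetical, and the FNV-1a-style hash (computed over UTF-16 code units) is chosen only to keep the example dependency-free; a production system might prefer a cryptographic digest.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units (non-cryptographic, illustration only).
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

// Serialize params with sorted keys so property order never changes the key.
function stableStringify(params: Record<string, string | number | boolean>): string {
  return Object.keys(params)
    .sort()
    .map((k) => `${k}=${String(params[k])}`)
    .join("&");
}

// Builds `{service}:{entity}:{id}:{version}`.
function buildCacheKey(service: string, entity: string, id: string, version: string): string {
  return [service, entity, id, version].join(":");
}

// For complex query parameters, hash them and use the digest as the id segment.
function buildQueryCacheKey(
  service: string,
  entity: string,
  params: Record<string, string | number | boolean>,
  version: string,
): string {
  return buildCacheKey(service, entity, fnv1a(stableStringify(params)), version);
}
```

Sorting the parameter names before hashing is what makes the key deterministic: `{ q, page }` and `{ page, q }` produce the same key.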

## TTL Guidelines

| Data type | TTL | Reason |
|-----------|-----|--------|
| App config | 1 hour | Changes rarely, OK to be slightly stale |
| User profile | 15 min | Changes occasionally |
| Search results | 5 min | Changes frequently |
| Auth session | Match token expiry | Security-sensitive |
| Never: infinite TTL | - | Memory leak, stale data |

## Cache Stampede Prevention

When a popular cache key expires, thousands of requests simultaneously hit the DB:

**Solution 1: Probabilistic early expiration**
```
actual_ttl = ttl - random(0, ttl * 0.1)
// Some requests refresh before expiry, preventing simultaneous miss
```
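
The formula above shortens each entry's TTL by a random amount of up to 10%, so entries written at the same moment do not all expire together. A minimal sketch, assuming a hypothetical `jitteredTtl` helper with an injectable random source for testing:

```typescript
// Returns the base TTL shortened by a random amount of up to `jitterFraction` of it.
function jitteredTtl(
  ttlSeconds: number,
  jitterFraction: number = 0.1,
  random: () => number = Math.random, // injectable so tests can be deterministic
): number {
  const maxJitter = ttlSeconds * jitterFraction;
  return ttlSeconds - random() * maxJitter;
}
```

With the defaults, a 15-minute TTL (900 s) lands somewhere between 810 s and 900 s.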

**Solution 2: Lock-based recomputation**
```
if cache miss:
  if acquire_lock(key):
    result = query_db()
    cache.set(key, result, ttl)
    release_lock(key)
  else:
    wait_and_retry()  // or return stale data
```
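
Within a single process, the lock can be approximated by sharing one in-flight promise per key. The sketch below is an assumption-laden illustration (`SingleFlight` is a hypothetical name); across several service instances a distributed lock would be needed instead, which is not shown here.

```typescript
// Coalesces concurrent loads for the same key inside ONE process.
class SingleFlight<V> {
  private inFlight = new Map<string, Promise<V>>();

  run(key: string, load: () => Promise<V>): Promise<V> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // another caller holds the "lock": wait for its result

    // Deferring via then() turns a synchronous throw inside load() into a rejection.
    const promise = Promise.resolve().then(load);
    const release = () => {
      this.inFlight.delete(key); // release on success or failure
    };
    promise.then(release, release);
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

Callers that arrive while a load is running wait for the same result, which corresponds to the `wait_and_retry()` branch; returning stale data instead would require keeping the expired value around.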

## Invalidation Strategy

- **Event-driven** (preferred): On write, publish cache-invalidation event
  ```
  orderService.create(order) → publish('order:created', { id: order.id })
  cacheSubscriber.on('order:created') → cache.del(`order:${id}`)
  ```
- **TTL-only** (safety net): Cache expires naturally. Simpler but allows stale reads.
- **Write-through**: Update cache on every write. Consistent but slower writes.

Rules:
- Always set TTL even with event-driven invalidation (safety net)
- Cache invalidation is one of the two hard problems — err on the side of shorter TTL
- Monitor cache hit rate — target > 90%. Below 80% = likely misconfigured TTL or key design

## Multi-layer Caching

```
Request → Local memory cache (1 min) → Redis (15 min) → Database
```

- L1 (in-process): Fastest, but per-instance (not shared). For truly hot data.
- L2 (Redis/Memcached): Shared across instances. Primary cache layer.
- Invalidate both layers on write.
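
The lookup order above might be sketched like this. `TtlStore` and `TwoLayerCache` are hypothetical names, both layers are in-memory maps for illustration, and in a real deployment L2 would be a Redis or Memcached client. The sketch also only clears the local L1; clearing L1 on other instances would need a broadcast that is not shown.

```typescript
type StoreEntry = { value: string; expiresAt: number };

// Minimal TTL store used for both layers in this sketch.
class TtlStore {
  private entries = new Map<string, StoreEntry>();

  get(key: string, now: number): string | undefined {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= now) return undefined;
    return entry.value;
  }

  set(key: string, value: string, ttlMs: number, now: number): void {
    this.entries.set(key, { value, expiresAt: now + ttlMs });
  }

  del(key: string): void {
    this.entries.delete(key);
  }
}

const L1_TTL_MS = 60000;      // 1 min
const L2_TTL_MS = 15 * 60000; // 15 min

class TwoLayerCache {
  constructor(private l1: TtlStore, private l2: TtlStore) {}

  get(key: string, loadFromDb: () => string, now: number = Date.now()): string {
    const fromL1 = this.l1.get(key, now);
    if (fromL1 !== undefined) return fromL1;

    const fromL2 = this.l2.get(key, now);
    if (fromL2 !== undefined) {
      this.l1.set(key, fromL2, L1_TTL_MS, now); // promote to the local layer
      return fromL2;
    }

    const fresh = loadFromDb();
    this.l2.set(key, fresh, L2_TTL_MS, now);
    this.l1.set(key, fresh, L1_TTL_MS, now);
    return fresh;
  }

  // Called on write: drop the key from both layers.
  invalidate(key: string): void {
    this.l1.del(key);
    this.l2.del(key);
  }
}
```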
@@ -0,0 +1,65 @@
# Database

## Schema Standards

- Every table has: `id` (UUID v7 or ULID — sortable, no sequential guessing), `created_at`, `updated_at` timestamps
- Soft deletes: `deleted_at` column. All queries filter `WHERE deleted_at IS NULL`. Never hard-delete unless legally required (GDPR).
- Use `NOT NULL` with defaults wherever possible. Nullable columns require explicit justification.

## Indexes

- Create indexes for every column used in `WHERE`, `JOIN`, or `ORDER BY`
- Composite indexes: left-prefix rule — put the most selective (high-cardinality) column first
- Cover frequently used queries with covering indexes (include all SELECT columns)
- Run `EXPLAIN ANALYZE` on every new/modified query to verify index usage
- Create indexes concurrently in production: `CREATE INDEX CONCURRENTLY` (Postgres) to avoid table locks

## Query Performance

- A single API request should execute ≤ 5 queries. More = missing JOIN or need to denormalize.
- N+1 detection: enable ORM query logging in tests, count queries per endpoint.
- Use `EXPLAIN ANALYZE` to check:
  - Seq Scan on large tables (missing index)
  - Hash Join on large datasets (consider optimization)
  - Sort operations (add index for ORDER BY)
  - High row estimates vs actuals (stale statistics → `ANALYZE`)

## Transactions

- Use for any operation modifying multiple tables
- Scope as narrowly as possible — don't hold locks during HTTP calls
- Read-only queries outside transactions (no unnecessary locking)
- Use optimistic locking (`version` column) for concurrent update scenarios:
  ```sql
  UPDATE orders SET status = 'confirmed', version = version + 1
  WHERE id = :id AND version = :expected_version
  ```
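
The application side of the `UPDATE` above checks the affected-row count: zero rows means another writer changed the record first. The following sketch simulates this with a hypothetical in-memory `OrderTable`; `confirmOrder` and `ConcurrentUpdateError` are likewise illustrative names, and a real implementation would issue the SQL through a database client.

```typescript
type Order = { id: string; status: string; version: number };

// Hypothetical in-memory table standing in for the `orders` table.
class OrderTable {
  private rows = new Map<string, Order>();

  insert(order: Order): void {
    this.rows.set(order.id, { ...order });
  }

  find(id: string): Order | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }

  // Mirrors the UPDATE statement: returns the number of affected rows (0 or 1).
  updateStatus(id: string, status: string, expectedVersion: number): number {
    const row = this.rows.get(id);
    if (!row || row.version !== expectedVersion) return 0;
    row.status = status;
    row.version += 1;
    return 1;
  }
}

class ConcurrentUpdateError extends Error {}

// Read, then write using the version that was read; retry a few times on conflict.
function confirmOrder(table: OrderTable, id: string, maxAttempts: number = 3): Order {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const current = table.find(id);
    if (!current) throw new Error(`order ${id} not found`);
    if (table.updateStatus(id, "confirmed", current.version) === 1) {
      return { ...current, status: "confirmed", version: current.version + 1 };
    }
  }
  throw new ConcurrentUpdateError(`order ${id} changed concurrently`);
}
```

Whether to retry or surface a 409 to the caller after a conflict is a design choice; the retry loop here is only one option.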

## Connection Pooling

- Pool size: `(number_of_cores * 2) + effective_spindle_count` — typically 10-20 per service instance
- Never unlimited connections
- Monitor pool usage — exhaustion causes cascading failures
- Use PgBouncer or application-level pooling for serverless environments
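
As a worked example of the sizing formula, the sketch below computes a pool size and the total connections a fleet of instances would open. The function names are hypothetical, and the core count is usually read as that of the database server, which is an interpretation rather than something this document states.

```typescript
// pool size = (number_of_cores * 2) + effective_spindle_count
function recommendedPoolSize(cores: number, effectiveSpindleCount: number): number {
  if (cores < 1 || effectiveSpindleCount < 0) {
    throw new RangeError("cores must be >= 1 and spindle count >= 0");
  }
  return cores * 2 + effectiveSpindleCount;
}

// Total connections opened against the database by all instances of a service.
function totalConnections(poolSizePerInstance: number, instances: number): number {
  return poolSizePerInstance * instances;
}
```

For 4 cores and 1 effective spindle the formula gives 9 connections; ten instances with that pool size would open 90 connections in total, which should stay below the database's configured connection limit.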

## Migration Safety

**Zero-downtime migration sequence (for breaking changes):**
1. Add new column with default value (backward compatible)
2. Deploy code that writes to BOTH old and new columns
3. Backfill existing data in batches (not in the migration)
4. Deploy code that reads from new column
5. Drop old column (separate migration, after verification)

**Rules:**
- Schema migrations must complete in < 30 seconds
- Data backfills in separate scripts, run in batches of 1000
- Always test `down` migration — verify rollback works
- Large table migrations: `ALTER TABLE ... ADD COLUMN ... DEFAULT` is fast in Postgres 11+ (metadata-only)
- Test with production-scale data volume
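
The batched backfill from step 3 can be sketched as a loop that repeatedly picks up to 1000 unprocessed rows. The row shape and the column names `oldName` and `newName` are hypothetical, and the array stands in for a table; a real script would select and update by primary key range and commit after each batch.

```typescript
type Row = { id: number; oldName: string; newName: string | null };

// Copies `oldName` into `newName` for rows not yet backfilled,
// at most `batchSize` rows per pass. Returns the number of batches executed.
function backfillInBatches(rows: Row[], batchSize: number = 1000): number {
  let batches = 0;
  for (;;) {
    const pending = rows.filter((r) => r.newName === null).slice(0, batchSize);
    if (pending.length === 0) return batches;
    for (const row of pending) {
      row.newName = row.oldName;
    }
    batches++;
    // In a real script: commit here and pause briefly so other queries keep up.
  }
}
```

Because each pass only selects rows that still need work, the script can be stopped and restarted without redoing completed batches.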

## Read Replicas

- Use for reporting, analytics, search queries
- Write to primary only
- Account for replication lag in application code — after write, read from primary for consistency
@@ -0,0 +1,106 @@
# Error Handling & Resilience

## Domain Error Hierarchy

Create typed errors that map to HTTP status codes in one place:
```typescript
abstract class DomainError extends Error {
  abstract statusCode: number;
  abstract code: string;
}

class NotFoundError extends DomainError {
  statusCode = 404;
  code = 'NOT_FOUND';
}

class ConflictError extends DomainError {
  statusCode = 409;
  code = 'CONFLICT';
}

class ValidationError extends DomainError {
  statusCode = 400;
  code = 'VALIDATION_ERROR';
  constructor(public details: Array<{ field: string; message: string }>) {
    super('Validation failed');
  }
}
```

Map in error handler middleware, not in every controller:
```typescript
app.use((err, req, res, next) => {
  if (err instanceof DomainError) {
    return res.status(err.statusCode).json({
      error: { code: err.code, message: err.message }
    });
  }
  // Unknown error = 500, log full stack
  logger.error({ err, requestId: req.id }, 'Unhandled error');
  res.status(500).json({ error: { code: 'INTERNAL_ERROR', message: 'Something went wrong' } });
});
```

## External Service Resilience

**Timeouts:**
- Connect timeout: 3 seconds (can't establish connection = service is down)
- Read timeout: 10 seconds (waiting for response)
- Total timeout: 15 seconds max

**Retries with exponential backoff + jitter:**
```
Attempt 1: immediate
Attempt 2: 1s + random(0-500ms)
Attempt 3: 2s + random(0-500ms)
Attempt 4: 4s + random(0-500ms)
Max: 3 retries
```
Only retry on 5xx and network errors. Never retry 4xx (client error).
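
The schedule and retry rule above could be implemented along these lines. This is a sketch under assumptions: `backoffDelayMs`, `isRetryable`, `statusOf`, and `withRetry` are hypothetical names, and any thrown value without a numeric `status` field is treated as a network error.

```typescript
// Delay before a given attempt (1-based): attempt 1 is immediate,
// attempt n waits 2^(n-2) seconds plus up to 500 ms of random jitter.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  if (attempt <= 1) return 0;
  const baseMs = 1000 * Math.pow(2, attempt - 2);
  return baseMs + Math.floor(random() * 500);
}

// Retry only on network errors (no status) and 5xx responses, never on 4xx.
function isRetryable(status: number | undefined): boolean {
  return status === undefined || status >= 500;
}

// Extracts a numeric `status` field from a thrown value, if it has one.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status?: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

async function withRetry<T>(
  call: () => Promise<T>,
  maxRetries: number = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Give up after the last retry or when the failure is a client error.
      if (attempt > maxRetries || !isRetryable(statusOf(err))) throw err;
      await sleep(backoffDelayMs(attempt + 1));
    }
  }
}
```

With `maxRetries = 3`, a call that keeps failing with 5xx is attempted four times in total, matching the schedule above. The `sleep` parameter is injectable so tests do not have to wait for real delays.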

**Circuit breaker:**
- **Closed** (normal): requests pass through. Track failure rate.
- **Open** (failing): requests fail immediately without calling the service. Timer starts.
- **Half-open** (testing): allow one request through. If it succeeds → close. If it fails → open again.
- Trigger: 5 consecutive failures or >50% failure rate in 10-second window.
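
The three states can be sketched as a small state machine. This is an illustrative sketch, not a production breaker: `CircuitBreaker` is a hypothetical class, it wraps synchronous functions to stay short, it implements only the consecutive-failure trigger (not the failure-rate window), and the 30-second open duration is an assumption because the text above does not give the timer length.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number = 5,   // 5 consecutive failures trip the breaker
    private openDurationMs: number = 30000, // assumed timer length
    private now: () => number = Date.now,   // injectable clock for tests
  ) {}

  getState(): BreakerState {
    // Once the timer has elapsed, the next request is allowed through as a probe.
    if (this.state === "open" && this.now() - this.openedAt >= this.openDurationMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  call<T>(fn: () => T): T {
    if (this.getState() === "open") {
      throw new Error("circuit open: failing fast");
    }
    try {
      const result = fn();
      this.consecutiveFailures = 0;
      this.state = "closed"; // a success in half-open closes the circuit
      return result;
    } catch (err) {
      this.consecutiveFailures++;
      if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
        this.state = "open"; // a failed probe or too many failures opens it again
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```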

**Graceful degradation:**
- Non-critical service down (recommendations, analytics) → main flow continues, skip the feature
- Critical service down (auth, payment) → fail with clear error, retry mechanism
- Return cached data when possible for degraded responses

## Health Checks

```
GET /health/live  → { status: "ok" }    // Process alive (K8s liveness probe)
GET /health/ready → { status: "ok",     // Can serve traffic (K8s readiness)
                       checks: {
                         database: "ok",
                         cache: "ok",
                         queue: "ok"
                       }}
```
- Liveness: never check dependencies (prevents cascade restarts)
- Readiness: check all critical dependencies
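
A readiness handler might aggregate dependency checks as sketched below. The names `readiness` and `readinessStatusCode` are hypothetical, the `"fail"` value is an assumption (the example above only shows `"ok"`), and returning 503 when not ready is a common convention rather than something this document specifies.

```typescript
type CheckResult = "ok" | "fail";
type ReadinessReport = { status: CheckResult; checks: Record<string, CheckResult> };

// Runs every dependency check; a check that throws or returns false is reported as "fail".
async function readiness(
  checks: Record<string, () => Promise<boolean>>,
): Promise<ReadinessReport> {
  const names = Object.keys(checks);
  const results = await Promise.all(
    names.map(async (name): Promise<CheckResult> => {
      try {
        return (await checks[name]()) ? "ok" : "fail";
      } catch (err) {
        return "fail";
      }
    }),
  );
  const report: Record<string, CheckResult> = {};
  names.forEach((name, i) => {
    report[name] = results[i];
  });
  const allOk = results.every((r) => r === "ok");
  return { status: allOk ? "ok" : "fail", checks: report };
}

// HTTP status for the readiness endpoint: 200 when ready, 503 otherwise.
function readinessStatusCode(report: ReadinessReport): number {
  return report.status === "ok" ? 200 : 503;
}
```

The liveness endpoint deliberately has no counterpart here: it returns a constant response without touching any dependency.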

## Graceful Shutdown

On SIGTERM:
1. Stop accepting new requests (remove from load balancer)
2. Finish in-flight requests (30 second timeout)
3. Close database connections
4. Close queue connections
5. Exit process

```typescript
process.on('SIGTERM', async () => {
  server.close();  // Stop accepting
  await Promise.race([
    finishInFlightRequests(),
    new Promise(resolve => setTimeout(resolve, 30000)),  // 30s max
  ]);
  await db.disconnect();
  await queue.disconnect();  // Step 4; `queue` is a hypothetical queue client
  process.exit(0);
});
```
@@ -0,0 +1,83 @@
# Observability

## Three Pillars

### 1. Structured Logging

Every log entry must include:
```json
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "info",
  "message": "Order created",
  "service": "order-service",
  "environment": "production",
  "request_id": "req_abc123",
  "user_id": "usr_456",
  "duration_ms": 45,
  "order_id": "ord_789"
}
```

**Log levels:**
- `ERROR`: Something broke, requires investigation. Pages on-call in production.
- `WARN`: Unexpected but handled. Rate limit hit, cache miss on hot key, slow query.
- `INFO`: Significant business events: user registered, order placed, payment processed.
- `DEBUG`: Development only. Never enable in production (performance + noise).

**Never log:** passwords, tokens, credit card numbers, full email addresses, session IDs.
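
One way to enforce the never-log list is to scrub fields before they reach the logger. The sketch below is a minimal illustration: `redact`, `maskEmail`, and the key list are hypothetical, the matching is by exact field name only, and nested objects are not traversed.

```typescript
// Field names treated as sensitive (an assumed list; extend it for a real service).
const SENSITIVE_KEYS = ["password", "token", "authorization", "card_number", "session_id"];

// Masks an email so only the first character and the domain remain.
function maskEmail(email: string): string {
  const at = email.indexOf("@");
  if (at <= 0) return "[REDACTED]";
  return `${email[0]}***${email.slice(at)}`;
}

// Returns a shallow copy of the log fields with sensitive values replaced.
function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    const value = fields[key];
    const lower = key.toLowerCase();
    if (SENSITIVE_KEYS.indexOf(lower) !== -1) {
      out[key] = "[REDACTED]";
    } else if (lower === "email" && typeof value === "string") {
      out[key] = maskEmail(value);
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Calling `redact` inside the logger wrapper, rather than at each call site, keeps the rule in one place.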

### 2. Distributed Tracing

- Propagate trace context (`traceparent` header) across all service calls
- Every outgoing HTTP, gRPC, queue message carries the trace ID
- Instrument: incoming requests, outgoing HTTP calls, database queries, cache operations, queue publish/consume
- Sample rate: 100% for errors, 10% for success in production
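
The sampling rule can be expressed as a small decision function. This is a sketch with a hypothetical `shouldSample` helper; it assumes the decision is made after the outcome is known (tail-based sampling), since a decision taken at the start of a request cannot yet know whether it will fail.

```typescript
// Keep every errored trace and a fraction of successful ones.
function shouldSample(
  isError: boolean,
  successRate: number = 0.1,
  random: () => number = Math.random, // injectable so tests can be deterministic
): boolean {
  if (isError) return true;
  return random() < successRate;
}
```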

### 3. Metrics (Prometheus format)

Essential metrics to expose:
- `http_requests_total{method, path, status}` — request rate
- `http_request_duration_seconds{method, path}` — latency histogram (p50, p95, p99)
- `http_requests_in_flight` — active connections
- `db_query_duration_seconds{query}` — database latency
- `cache_hits_total` / `cache_misses_total` — cache effectiveness
- `queue_depth{queue}` — backlog size
- `circuit_breaker_state{service}` — open/closed/half-open

## Alerting

**Alert on symptoms (user impact), not causes:**
- GOOD: "Error rate > 1% for 5 minutes"
- BAD: "CPU > 80%" (might be normal)
- GOOD: "p99 latency > 2s for 10 minutes"
- BAD: "Memory > 70%" (might be normal for JVM)

**Severity levels:**
| Level | Action | Example |
|-------|--------|---------|
| P1 Critical | Page on-call NOW | Error rate > 5%, service down |
| P2 High | Slack team channel | Error rate > 1%, p99 > 2x baseline |
| P3 Medium | Create ticket | Disk > 80%, cert expires in 14d |
| P4 Low | Dashboard only | Cost anomaly, deprecation warning |

Every P1/P2 alert must have a runbook.

## Request Tracing Pattern

```typescript
// Middleware: create request context
app.use((req, res, next) => {
  req.id = req.headers['x-request-id'] || crypto.randomUUID();
  res.setHeader('x-request-id', req.id);
  next();
});

// Logger: auto-include request context
const logger = createLogger({
  defaultMeta: { service: 'api', env: process.env.NODE_ENV },
});

// Usage: always include request_id
logger.info({ request_id: req.id, user_id: user.id }, 'Order created');
```
@@ -0,0 +1,50 @@
# Backend Review Checklist

### API Contract
- [ ] URL is RESTful and consistent with existing endpoints
- [ ] HTTP status codes semantically correct (not just 200 and 500)
- [ ] Request/response schemas documented (OpenAPI/Swagger updated)
- [ ] Error responses include machine-readable `code` + human-readable `message`
- [ ] Pagination for list endpoints (cursor-based for large datasets)
- [ ] No breaking changes to existing API contracts

### Data Layer
- [ ] No N+1 queries — verified with query logging
- [ ] Queries use indexes — ran `EXPLAIN ANALYZE` on new queries
- [ ] Transactions scoped correctly (cover related mutations, not too broad)
- [ ] Migrations backward compatible (old code works with new schema during deploy)
- [ ] Large table migrations done in phases (no locks on tables > 100K rows)
- [ ] Soft deletes used (unless GDPR erasure required)

### Security
- [ ] All input validated at API boundary (schema validation, string length limits)
- [ ] No SQL injection vectors — parameterized queries only
- [ ] IDOR check — resource queries scoped by authenticated user
- [ ] Auth/authz middleware applied with correct role/permission
- [ ] Rate limiting configured
- [ ] No PII in logs
- [ ] Secrets from secret manager, not hardcoded

### Error Handling & Resilience
- [ ] All error paths return appropriate HTTP status codes
- [ ] External service calls have timeouts (connect + read)
- [ ] Retries use exponential backoff with jitter
- [ ] Circuit breaker for non-critical dependencies
- [ ] Race conditions considered (optimistic locking, idempotency keys)

### Observability
- [ ] Structured JSON logs with `request_id`, `user_id`, `duration_ms`
- [ ] Error logs include stack trace and context
- [ ] New metrics exposed if applicable
- [ ] Distributed trace context propagated

### Async Processing
- [ ] Queue jobs are idempotent
- [ ] Dead letter queue configured
- [ ] Job payloads minimal (IDs, not full objects)

### Testing
- [ ] Integration tests: happy path, validation, auth, not-found, concurrent
- [ ] Edge cases: empty body, max-length, unicode, null vs missing
- [ ] Database state isolated per test
- [ ] Mocks at correct boundary (external services, not internal modules)
@@ -0,0 +1,100 @@
# Backend Testing

## Test Levels

**Unit tests** (service layer business logic):
- Test in isolation. Mock repositories and external services.
- Focus on business rules, edge cases, error paths.
- Fast: < 5ms per test.

**Integration tests** (API endpoints — most valuable):
- Test real HTTP request → response with a real database.
- Use testcontainers or Docker Compose for database.
- Each test creates its own data, cleans up after.
- Cover: happy path, validation errors, auth failures, not-found, concurrent access.

**Contract tests** (service boundaries):
- Use Pact for consumer-driven contracts between services.
- Each service tests its own contracts independently.
- Runs in CI — breaks the build if contract violated.

**Load tests** (performance):
- Run before every release touching data path.
- Baseline: system must handle 2x current peak traffic.
- Use k6 or Artillery with realistic data patterns.

## Database Test Isolation

**Option 1: Transaction rollback** (fastest)
```typescript
beforeEach(async () => {
  await db.beginTransaction();
});
afterEach(async () => {
  await db.rollback();
});
```

**Option 2: Truncate tables** (simpler)
```typescript
afterEach(async () => {
  await db.query('TRUNCATE users, orders, payments CASCADE');
});
```

**Option 3: Unique identifiers** (for parallel tests)
```typescript
const testId = crypto.randomUUID();
const user = await createUser({ email: `${testId}@test.com` });
```

Rules:
- Same database engine as production (not SQLite when prod is Postgres)
- Each test creates what it needs — no shared seed data
- Tests must pass in any order, in parallel

## Test Structure

```typescript
describe('POST /api/orders', () => {
  it('should create order and return 201', async () => {
    const user = await createTestUser();
    const product = await createTestProduct({ price: 2999 });

    const response = await request(app)
      .post('/api/orders')
      .set('Authorization', `Bearer ${user.token}`)
      .send({ productId: product.id, quantity: 2 });

    expect(response.status).toBe(201);
    expect(response.body.data.total).toBe(5998);
    expect(response.body.data.status).toBe('pending');
  });

  it('should return 400 when quantity is zero', async () => {
    const user = await createTestUser();
    const response = await request(app)
      .post('/api/orders')
      .set('Authorization', `Bearer ${user.token}`)
      .send({ productId: 'prod_1', quantity: 0 });

    expect(response.status).toBe(400);
    expect(response.body.error.code).toBe('VALIDATION_ERROR');
  });

  it('should return 401 without auth token', async () => {
    const response = await request(app)
      .post('/api/orders')
      .send({ productId: 'prod_1', quantity: 1 });

    expect(response.status).toBe(401);
  });
});
```

## What to Test

- **Always**: Happy path, validation errors, auth/authz, not-found, edge cases (empty, max, unicode)
- **Important**: Concurrent access (two users modify same resource), race conditions
- **For data mutations**: Verify database state changed correctly (not just API response)
- **For async operations**: Verify job was enqueued with correct payload
@@ -0,0 +1,54 @@
### API Contract
- [ ] URL is RESTful and consistent with existing endpoints
- [ ] HTTP status codes are semantically correct (not just 200 and 500)
- [ ] Request/response schemas are documented (OpenAPI/Swagger updated)
- [ ] Error responses include machine-readable `code` + human-readable `message`
- [ ] Pagination implemented for list endpoints (cursor-based for large datasets)
- [ ] No breaking changes to existing API contracts (or version bumped)

### Data Layer
- [ ] No N+1 queries — checked by enabling query logging during tests
- [ ] Queries use indexes — ran `EXPLAIN ANALYZE` on new/modified queries
- [ ] Transactions scope is correct — covers all related mutations, not too broad
- [ ] Migrations are backward compatible (old code works with new schema during deploy)
- [ ] Large table migrations done in phases (no table locks on tables >100K rows)
- [ ] Soft deletes used instead of hard deletes (unless GDPR erasure required)
- [ ] All new columns have appropriate defaults and constraints (NOT NULL where applicable)

### Security
- [ ] All input validated at API boundary (Zod/Pydantic schema, string length limits, enum checks)
- [ ] No SQL injection vectors — all queries use parameterized statements
- [ ] IDOR check — resource queries scoped by authenticated user's ID/org
- [ ] Auth/authz middleware applied — endpoint requires correct role/permission
- [ ] Rate limiting configured — per-user for authenticated, per-IP for public
- [ ] No PII in logs (emails, tokens, passwords, credit card numbers scrubbed)
- [ ] Secrets from secret manager — not hardcoded or in env files committed to git
- [ ] Mass assignment protection — only whitelisted fields accepted from request body

### Error Handling & Resilience
- [ ] All error paths return appropriate HTTP status codes (no `catch → res.json({ error })`)
- [ ] External service calls have timeouts configured (connect + read)
- [ ] Retries use exponential backoff with jitter — no fixed-interval retries
- [ ] Circuit breaker for non-critical dependencies — failure doesn't cascade
- [ ] Graceful degradation — if recommendation service is down, main flow still works
- [ ] Race conditions considered — concurrent requests to same resource handled (optimistic locking, idempotency keys)

### Observability
- [ ] Structured JSON logs with `request_id`, `user_id`, `duration_ms`
- [ ] Error logs include stack trace and context (not just error message)
- [ ] New metrics exposed for monitoring (request rate, error rate, latency)
- [ ] Distributed trace context propagated in outgoing requests
- [ ] Health check endpoints updated if new dependencies added

### Async Processing
- [ ] Queue jobs are idempotent — safe to retry on failure
- [ ] Dead letter queue configured for failed messages
- [ ] Long-running jobs expose progress/status endpoint
- [ ] Job payloads are minimal (IDs, not full objects) — fetch fresh data in the worker

### Testing
- [ ] Integration tests cover: happy path, validation errors, auth failures, not-found, concurrent access
- [ ] Edge cases tested: empty body, max-length fields, unicode, null vs missing fields
- [ ] Database state isolated per test — no shared data, no ordering dependency
- [ ] Mocks at correct boundary — mock external services, not internal modules
- [ ] Load test results reviewed if endpoint is on the hot path
@@ -0,0 +1,71 @@
# Backend Agent

You are a senior backend engineer with 8+ years of experience building production systems handling millions of requests.

## Core Principles

1. **Design for failure**: Every external call (DB, cache, API) will fail. Handle timeouts, retries with exponential backoff + jitter, circuit breakers, graceful degradation.
2. **Data integrity over speed**: Never sacrifice correctness for performance. Use transactions for multi-step mutations. Idempotency keys for retryable operations.
3. **Observability is not logging**: Structured logs, distributed traces, and metrics are three separate pillars. A request must be traceable from gateway through every service to DB and back.
4. **Security is a constraint, not a feature**: Auth, input validation, rate limiting, encryption are non-negotiable baselines.
5. **Separation of concerns**: Controllers handle HTTP (thin). Services handle business logic. Repositories abstract data access. Never mix layers.

## Key Anti-patterns — Catch Immediately

- **N+1 queries**: Loading a list then querying each item. Always eager-load or batch.
- **Transactions spanning HTTP calls**: If step 3 of 5 fails, can you roll back 1-2? Use Sagas or Outbox pattern.
- **Catching exceptions → returning 200**: Use proper HTTP status codes.
- **String concatenation for SQL**: Always parameterized queries, even "internal" ones.
- **Logging PII**: Emails, tokens, passwords — scrub before logging, even in debug.
- **Unbounded queries**: `SELECT *` without LIMIT, endpoints returning all records without pagination.
- **Single point of failure**: One DB instance, one cache node. Plan redundancy.

## Decision Frameworks

**API design:**
- CRUD on resources → REST (`GET /users/:id`, `POST /users`)
- Complex queries / aggregation → GraphQL or dedicated search endpoint
- Real-time → WebSocket or SSE
- Service-to-service, high throughput → gRPC

**Caching strategy:**
- Data changes rarely + stale OK → Cache-aside with TTL
- Data changes often + stale NOT OK → Write-through
- Expensive computation + immutable input → Memoization with TTL

**Async processing:**
- Takes > 500ms or can fail independently → Message queue (email, PDF, webhooks)
- Jobs must be idempotent (safe to retry)
- Dead letter queue for failures, monitor DLQ size

## Reference Files

Read the relevant reference when working on specific tasks:

| Reference | When to read |
|-----------|-------------|
| `api-design.md` | Creating endpoints, designing request/response schemas |
| `database.md` | Schema design, migrations, query optimization, indexes |
| `auth-security.md` | Authentication, authorization, IDOR prevention, rate limiting |
| `error-handling.md` | Error hierarchy, circuit breakers, graceful degradation, retries |
| `observability.md` | Logging, tracing, metrics, alerting |
| `caching.md` | Cache strategy, invalidation, stampede prevention |
| `async-processing.md` | Queues, jobs, idempotency, dead letter queues |
| `testing.md` | Unit/integration/contract/load testing strategies |
| `review-checklist.md` | Reviewing any backend PR |

## Workflows

### Create API Endpoint
1. Define contract first: method, URL, request/response schema (Zod/Pydantic), error responses
2. Input validation at controller layer → Service layer logic → Repository for data
3. Auth middleware + rate limiting
4. Proper error handling (domain errors → HTTP status codes)
5. Structured logging, tests, OpenAPI docs update
→ Read `references/api-design.md` for full procedure

### Database Migration
→ Read `references/database.md` for backward-compatible migration procedure

### Code Review
→ Read `references/review-checklist.md` for full checklist
@@ -0,0 +1,35 @@
id: devops
name: DevOps Agent
description: >
  Senior DevOps/SRE agent. Expert in CI/CD pipeline architecture,
  container orchestration, infrastructure as code, GitOps, observability,
  cost optimization, and disaster recovery.
role: devops
techStack:
  languages: [YAML, HCL, Bash, Python, Go]
  frameworks: [Terraform, Pulumi, CDK, Crossplane]
  libraries: [Helm, Kustomize, ArgoCD, Flux, GitHub Actions, GitLab CI]
  buildTools: [Docker, Kubernetes, Nginx, Prometheus, Grafana, Loki, Jaeger, Vault, Trivy, Falco]
tools:
  allowed: [Read, Write, Edit, Bash, Grep, Glob]
globs:
  - "Dockerfile*"
  - "docker-compose*.yml"
  - "docker-compose*.yaml"
  - ".github/workflows/**/*"
  - ".gitlab-ci.yml"
  - "Jenkinsfile"
  - "terraform/**/*"
  - "**/*.tf"
  - "**/*.tfvars"
  - "k8s/**/*"
  - "helm/**/*"
  - "charts/**/*"
  - "infra/**/*"
  - "deploy/**/*"
  - "scripts/**/*"
  - "Makefile"
  - "monitoring/**/*"
sharedKnowledge:
  - project-conventions
  - git-workflow