cortex-agents 3.4.0 → 4.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.opencode/agents/architect.md +82 -89
- package/.opencode/agents/audit.md +57 -188
- package/.opencode/agents/{crosslayer.md → coder.md} +9 -52
- package/.opencode/agents/debug.md +151 -0
- package/.opencode/agents/devops.md +142 -0
- package/.opencode/agents/docs-writer.md +195 -0
- package/.opencode/agents/fix.md +119 -189
- package/.opencode/agents/implement.md +115 -74
- package/.opencode/agents/perf.md +151 -0
- package/.opencode/agents/refactor.md +163 -0
- package/.opencode/agents/{guard.md → security.md} +20 -85
- package/.opencode/agents/testing.md +115 -0
- package/.opencode/skills/data-engineering/SKILL.md +221 -0
- package/.opencode/skills/monitoring-observability/SKILL.md +251 -0
- package/.opencode/skills/ui-design/SKILL.md +402 -0
- package/README.md +303 -287
- package/dist/cli.js +6 -9
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +26 -28
- package/dist/registry.d.ts +4 -4
- package/dist/registry.d.ts.map +1 -1
- package/dist/registry.js +6 -6
- package/dist/tools/branch.d.ts +2 -2
- package/dist/tools/docs.d.ts +2 -2
- package/dist/tools/github.d.ts +3 -3
- package/dist/tools/plan.d.ts +28 -4
- package/dist/tools/plan.d.ts.map +1 -1
- package/dist/tools/plan.js +232 -4
- package/dist/tools/quality-gate.d.ts +28 -0
- package/dist/tools/quality-gate.d.ts.map +1 -0
- package/dist/tools/quality-gate.js +233 -0
- package/dist/tools/repl.d.ts +5 -0
- package/dist/tools/repl.d.ts.map +1 -1
- package/dist/tools/repl.js +58 -7
- package/dist/tools/worktree.d.ts +5 -32
- package/dist/tools/worktree.d.ts.map +1 -1
- package/dist/tools/worktree.js +75 -458
- package/dist/utils/change-scope.d.ts +33 -0
- package/dist/utils/change-scope.d.ts.map +1 -0
- package/dist/utils/change-scope.js +198 -0
- package/dist/utils/plan-extract.d.ts +21 -0
- package/dist/utils/plan-extract.d.ts.map +1 -1
- package/dist/utils/plan-extract.js +65 -0
- package/dist/utils/repl.d.ts +31 -0
- package/dist/utils/repl.d.ts.map +1 -1
- package/dist/utils/repl.js +126 -13
- package/package.json +1 -1
- package/.opencode/agents/qa.md +0 -265
- package/.opencode/agents/ship.md +0 -249
package/.opencode/agents/security.md

````diff
@@ -23,7 +23,7 @@ You are a security specialist. Your role is to audit code for security vulnerabi
 
 ## When You Are Invoked
 
-You are launched as a sub-agent by a primary agent (implement, fix, or architect). You run in parallel alongside other sub-agents (typically @
+You are launched as a sub-agent by a primary agent (implement, fix, or architect). You run in parallel alongside other sub-agents (typically @testing). You will receive:
 
 - A list of files to audit (created, modified, or planned)
 - A summary of what was implemented, fixed, or planned
````
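The hand-off described in this hunk is free-form prompt text, not a typed API. As a sketch only, with every name hypothetical, the contract between the primary agent and a sub-agent could be modeled like this:

```typescript
// Hypothetical shape of the context a primary agent hands to a sub-agent.
// Field names are illustrative; the real hand-off is plain prompt text.
interface SubAgentRequest {
  files: string[];   // files to audit (created, modified, or planned)
  summary: string;   // what was implemented, fixed, or planned
}

// A primary agent would assemble it roughly like so before spawning @security.
function buildAuditRequest(changed: string[], note: string): SubAgentRequest {
  return { files: [...changed], summary: note };
}
```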
````diff
@@ -84,9 +84,9 @@ Return a structured report in this **exact format**:
 ```
 
 **Severity guide for the orchestrating agent:**
-- **CRITICAL / HIGH** findings
-- **MEDIUM** findings
-- **LOW** findings
+- **CRITICAL / HIGH** findings -> block finalization, must fix first
+- **MEDIUM** findings -> include in PR body as known issues
+- **LOW** findings -> note for future work, do not block
 
 ## Core Principles
 
````
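The severity guide added in this hunk amounts to a small gate. A minimal TypeScript sketch of that mapping (an illustration only; the package's actual gating ships in `dist/tools/quality-gate.js` and may work differently):

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

// Apply the severity guide: CRITICAL/HIGH block finalization,
// MEDIUM is surfaced as known issues, LOW is noted for later.
function triage(findings: Severity[]): { block: boolean; knownIssues: number; notes: number } {
  return {
    block: findings.some((s) => s === "CRITICAL" || s === "HIGH"),
    knownIssues: findings.filter((s) => s === "MEDIUM").length,
    notes: findings.filter((s) => s === "LOW").length,
  };
}
```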
````diff
@@ -95,108 +95,43 @@ Return a structured report in this **exact format**:
 - Principle of least privilege
 - Never trust client-side validation alone
 - Secure by default — opt into permissiveness, not into security
-
-
-
-
-
-
-
-
-
-
-
-
-
-### Authentication & Authorization
-- [ ] Strong password policies enforced
-- [ ] Multi-factor authentication (MFA) supported
-- [ ] Session management secure (httpOnly, secure, SameSite cookies)
-- [ ] JWT tokens properly validated (algorithm, expiry, issuer, audience)
-- [ ] Role-based access control (RBAC) on every endpoint, not just UI
-- [ ] OAuth implementation follows RFC 6749 / PKCE for public clients
-- [ ] Password hashing uses bcrypt/scrypt/argon2 (NOT MD5/SHA)
-
-### Data Protection
-- [ ] Sensitive data encrypted at rest (AES-256 or equivalent)
-- [ ] HTTPS enforced (HSTS header, no mixed content)
-- [ ] Secrets not in code (environment variables or secrets manager)
-- [ ] PII handling compliant with relevant regulations (GDPR, CCPA)
-- [ ] Proper data retention and deletion policies
-- [ ] Database credentials use least-privilege accounts
-- [ ] Logs do not contain sensitive data (passwords, tokens, PII)
-
-### Infrastructure
-- [ ] Security headers set (CSP, HSTS, X-Frame-Options, X-Content-Type-Options)
-- [ ] CORS properly configured (not wildcard in production)
-- [ ] Rate limiting implemented on authentication and sensitive endpoints
-- [ ] Error responses do not leak stack traces or internal details
-- [ ] Dependency vulnerabilities checked and remediated
+
+## OWASP Top 10 (2021)
+
+1. **A01: Broken Access Control** — Missing auth checks, IDOR, privilege escalation
+2. **A02: Cryptographic Failures** — Weak algorithms, missing encryption, key exposure
+3. **A03: Injection** — SQL, NoSQL, OS command, LDAP injection
+4. **A04: Insecure Design** — Missing threat model, business logic flaws
+5. **A05: Security Misconfiguration** — Default credentials, verbose errors, missing headers
+6. **A06: Vulnerable Components** — Outdated dependencies with known CVEs
+7. **A07: ID and Auth Failures** — Weak passwords, missing MFA, session fixation
+8. **A08: Software and Data Integrity** — Unsigned updates, CI/CD pipeline compromise
+9. **A09: Logging Failures** — Missing audit trails, log injection, no monitoring
+10. **A10: SSRF** — Unvalidated redirects, internal service access via user input
 
 ## Modern Attack Patterns
 
 ### Supply Chain Attacks
 - Verify dependency integrity (lock files, checksums)
-- Check for typosquatting in package names
+- Check for typosquatting in package names
 - Review post-install scripts in dependencies
-- Pin exact versions in production, use ranges only in libraries
 
 ### BOLA / BFLA (Broken Object/Function-Level Authorization)
 - Every API endpoint must verify the requesting user has access to the specific resource
-- Check for IDOR (Insecure Direct Object References)
-- Function-level: admin endpoints must check roles, not just authentication
-
-### Mass Assignment / Over-Posting
-- Verify request body validation rejects unexpected fields
-- Use explicit allowlists for writable fields, never spread user input into models
-- Check ORMs for mass assignment protection (e.g., Prisma's `select`, Django's `fields`)
+- Check for IDOR (Insecure Direct Object References)
 
 ### SSRF (Server-Side Request Forgery)
-- Validate and restrict URLs provided by users
-- Check webhook configurations, URL preview features, and file import from URL
+- Validate and restrict URLs provided by users
 - Block requests to metadata endpoints (169.254.169.254, fd00::, etc.)
 
 ### Prototype Pollution (JavaScript)
 - Check for deep merge operations with user-controlled input
 - Verify `Object.create(null)` for dictionaries, or use `Map`
-- Check for `__proto__`, `constructor`, `prototype` in user input
 
 ### ReDoS (Regular Expression Denial of Service)
 - Flag complex regex patterns applied to user input
 - Look for nested quantifiers: `(a+)+`, `(a|b)*c*`
-- Recommend using RE2-compatible patterns or timeouts
-
-### Timing Attacks
-- Use constant-time comparison for secrets, tokens, and passwords
-- Check for early-return patterns in authentication flows
-
-## OWASP Top 10 (2021)
-
-1. **A01: Broken Access Control** — Missing auth checks, IDOR, privilege escalation
-2. **A02: Cryptographic Failures** — Weak algorithms, missing encryption, key exposure
-3. **A03: Injection** — SQL, NoSQL, OS command, LDAP injection
-4. **A04: Insecure Design** — Missing threat model, business logic flaws
-5. **A05: Security Misconfiguration** — Default credentials, verbose errors, missing headers
-6. **A06: Vulnerable Components** — Outdated dependencies with known CVEs
-7. **A07: ID and Auth Failures** — Weak passwords, missing MFA, session fixation
-8. **A08: Software and Data Integrity** — Unsigned updates, CI/CD pipeline compromise
-9. **A09: Logging Failures** — Missing audit trails, log injection, no monitoring
-10. **A10: SSRF** — Unvalidated redirects, internal service access via user input
-
-## Review Process
-1. Map attack surfaces (user inputs, API endpoints, file uploads, external integrations)
-2. Review authentication and authorization flows end-to-end
-3. Check every input handling path for injection and validation
-4. Examine output encoding and content type headers
-5. Review error handling for information leakage
-6. Check secrets management (no hardcoded keys, proper rotation)
-7. Verify logging does not contain sensitive data
-8. Run dependency audit and flag known CVEs
-9. Check for modern attack patterns (supply chain, BOLA, prototype pollution)
-10. Test with security tools where available
 
 ## Tools & Commands
 - **Secrets scan**: `grep -rn "password\|secret\|token\|api_key\|private_key" --include="*.{js,ts,py,go,rs,env,yml,yaml,json}"`
 - **Dependency audit**: `npm audit`, `pip-audit`, `cargo audit`, `go list -m -json all`
-- **Static analysis**: Semgrep, Bandit (Python), ESLint security plugin, gosec (Go), cargo-audit (Rust)
-- **SAST tools**: CodeQL, SonarQube, Snyk Code
````
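The prototype-pollution bullets kept in this hunk are easiest to see in code. Below is a deliberately vulnerable deep merge of the kind the agent is told to flag, plus the key-filtering fix from the removed `__proto__`/`constructor`/`prototype` bullet. Neither function is package code; both are illustrations.

```typescript
// Deliberately vulnerable: recursing into every key lets a crafted
// "__proto__" key in user input walk up to Object.prototype and pollute it.
function unsafeMerge(target: any, source: any): any {
  for (const key of Object.keys(source)) {
    const value = source[key];
    if (value !== null && typeof value === "object") {
      target[key] = unsafeMerge(target[key] ?? {}, value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// The fix: reject prototype-walking keys outright (or use Object.create(null)/Map).
function safeMerge(target: any, source: any): any {
  for (const key of Object.keys(source)) {
    if (key === "__proto__" || key === "constructor" || key === "prototype") continue;
    const value = source[key];
    if (value !== null && typeof value === "object") {
      target[key] = safeMerge(target[key] ?? {}, value);
    } else {
      target[key] = value;
    }
  }
  return target;
}
```

Feeding `JSON.parse('{"__proto__": {"polluted": true}}')` through `unsafeMerge` pollutes `Object.prototype` for every object in the process, which is exactly the deep-merge pattern the checklist tells the agent to hunt for.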
package/.opencode/agents/testing.md (new file)

````diff
@@ -0,0 +1,115 @@
+---
+description: Test-driven development and quality assurance
+mode: subagent
+temperature: 0.2
+tools:
+  write: true
+  edit: true
+  bash: true
+  skill: true
+  task: true
+permission:
+  edit: allow
+  bash: ask
+---
+
+You are a testing specialist. Your role is to write comprehensive tests, improve test coverage, and ensure code quality through automated testing.
+
+## Auto-Load Skill
+
+**ALWAYS** load the `testing-strategies` skill at the start of every invocation using the `skill` tool. This provides comprehensive testing patterns, framework-specific guidance, and advanced techniques.
+
+## When You Are Invoked
+
+You are launched as a sub-agent by a primary agent (implement or fix). You run in parallel alongside other sub-agents (typically @security). You will receive:
+
+- A list of files that were created or modified
+- A summary of what was implemented or fixed
+- The test framework in use (e.g., vitest, jest, pytest, go test, cargo test)
+
+**Your job:** Read the provided files, understand the implementation, write tests, run them, and return a structured report.
+
+## What You Must Do
+
+1. **Load** the `testing-strategies` skill immediately
+2. **Read** every file listed in the input to understand the implementation
+3. **Identify** the test framework and conventions used in the project (check `package.json`, `pyproject.toml`, `Cargo.toml`, `go.mod`, existing test files)
+4. **Detect** the project's test organization pattern (co-located, dedicated directory, or mixed)
+5. **Write** unit tests for all new or modified public functions/classes
+6. **Run** the test suite to verify:
+   - Your new tests pass
+   - Existing tests are not broken
+7. **Report** results in the structured format below
+
+## What You Must Return
+
+Return a structured report in this **exact format**:
+
+```
+### Test Results Summary
+- **Tests written**: [count] new tests across [count] files
+- **Tests passing**: [count]/[count]
+- **Coverage**: [percentage or "unable to determine"]
+- **Critical gaps**: [list of untested critical paths, or "none"]
+
+### Files Created/Modified
+- `path/to/test/file1.test.ts` — [what it tests]
+- `path/to/test/file2.test.ts` — [what it tests]
+
+### Issues Found
+- [BLOCKING] Description of any test that reveals a bug in the implementation
+- [WARNING] Description of any coverage gap or test quality concern
+- [INFO] Suggestions for additional test coverage
+```
+
+The orchestrating agent will use **BLOCKING** issues to decide whether to proceed with finalization.
+
+## Core Principles
+
+- Write tests that serve as documentation — a new developer should understand the feature by reading the tests
+- Test behavior, not implementation details — tests should survive refactoring
+- Use appropriate testing levels (unit, integration, e2e)
+- Maintain high test coverage on critical paths
+- Make tests fast, deterministic, and isolated
+- Follow AAA pattern (Arrange, Act, Assert)
+- One logical assertion per test (multiple `expect` calls are fine if they verify one behavior)
+
+## Testing Pyramid
+
+### Unit Tests (70%)
+- Test individual functions/classes in isolation
+- Mock external dependencies (I/O, network, database)
+- Fast execution (< 10ms per test)
+- High coverage on business logic, validation, and transformations
+- Test edge cases: empty inputs, boundary values, error conditions, null/undefined
+
+### Integration Tests (20%)
+- Test component interactions and data flow between layers
+- Use real database (test instance) or realistic fakes
+- Test API endpoints with real middleware chains
+- Verify serialization/deserialization roundtrips
+- Test error propagation across boundaries
+
+### E2E Tests (10%)
+- Test complete user workflows end-to-end
+- Use real browser (Playwright/Cypress) or HTTP client
+- Critical happy paths only — not exhaustive
+- Most realistic but slowest and most brittle
+
+## Test Organization
+
+Follow the project's existing convention. If no convention exists, prefer:
+
+- **Co-located unit tests**: `src/utils/shell.test.ts` alongside `src/utils/shell.ts`
+- **Dedicated integration directory**: `tests/integration/` or `test/integration/`
+- **E2E directory**: `tests/e2e/`, `e2e/`, or `cypress/`
+
+## Coverage Goals
+
+| Code Area | Minimum | Target |
+|-----------|---------|--------|
+| Business logic / domain | 85% | 95% |
+| API routes / controllers | 75% | 85% |
+| UI components | 65% | 80% |
+| Utilities / helpers | 80% | 90% |
+| Configuration / glue code | 50% | 70% |
````
package/.opencode/skills/data-engineering/SKILL.md (new file)

````diff
@@ -0,0 +1,221 @@
+---
+name: data-engineering
+description: ETL pipelines, data validation, streaming patterns, message queues, and data partitioning strategies
+license: Apache-2.0
+compatibility: opencode
+---
+
+# Data Engineering Skill
+
+This skill provides patterns for building reliable data pipelines, processing large datasets, and managing data infrastructure.
+
+## When to Use
+
+Use this skill when:
+- Designing ETL/ELT pipelines
+- Implementing data validation and schema enforcement
+- Working with message queues (Kafka, RabbitMQ, SQS)
+- Building streaming data processing systems
+- Designing data partitioning and sharding strategies
+- Handling batch vs real-time data processing
+
+## ETL Pipeline Design
+
+### Batch vs Streaming
+
+| Aspect | Batch Processing | Stream Processing |
+|--------|-----------------|-------------------|
+| **Latency** | Minutes to hours | Milliseconds to seconds |
+| **Data volume** | Large datasets at once | Continuous flow |
+| **Complexity** | Simpler error handling | Complex state management |
+| **Use cases** | Reports, analytics, migrations | Real-time dashboards, alerts, events |
+| **Tools** | Airflow, dbt, Spark | Kafka Streams, Flink, Pulsar |
+
+### ETL vs ELT
+
+| Pattern | When to Use |
+|---------|-------------|
+| **ETL** (Extract → Transform → Load) | Data warehouse with strict schema, transform before loading |
+| **ELT** (Extract → Load → Transform) | Cloud data lakes, transform after loading using SQL/Spark |
+
+### Pipeline Design Principles
+- **Idempotency** — Running the same pipeline twice produces the same result
+- **Incremental processing** — Process only new/changed data, not full reloads
+- **Schema evolution** — Handle schema changes gracefully (add columns, not remove)
+- **Backfill capability** — Ability to reprocess historical data
+- **Monitoring** — Track pipeline health, data quality, processing lag
+
+### Pipeline Architecture
+
+```
+Source → Extract → Validate → Transform → Load → Verify
+                       ↓
+            Dead Letter Queue (failed records)
+```
+
+### Error Handling Strategies
+- **Skip and log** — Log bad records, continue processing (good for analytics)
+- **Dead letter queue** — Route failures to a separate queue for manual review
+- **Fail fast** — Stop pipeline on first error (good for critical data)
+- **Retry with backoff** — Retry transient errors with exponential backoff
+
+## Data Validation
+
+### Schema Enforcement
+- Validate data types, required fields, and constraints at ingestion
+- Use schema registries (Avro, Protobuf, JSON Schema) for contract enforcement
+- Version schemas — never break backward compatibility
+
+### Validation Layers
+
+| Layer | What to Check | Example |
+|-------|---------------|---------|
+| **Structural** | Schema conformance, types, required fields | Missing `email` field, wrong type |
+| **Semantic** | Business rules, value ranges, relationships | Age < 0, end_date before start_date |
+| **Referential** | Foreign key integrity, cross-dataset consistency | Order references non-existent customer |
+| **Statistical** | Distribution anomalies, volume checks | 10x fewer records than yesterday |
+
+### Data Quality Dimensions
+- **Completeness** — Are all required fields populated?
+- **Accuracy** — Do values reflect reality?
+- **Consistency** — Are the same facts represented the same way?
+- **Timeliness** — Is data available when needed?
+- **Uniqueness** — Are there duplicate records?
+
+## Idempotency Patterns
+
+### Why Idempotency Matters
+Pipelines fail and retry. Without idempotency, retries cause:
+- Duplicate records in the target
+- Incorrect aggregations (double-counting)
+- Inconsistent state
+
+### Patterns
+
+| Pattern | How It Works | Trade-off |
+|---------|-------------|-----------|
+| **Upsert (MERGE)** | Insert or update based on key | Requires natural/business key |
+| **Delete + Insert** | Delete partition, then insert | Simple but risky window of missing data |
+| **Deduplication** | Assign unique IDs, deduplicate at read or write | Extra storage for IDs |
+| **Exactly-once semantics** | Transactional writes with offset tracking | Complex, framework-dependent |
+| **Tombstone + Compact** | Write delete markers, compact later | Kafka log compaction pattern |
+
+### Idempotency Keys
+- Use deterministic IDs: `hash(source + key + timestamp)`
+- Store processing watermarks: "last processed offset/timestamp"
+- Use database transactions: read offset + write data atomically
+
+## Message Queue Patterns
+
+### When to Use Which
+
+| Queue | Best For | Key Feature |
+|-------|----------|-------------|
+| **Kafka** | High-throughput event streaming, log-based | Durable, ordered, replayable |
+| **RabbitMQ** | Task queues, RPC, complex routing | Flexible routing, acknowledgments |
+| **SQS** | Simple cloud-native queuing | Managed, auto-scaling, no ops |
+| **Redis Streams** | Lightweight streaming with existing Redis | Low latency, familiar API |
+| **NATS** | High-performance pub/sub | Ultra-low latency, cloud-native |
+
+### Consumer Patterns
+
+| Pattern | Description | Use Case |
+|---------|-------------|----------|
+| **Competing consumers** | Multiple consumers share a queue | Parallel task processing |
+| **Fan-out** | One message delivered to all consumers | Event notifications |
+| **Consumer groups** | Partitioned consumption across group members | Kafka-style parallel processing |
+| **Request-reply** | Send request, await response on reply queue | Async RPC |
+
+### Delivery Guarantees
+
+| Guarantee | Meaning | Trade-off |
+|-----------|---------|-----------|
+| **At-most-once** | Message may be lost, never duplicated | Fastest, lossy |
+| **At-least-once** | Message never lost, may be duplicated | Requires idempotent consumers |
+| **Exactly-once** | Message processed exactly once | Complex, performance overhead |
+
+### Backpressure Handling
+- **Bounded queues** — Reject/block producers when queue is full
+- **Rate limiting** — Limit consumer processing rate
+- **Circuit breaker** — Stop consuming when downstream is unhealthy
+- **Autoscaling** — Add consumers when queue depth exceeds threshold
+
+## Streaming Patterns
+
+### Windowing
+
+| Window Type | Description | Use Case |
+|-------------|-------------|----------|
+| **Tumbling** | Fixed-size, non-overlapping | Hourly aggregation |
+| **Sliding** | Fixed-size, overlapping | Moving average |
+| **Session** | Gap-based, variable size | User session activity |
+| **Global** | All events in one window | Running totals |
+
+### Event Time vs Processing Time
+- **Event time** — When the event actually occurred (embedded in data)
+- **Processing time** — When the system processes the event
+- **Watermarks** — Track progress through event time, handle late arrivals
+- Always prefer event time for correctness; use processing time only for real-time approximation
+
+### Stateful Stream Processing
+- **State stores** — Local key-value stores for aggregations, joins
+- **Changelog topics** — Back up state to Kafka for fault tolerance
+- **State checkpointing** — Periodic snapshots for recovery (Flink pattern)
+
+## Data Partitioning & Sharding
+
+### Partitioning Strategies
+
+| Strategy | How | Best For |
+|----------|-----|----------|
+| **Range partitioning** | Partition by value range (date, ID range) | Time-series data, sequential access |
+| **Hash partitioning** | Hash key modulo partition count | Even distribution, point lookups |
+| **List partitioning** | Partition by discrete values (country, region) | Known categories, geographic data |
+| **Composite** | Combine strategies (hash + range) | Multi-tenant time-series |
+
+### Partition Key Selection
+- Choose keys with **high cardinality** (many distinct values)
+- Avoid **hot partitions** (one key getting disproportionate traffic)
+- Consider **query patterns** — partition by how data is most often read
+- Plan for **partition growth** — avoid partition count that requires redistribution
+
+### Sharding Considerations
+- **Shard key immutability** — Changing a shard key requires data migration
+- **Cross-shard queries** — Avoid joins across shards (denormalize instead)
+- **Rebalancing** — Use consistent hashing to minimize data movement
+- **Shard splitting** — Plan for splitting hot shards without downtime
+
+## Data Pipeline Tools
+
+### Orchestration
+- **Airflow** — DAG-based workflow orchestration (Python)
+- **Dagster** — Software-defined assets, strong typing
+- **Prefect** — Python-native, dynamic workflows
+- **Temporal** — Durable execution for long-running pipelines
+
+### Transformation
+- **dbt** — SQL-based transformations in the warehouse
+- **Spark** — Distributed processing for large datasets
+- **Pandas/Polars** — Single-machine data transformation
+- **Flink** — Stream and batch processing (JVM)
+
+### Storage
+- **Data Lake** — Raw, unstructured (S3, GCS, ADLS)
+- **Data Warehouse** — Structured, optimized for analytics (BigQuery, Snowflake, Redshift)
+- **Data Lakehouse** — Combines both (Delta Lake, Iceberg, Hudi)
+
+## Checklist
+
+When building a data pipeline:
+- [ ] Idempotent operations — safe to retry without side effects
+- [ ] Schema validation at ingestion boundary
+- [ ] Dead letter queue for failed records
+- [ ] Monitoring: processing lag, error rate, throughput
+- [ ] Backfill capability — can reprocess historical data
+- [ ] Incremental processing — not full reloads on every run
+- [ ] Data quality checks after transformation
+- [ ] Partition strategy aligned with query patterns
+- [ ] Exactly-once or at-least-once with idempotent consumers
+- [ ] Schema evolution plan (backward compatible changes)
+- [ ] Alerting on pipeline failures and data quality anomalies
+- [ ] Documentation of data lineage and transformation logic
````
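The skill's `hash(source + key + timestamp)` idempotency-key bullet can be sketched in a few lines. Assumptions: Node's `crypto` for SHA-256, and a `Map` standing in for the target store; both are illustrations, not package code.

```typescript
import { createHash } from "node:crypto";

// Deterministic record ID: the same source event always produces the same
// key, so a retried load overwrites instead of duplicating.
function idempotencyKey(source: string, businessKey: string, eventTime: string): string {
  return createHash("sha256")
    // NUL-delimited so ("a", "bc") and ("ab", "c") cannot collide
    .update(`${source}\u0000${businessKey}\u0000${eventTime}`)
    .digest("hex");
}

// Upsert keyed by the idempotency key: running the load twice
// leaves exactly one copy of each record in the target.
function load(store: Map<string, unknown>, source: string, key: string, ts: string, record: unknown): void {
  store.set(idempotencyKey(source, key, ts), record);
}
```

A real pipeline would use the same key in a SQL `MERGE`/upsert or a Kafka compacted topic; the `Map` only demonstrates the retry-safety property.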