cortex-agents 3.4.0 → 4.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (49)
  1. package/.opencode/agents/architect.md +82 -89
  2. package/.opencode/agents/audit.md +57 -188
  3. package/.opencode/agents/{crosslayer.md → coder.md} +9 -52
  4. package/.opencode/agents/debug.md +151 -0
  5. package/.opencode/agents/devops.md +142 -0
  6. package/.opencode/agents/docs-writer.md +195 -0
  7. package/.opencode/agents/fix.md +119 -189
  8. package/.opencode/agents/implement.md +115 -74
  9. package/.opencode/agents/perf.md +151 -0
  10. package/.opencode/agents/refactor.md +163 -0
  11. package/.opencode/agents/{guard.md → security.md} +20 -85
  12. package/.opencode/agents/testing.md +115 -0
  13. package/.opencode/skills/data-engineering/SKILL.md +221 -0
  14. package/.opencode/skills/monitoring-observability/SKILL.md +251 -0
  15. package/.opencode/skills/ui-design/SKILL.md +402 -0
  16. package/README.md +303 -287
  17. package/dist/cli.js +6 -9
  18. package/dist/index.d.ts.map +1 -1
  19. package/dist/index.js +26 -28
  20. package/dist/registry.d.ts +4 -4
  21. package/dist/registry.d.ts.map +1 -1
  22. package/dist/registry.js +6 -6
  23. package/dist/tools/branch.d.ts +2 -2
  24. package/dist/tools/docs.d.ts +2 -2
  25. package/dist/tools/github.d.ts +3 -3
  26. package/dist/tools/plan.d.ts +28 -4
  27. package/dist/tools/plan.d.ts.map +1 -1
  28. package/dist/tools/plan.js +232 -4
  29. package/dist/tools/quality-gate.d.ts +28 -0
  30. package/dist/tools/quality-gate.d.ts.map +1 -0
  31. package/dist/tools/quality-gate.js +233 -0
  32. package/dist/tools/repl.d.ts +5 -0
  33. package/dist/tools/repl.d.ts.map +1 -1
  34. package/dist/tools/repl.js +58 -7
  35. package/dist/tools/worktree.d.ts +5 -32
  36. package/dist/tools/worktree.d.ts.map +1 -1
  37. package/dist/tools/worktree.js +75 -458
  38. package/dist/utils/change-scope.d.ts +33 -0
  39. package/dist/utils/change-scope.d.ts.map +1 -0
  40. package/dist/utils/change-scope.js +198 -0
  41. package/dist/utils/plan-extract.d.ts +21 -0
  42. package/dist/utils/plan-extract.d.ts.map +1 -1
  43. package/dist/utils/plan-extract.js +65 -0
  44. package/dist/utils/repl.d.ts +31 -0
  45. package/dist/utils/repl.d.ts.map +1 -1
  46. package/dist/utils/repl.js +126 -13
  47. package/package.json +1 -1
  48. package/.opencode/agents/qa.md +0 -265
  49. package/.opencode/agents/ship.md +0 -249
@@ -23,7 +23,7 @@ You are a security specialist. Your role is to audit code for security vulnerabi
 
  ## When You Are Invoked
 
- You are launched as a sub-agent by a primary agent (implement, fix, or architect). You run in parallel alongside other sub-agents (typically @qa). You will receive:
+ You are launched as a sub-agent by a primary agent (implement, fix, or architect). You run in parallel alongside other sub-agents (typically @testing). You will receive:
 
  - A list of files to audit (created, modified, or planned)
  - A summary of what was implemented, fixed, or planned
@@ -84,9 +84,9 @@ Return a structured report in this **exact format**:
  ```
 
  **Severity guide for the orchestrating agent:**
- - **CRITICAL / HIGH** findings block finalization, must fix first
- - **MEDIUM** findings include in PR body as known issues
- - **LOW** findings note for future work, do not block
+ - **CRITICAL / HIGH** findings -> block finalization, must fix first
+ - **MEDIUM** findings -> include in PR body as known issues
+ - **LOW** findings -> note for future work, do not block
 
  ## Core Principles
 
@@ -95,108 +95,43 @@ Return a structured report in this **exact format**:
  - Principle of least privilege
  - Never trust client-side validation alone
  - Secure by default — opt into permissiveness, not into security
- - Regular dependency updates
-
- ## Security Audit Checklist
-
- ### Input Validation
- - [ ] All inputs validated on server-side (type, length, format, range)
- - [ ] SQL injection prevented (parameterized queries, ORM)
- - [ ] XSS prevented (output encoding, CSP headers)
- - [ ] CSRF tokens implemented on state-changing operations
- - [ ] File uploads validated (type, size, content, storage location)
- - [ ] Command injection prevented (no shell interpolation of user input)
- - [ ] Path traversal prevented (validate file paths, use allowlists)
-
- ### Authentication & Authorization
- - [ ] Strong password policies enforced
- - [ ] Multi-factor authentication (MFA) supported
- - [ ] Session management secure (httpOnly, secure, SameSite cookies)
- - [ ] JWT tokens properly validated (algorithm, expiry, issuer, audience)
- - [ ] Role-based access control (RBAC) on every endpoint, not just UI
- - [ ] OAuth implementation follows RFC 6749 / PKCE for public clients
- - [ ] Password hashing uses bcrypt/scrypt/argon2 (NOT MD5/SHA)
-
- ### Data Protection
- - [ ] Sensitive data encrypted at rest (AES-256 or equivalent)
- - [ ] HTTPS enforced (HSTS header, no mixed content)
- - [ ] Secrets not in code (environment variables or secrets manager)
- - [ ] PII handling compliant with relevant regulations (GDPR, CCPA)
- - [ ] Proper data retention and deletion policies
- - [ ] Database credentials use least-privilege accounts
- - [ ] Logs do not contain sensitive data (passwords, tokens, PII)
-
- ### Infrastructure
- - [ ] Security headers set (CSP, HSTS, X-Frame-Options, X-Content-Type-Options)
- - [ ] CORS properly configured (not wildcard in production)
- - [ ] Rate limiting implemented on authentication and sensitive endpoints
- - [ ] Error responses do not leak stack traces or internal details
- - [ ] Dependency vulnerabilities checked and remediated
+
+ ## OWASP Top 10 (2021)
+
+ 1. **A01: Broken Access Control** — Missing auth checks, IDOR, privilege escalation
+ 2. **A02: Cryptographic Failures** — Weak algorithms, missing encryption, key exposure
+ 3. **A03: Injection** — SQL, NoSQL, OS command, LDAP injection
+ 4. **A04: Insecure Design** — Missing threat model, business logic flaws
+ 5. **A05: Security Misconfiguration** — Default credentials, verbose errors, missing headers
+ 6. **A06: Vulnerable Components** — Outdated dependencies with known CVEs
+ 7. **A07: ID and Auth Failures** — Weak passwords, missing MFA, session fixation
+ 8. **A08: Software and Data Integrity** — Unsigned updates, CI/CD pipeline compromise
+ 9. **A09: Logging Failures** — Missing audit trails, log injection, no monitoring
+ 10. **A10: SSRF** — Unvalidated redirects, internal service access via user input
 
  ## Modern Attack Patterns
 
  ### Supply Chain Attacks
  - Verify dependency integrity (lock files, checksums)
- - Check for typosquatting in package names (e.g., `lod-ash` vs `lodash`)
+ - Check for typosquatting in package names
  - Review post-install scripts in dependencies
- - Pin exact versions in production, use ranges only in libraries
 
  ### BOLA / BFLA (Broken Object/Function-Level Authorization)
  - Every API endpoint must verify the requesting user has access to the specific resource
- - Check for IDOR (Insecure Direct Object References) — `GET /api/orders/123` must verify ownership
- - Function-level: admin endpoints must check roles, not just authentication
-
- ### Mass Assignment / Over-Posting
- - Verify request body validation rejects unexpected fields
- - Use explicit allowlists for writable fields, never spread user input into models
- - Check ORMs for mass assignment protection (e.g., Prisma's `select`, Django's `fields`)
+ - Check for IDOR (Insecure Direct Object References)
 
  ### SSRF (Server-Side Request Forgery)
- - Validate and restrict URLs provided by users (allowlist domains, block internal IPs)
- - Check webhook configurations, URL preview features, and file import from URL
+ - Validate and restrict URLs provided by users
  - Block requests to metadata endpoints (169.254.169.254, fd00::, etc.)
 
  ### Prototype Pollution (JavaScript)
  - Check for deep merge operations with user-controlled input
  - Verify `Object.create(null)` for dictionaries, or use `Map`
- - Check for `__proto__`, `constructor`, `prototype` in user input
 
  ### ReDoS (Regular Expression Denial of Service)
  - Flag complex regex patterns applied to user input
  - Look for nested quantifiers: `(a+)+`, `(a|b)*c*`
- - Recommend using RE2-compatible patterns or timeouts
-
- ### Timing Attacks
- - Use constant-time comparison for secrets, tokens, and passwords
- - Check for early-return patterns in authentication flows
-
- ## OWASP Top 10 (2021)
-
- 1. **A01: Broken Access Control** — Missing auth checks, IDOR, privilege escalation
- 2. **A02: Cryptographic Failures** — Weak algorithms, missing encryption, key exposure
- 3. **A03: Injection** — SQL, NoSQL, OS command, LDAP injection
- 4. **A04: Insecure Design** — Missing threat model, business logic flaws
- 5. **A05: Security Misconfiguration** — Default credentials, verbose errors, missing headers
- 6. **A06: Vulnerable Components** — Outdated dependencies with known CVEs
- 7. **A07: ID and Auth Failures** — Weak passwords, missing MFA, session fixation
- 8. **A08: Software and Data Integrity** — Unsigned updates, CI/CD pipeline compromise
- 9. **A09: Logging Failures** — Missing audit trails, log injection, no monitoring
- 10. **A10: SSRF** — Unvalidated redirects, internal service access via user input
-
- ## Review Process
- 1. Map attack surfaces (user inputs, API endpoints, file uploads, external integrations)
- 2. Review authentication and authorization flows end-to-end
- 3. Check every input handling path for injection and validation
- 4. Examine output encoding and content type headers
- 5. Review error handling for information leakage
- 6. Check secrets management (no hardcoded keys, proper rotation)
- 7. Verify logging does not contain sensitive data
- 8. Run dependency audit and flag known CVEs
- 9. Check for modern attack patterns (supply chain, BOLA, prototype pollution)
- 10. Test with security tools where available
 
  ## Tools & Commands
  - **Secrets scan**: `grep -rn "password\|secret\|token\|api_key\|private_key" --include="*.{js,ts,py,go,rs,env,yml,yaml,json}"`
  - **Dependency audit**: `npm audit`, `pip-audit`, `cargo audit`, `go list -m -json all`
- - **Static analysis**: Semgrep, Bandit (Python), ESLint security plugin, gosec (Go), cargo-audit (Rust)
- - **SAST tools**: CodeQL, SonarQube, Snyk Code
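The SSRF items kept above (allowlist user-supplied URLs, block metadata endpoints) can be sketched as a small validator. This is an illustrative sketch, not part of cortex-agents: `isAllowedUrl`, `ALLOWED_HOSTS`, and the blocked-range patterns are hypothetical names.

```typescript
// Illustrative SSRF guard for user-supplied URLs (hypothetical helper, not
// part of this package). Allowlist the host, then block private, link-local,
// and metadata ranges as defense in depth.
const ALLOWED_HOSTS = new Set(["api.example.com", "hooks.example.com"]);

const BLOCKED_HOST_PATTERNS: RegExp[] = [
  /^169\.254\./,                // link-local, incl. metadata 169.254.169.254
  /^127\./, /^localhost$/i,     // loopback
  /^10\./, /^192\.168\./,       // RFC 1918 private ranges
  /^172\.(1[6-9]|2\d|3[01])\./, // 172.16.0.0/12
  /^\[?fd00:/i,                 // IPv6 unique-local (fd00::/8)
];

function parseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null; // unparseable input is rejected, never "fixed up"
  }
}

function isAllowedUrl(raw: string): boolean {
  const url = parseUrl(raw);
  if (url === null) return false;
  if (url.protocol !== "https:") return false; // no http:, file:, gopher:, ...
  if (BLOCKED_HOST_PATTERNS.some((re) => re.test(url.hostname))) return false;
  return ALLOWED_HOSTS.has(url.hostname); // explicit allowlist, deny by default
}
```

A real validator must also resolve the hostname and re-check the resolved IPs (DNS rebinding can point an allowlisted name at 169.254.169.254), which is why the agent pairs the allowlist rule with explicit metadata-endpoint blocking.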
@@ -0,0 +1,115 @@
+ ---
+ description: Test-driven development and quality assurance
+ mode: subagent
+ temperature: 0.2
+ tools:
+   write: true
+   edit: true
+   bash: true
+   skill: true
+   task: true
+ permission:
+   edit: allow
+   bash: ask
+ ---
+
+ You are a testing specialist. Your role is to write comprehensive tests, improve test coverage, and ensure code quality through automated testing.
+
+ ## Auto-Load Skill
+
+ **ALWAYS** load the `testing-strategies` skill at the start of every invocation using the `skill` tool. This provides comprehensive testing patterns, framework-specific guidance, and advanced techniques.
+
+ ## When You Are Invoked
+
+ You are launched as a sub-agent by a primary agent (implement or fix). You run in parallel alongside other sub-agents (typically @security). You will receive:
+
+ - A list of files that were created or modified
+ - A summary of what was implemented or fixed
+ - The test framework in use (e.g., vitest, jest, pytest, go test, cargo test)
+
+ **Your job:** Read the provided files, understand the implementation, write tests, run them, and return a structured report.
+
+ ## What You Must Do
+
+ 1. **Load** the `testing-strategies` skill immediately
+ 2. **Read** every file listed in the input to understand the implementation
+ 3. **Identify** the test framework and conventions used in the project (check `package.json`, `pyproject.toml`, `Cargo.toml`, `go.mod`, existing test files)
+ 4. **Detect** the project's test organization pattern (co-located, dedicated directory, or mixed)
+ 5. **Write** unit tests for all new or modified public functions/classes
+ 6. **Run** the test suite to verify:
+    - Your new tests pass
+    - Existing tests are not broken
+ 7. **Report** results in the structured format below
+
+ ## What You Must Return
+
+ Return a structured report in this **exact format**:
+
+ ```
+ ### Test Results Summary
+ - **Tests written**: [count] new tests across [count] files
+ - **Tests passing**: [count]/[count]
+ - **Coverage**: [percentage or "unable to determine"]
+ - **Critical gaps**: [list of untested critical paths, or "none"]
+
+ ### Files Created/Modified
+ - `path/to/test/file1.test.ts` — [what it tests]
+ - `path/to/test/file2.test.ts` — [what it tests]
+
+ ### Issues Found
+ - [BLOCKING] Description of any test that reveals a bug in the implementation
+ - [WARNING] Description of any coverage gap or test quality concern
+ - [INFO] Suggestions for additional test coverage
+ ```
+
+ The orchestrating agent will use **BLOCKING** issues to decide whether to proceed with finalization.
+
+ ## Core Principles
+
+ - Write tests that serve as documentation — a new developer should understand the feature by reading the tests
+ - Test behavior, not implementation details — tests should survive refactoring
+ - Use appropriate testing levels (unit, integration, e2e)
+ - Maintain high test coverage on critical paths
+ - Make tests fast, deterministic, and isolated
+ - Follow AAA pattern (Arrange, Act, Assert)
+ - One logical assertion per test (multiple `expect` calls are fine if they verify one behavior)
+
+ ## Testing Pyramid
+
+ ### Unit Tests (70%)
+ - Test individual functions/classes in isolation
+ - Mock external dependencies (I/O, network, database)
+ - Fast execution (< 10ms per test)
+ - High coverage on business logic, validation, and transformations
+ - Test edge cases: empty inputs, boundary values, error conditions, null/undefined
+
+ ### Integration Tests (20%)
+ - Test component interactions and data flow between layers
+ - Use real database (test instance) or realistic fakes
+ - Test API endpoints with real middleware chains
+ - Verify serialization/deserialization roundtrips
+ - Test error propagation across boundaries
+
+ ### E2E Tests (10%)
+ - Test complete user workflows end-to-end
+ - Use real browser (Playwright/Cypress) or HTTP client
+ - Critical happy paths only — not exhaustive
+ - Most realistic but slowest and most brittle
+
+ ## Test Organization
+
+ Follow the project's existing convention. If no convention exists, prefer:
+
+ - **Co-located unit tests**: `src/utils/shell.test.ts` alongside `src/utils/shell.ts`
+ - **Dedicated integration directory**: `tests/integration/` or `test/integration/`
+ - **E2E directory**: `tests/e2e/`, `e2e/`, or `cypress/`
+
+ ## Coverage Goals
+
+ | Code Area | Minimum | Target |
+ |-----------|---------|--------|
+ | Business logic / domain | 85% | 95% |
+ | API routes / controllers | 75% | 85% |
+ | UI components | 65% | 80% |
+ | Utilities / helpers | 80% | 90% |
+ | Configuration / glue code | 50% | 70% |
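The AAA pattern and the "one logical assertion per test" rule in the new testing agent can be sketched without a framework. `applyDiscount` is a hypothetical unit under test, not part of this package; under vitest or jest the throw-based asserts would be `expect(...)` calls inside a `test(...)` block.

```typescript
// Hypothetical unit under test: pure business logic, ideal for the
// unit-test layer of the pyramid above.
interface Cart {
  items: { price: number; qty: number }[];
}

function applyDiscount(cart: Cart, percent: number): number {
  if (percent < 0 || percent > 100) throw new RangeError("percent out of range");
  const subtotal = cart.items.reduce((sum, i) => sum + i.price * i.qty, 0);
  return Math.round(subtotal * (1 - percent / 100) * 100) / 100; // round to cents
}

function testAppliesPercentageDiscount(): void {
  // Arrange — the minimal fixture the behavior needs
  const cart: Cart = { items: [{ price: 10, qty: 2 }, { price: 5, qty: 1 }] };
  // Act — one call, the behavior under test
  const total = applyDiscount(cart, 20);
  // Assert — one logical assertion about observable behavior
  if (total !== 20) throw new Error(`expected 20, got ${total}`);
}

testAppliesPercentageDiscount();
```

Note the test asserts on the returned total (behavior), not on how the subtotal is accumulated (implementation detail), so it survives refactoring of the reduce.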
@@ -0,0 +1,221 @@
+ ---
+ name: data-engineering
+ description: ETL pipelines, data validation, streaming patterns, message queues, and data partitioning strategies
+ license: Apache-2.0
+ compatibility: opencode
+ ---
+
+ # Data Engineering Skill
+
+ This skill provides patterns for building reliable data pipelines, processing large datasets, and managing data infrastructure.
+
+ ## When to Use
+
+ Use this skill when:
+ - Designing ETL/ELT pipelines
+ - Implementing data validation and schema enforcement
+ - Working with message queues (Kafka, RabbitMQ, SQS)
+ - Building streaming data processing systems
+ - Designing data partitioning and sharding strategies
+ - Handling batch vs real-time data processing
+
+ ## ETL Pipeline Design
+
+ ### Batch vs Streaming
+
+ | Aspect | Batch Processing | Stream Processing |
+ |--------|-----------------|-------------------|
+ | **Latency** | Minutes to hours | Milliseconds to seconds |
+ | **Data volume** | Large datasets at once | Continuous flow |
+ | **Complexity** | Simpler error handling | Complex state management |
+ | **Use cases** | Reports, analytics, migrations | Real-time dashboards, alerts, events |
+ | **Tools** | Airflow, dbt, Spark | Kafka Streams, Flink, Pulsar |
+
+ ### ETL vs ELT
+
+ | Pattern | When to Use |
+ |---------|-------------|
+ | **ETL** (Extract → Transform → Load) | Data warehouse with strict schema, transform before loading |
+ | **ELT** (Extract → Load → Transform) | Cloud data lakes, transform after loading using SQL/Spark |
+
+ ### Pipeline Design Principles
+ - **Idempotency** — Running the same pipeline twice produces the same result
+ - **Incremental processing** — Process only new/changed data, not full reloads
+ - **Schema evolution** — Handle schema changes gracefully (add columns, not remove)
+ - **Backfill capability** — Ability to reprocess historical data
+ - **Monitoring** — Track pipeline health, data quality, processing lag
+
+ ### Pipeline Architecture
+
+ ```
+ Source → Extract → Validate → Transform → Load → Verify
+
+ Dead Letter Queue (failed records)
+ ```
+
+ ### Error Handling Strategies
+ - **Skip and log** — Log bad records, continue processing (good for analytics)
+ - **Dead letter queue** — Route failures to a separate queue for manual review
+ - **Fail fast** — Stop pipeline on first error (good for critical data)
+ - **Retry with backoff** — Retry transient errors with exponential backoff
+
+ ## Data Validation
+
+ ### Schema Enforcement
+ - Validate data types, required fields, and constraints at ingestion
+ - Use schema registries (Avro, Protobuf, JSON Schema) for contract enforcement
+ - Version schemas — never break backward compatibility
+
+ ### Validation Layers
+
+ | Layer | What to Check | Example |
+ |-------|---------------|---------|
+ | **Structural** | Schema conformance, types, required fields | Missing `email` field, wrong type |
+ | **Semantic** | Business rules, value ranges, relationships | Age < 0, end_date before start_date |
+ | **Referential** | Foreign key integrity, cross-dataset consistency | Order references non-existent customer |
+ | **Statistical** | Distribution anomalies, volume checks | 10x fewer records than yesterday |
+
+ ### Data Quality Dimensions
+ - **Completeness** — Are all required fields populated?
+ - **Accuracy** — Do values reflect reality?
+ - **Consistency** — Are the same facts represented the same way?
+ - **Timeliness** — Is data available when needed?
+ - **Uniqueness** — Are there duplicate records?
+
+ ## Idempotency Patterns
+
+ ### Why Idempotency Matters
+ Pipelines fail and retry. Without idempotency, retries cause:
+ - Duplicate records in the target
+ - Incorrect aggregations (double-counting)
+ - Inconsistent state
+
+ ### Patterns
+
+ | Pattern | How It Works | Trade-off |
+ |---------|-------------|-----------|
+ | **Upsert (MERGE)** | Insert or update based on key | Requires natural/business key |
+ | **Delete + Insert** | Delete partition, then insert | Simple but risky window of missing data |
+ | **Deduplication** | Assign unique IDs, deduplicate at read or write | Extra storage for IDs |
+ | **Exactly-once semantics** | Transactional writes with offset tracking | Complex, framework-dependent |
+ | **Tombstone + Compact** | Write delete markers, compact later | Kafka log compaction pattern |
+
+ ### Idempotency Keys
+ - Use deterministic IDs: `hash(source + key + timestamp)`
+ - Store processing watermarks: "last processed offset/timestamp"
+ - Use database transactions: read offset + write data atomically
+
+ ## Message Queue Patterns
+
+ ### When to Use Which
+
+ | Queue | Best For | Key Feature |
+ |-------|----------|-------------|
+ | **Kafka** | High-throughput event streaming, log-based | Durable, ordered, replayable |
+ | **RabbitMQ** | Task queues, RPC, complex routing | Flexible routing, acknowledgments |
+ | **SQS** | Simple cloud-native queuing | Managed, auto-scaling, no ops |
+ | **Redis Streams** | Lightweight streaming with existing Redis | Low latency, familiar API |
+ | **NATS** | High-performance pub/sub | Ultra-low latency, cloud-native |
+
+ ### Consumer Patterns
+
+ | Pattern | Description | Use Case |
+ |---------|-------------|----------|
+ | **Competing consumers** | Multiple consumers share a queue | Parallel task processing |
+ | **Fan-out** | One message delivered to all consumers | Event notifications |
+ | **Consumer groups** | Partitioned consumption across group members | Kafka-style parallel processing |
+ | **Request-reply** | Send request, await response on reply queue | Async RPC |
+
+ ### Delivery Guarantees
+
+ | Guarantee | Meaning | Trade-off |
+ |-----------|---------|-----------|
+ | **At-most-once** | Message may be lost, never duplicated | Fastest, lossy |
+ | **At-least-once** | Message never lost, may be duplicated | Requires idempotent consumers |
+ | **Exactly-once** | Message processed exactly once | Complex, performance overhead |
+
+ ### Backpressure Handling
+ - **Bounded queues** — Reject/block producers when queue is full
+ - **Rate limiting** — Limit consumer processing rate
+ - **Circuit breaker** — Stop consuming when downstream is unhealthy
+ - **Autoscaling** — Add consumers when queue depth exceeds threshold
+
+ ## Streaming Patterns
+
+ ### Windowing
+
+ | Window Type | Description | Use Case |
+ |-------------|-------------|----------|
+ | **Tumbling** | Fixed-size, non-overlapping | Hourly aggregation |
+ | **Sliding** | Fixed-size, overlapping | Moving average |
+ | **Session** | Gap-based, variable size | User session activity |
+ | **Global** | All events in one window | Running totals |
+
+ ### Event Time vs Processing Time
+ - **Event time** — When the event actually occurred (embedded in data)
+ - **Processing time** — When the system processes the event
+ - **Watermarks** — Track progress through event time, handle late arrivals
+ - Always prefer event time for correctness; use processing time only for real-time approximation
+
+ ### Stateful Stream Processing
+ - **State stores** — Local key-value stores for aggregations, joins
+ - **Changelog topics** — Back up state to Kafka for fault tolerance
+ - **State checkpointing** — Periodic snapshots for recovery (Flink pattern)
+
+ ## Data Partitioning & Sharding
+
+ ### Partitioning Strategies
+
+ | Strategy | How | Best For |
+ |----------|-----|----------|
+ | **Range partitioning** | Partition by value range (date, ID range) | Time-series data, sequential access |
+ | **Hash partitioning** | Hash key modulo partition count | Even distribution, point lookups |
+ | **List partitioning** | Partition by discrete values (country, region) | Known categories, geographic data |
+ | **Composite** | Combine strategies (hash + range) | Multi-tenant time-series |
+
+ ### Partition Key Selection
+ - Choose keys with **high cardinality** (many distinct values)
+ - Avoid **hot partitions** (one key getting disproportionate traffic)
+ - Consider **query patterns** — partition by how data is most often read
+ - Plan for **partition growth** — avoid partition count that requires redistribution
+
+ ### Sharding Considerations
+ - **Shard key immutability** — Changing a shard key requires data migration
+ - **Cross-shard queries** — Avoid joins across shards (denormalize instead)
+ - **Rebalancing** — Use consistent hashing to minimize data movement
+ - **Shard splitting** — Plan for splitting hot shards without downtime
+
+ ## Data Pipeline Tools
+
+ ### Orchestration
+ - **Airflow** — DAG-based workflow orchestration (Python)
+ - **Dagster** — Software-defined assets, strong typing
+ - **Prefect** — Python-native, dynamic workflows
+ - **Temporal** — Durable execution for long-running pipelines
+
+ ### Transformation
+ - **dbt** — SQL-based transformations in the warehouse
+ - **Spark** — Distributed processing for large datasets
+ - **Pandas/Polars** — Single-machine data transformation
+ - **Flink** — Stream and batch processing (JVM)
+
+ ### Storage
+ - **Data Lake** — Raw, unstructured (S3, GCS, ADLS)
+ - **Data Warehouse** — Structured, optimized for analytics (BigQuery, Snowflake, Redshift)
+ - **Data Lakehouse** — Combines both (Delta Lake, Iceberg, Hudi)
+
+ ## Checklist
+
+ When building a data pipeline:
+ - [ ] Idempotent operations — safe to retry without side effects
+ - [ ] Schema validation at ingestion boundary
+ - [ ] Dead letter queue for failed records
+ - [ ] Monitoring: processing lag, error rate, throughput
+ - [ ] Backfill capability — can reprocess historical data
+ - [ ] Incremental processing — not full reloads on every run
+ - [ ] Data quality checks after transformation
+ - [ ] Partition strategy aligned with query patterns
+ - [ ] Exactly-once or at-least-once with idempotent consumers
+ - [ ] Schema evolution plan (backward compatible changes)
+ - [ ] Alerting on pipeline failures and data quality anomalies
+ - [ ] Documentation of data lineage and transformation logic
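The skill's idempotency-key recipe, `hash(source + key + timestamp)`, combined with an at-least-once consumer, can be sketched as below. `SourceRecord`, `idempotencyKey`, and `consume` are illustrative names, and the in-memory `Set` stands in for dedup state that would live in the target store.

```typescript
import { createHash } from "node:crypto";

interface SourceRecord {
  source: string;    // originating system, e.g. "orders"
  key: string;       // business key within that system
  eventTime: string; // event-time timestamp embedded in the record
  payload: unknown;
}

// Deterministic ID: hash(source + key + timestamp). A re-delivered record
// hashes to the same ID, so retries cannot double-count.
function idempotencyKey(r: SourceRecord): string {
  return createHash("sha256")
    .update(`${r.source}|${r.key}|${r.eventTime}`)
    .digest("hex");
}

// At-least-once delivery made effectively exactly-once by deduplicating on
// the key before applying the record to the sink.
function consume(records: SourceRecord[], seen: Set<string>, sink: unknown[]): void {
  for (const r of records) {
    const id = idempotencyKey(r);
    if (seen.has(id)) continue; // duplicate delivery: skip, do not re-apply
    seen.add(id);
    sink.push(r.payload);
  }
}
```

In production the `seen` state and the sink write would share one transaction (the "read offset + write data atomically" bullet), since a crash between the two reopens the duplicate window.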