agentic-team-templates 0.3.0

Files changed (103)
  1. package/README.md +280 -0
  2. package/bin/cli.js +5 -0
  3. package/package.json +47 -0
  4. package/src/index.js +521 -0
  5. package/templates/_shared/code-quality.md +162 -0
  6. package/templates/_shared/communication.md +114 -0
  7. package/templates/_shared/core-principles.md +62 -0
  8. package/templates/_shared/git-workflow.md +165 -0
  9. package/templates/_shared/security-fundamentals.md +173 -0
  10. package/templates/blockchain/.cursorrules/defi-patterns.md +520 -0
  11. package/templates/blockchain/.cursorrules/gas-optimization.md +339 -0
  12. package/templates/blockchain/.cursorrules/overview.md +130 -0
  13. package/templates/blockchain/.cursorrules/security.md +318 -0
  14. package/templates/blockchain/.cursorrules/smart-contracts.md +364 -0
  15. package/templates/blockchain/.cursorrules/testing.md +415 -0
  16. package/templates/blockchain/.cursorrules/web3-integration.md +538 -0
  17. package/templates/blockchain/CLAUDE.md +389 -0
  18. package/templates/cli-tools/.cursorrules/architecture.md +412 -0
  19. package/templates/cli-tools/.cursorrules/arguments.md +406 -0
  20. package/templates/cli-tools/.cursorrules/distribution.md +546 -0
  21. package/templates/cli-tools/.cursorrules/error-handling.md +455 -0
  22. package/templates/cli-tools/.cursorrules/overview.md +136 -0
  23. package/templates/cli-tools/.cursorrules/testing.md +537 -0
  24. package/templates/cli-tools/.cursorrules/user-experience.md +545 -0
  25. package/templates/cli-tools/CLAUDE.md +356 -0
  26. package/templates/data-engineering/.cursorrules/data-modeling.md +367 -0
  27. package/templates/data-engineering/.cursorrules/data-quality.md +455 -0
  28. package/templates/data-engineering/.cursorrules/overview.md +85 -0
  29. package/templates/data-engineering/.cursorrules/performance.md +339 -0
  30. package/templates/data-engineering/.cursorrules/pipeline-design.md +280 -0
  31. package/templates/data-engineering/.cursorrules/security.md +460 -0
  32. package/templates/data-engineering/.cursorrules/testing.md +452 -0
  33. package/templates/data-engineering/CLAUDE.md +974 -0
  34. package/templates/devops-sre/.cursorrules/capacity-planning.md +653 -0
  35. package/templates/devops-sre/.cursorrules/change-management.md +584 -0
  36. package/templates/devops-sre/.cursorrules/chaos-engineering.md +651 -0
  37. package/templates/devops-sre/.cursorrules/disaster-recovery.md +641 -0
  38. package/templates/devops-sre/.cursorrules/incident-management.md +565 -0
  39. package/templates/devops-sre/.cursorrules/observability.md +714 -0
  40. package/templates/devops-sre/.cursorrules/overview.md +230 -0
  41. package/templates/devops-sre/.cursorrules/postmortems.md +588 -0
  42. package/templates/devops-sre/.cursorrules/runbooks.md +760 -0
  43. package/templates/devops-sre/.cursorrules/slo-sli.md +617 -0
  44. package/templates/devops-sre/.cursorrules/toil-reduction.md +567 -0
  45. package/templates/devops-sre/CLAUDE.md +1007 -0
  46. package/templates/documentation/.cursorrules/adr.md +277 -0
  47. package/templates/documentation/.cursorrules/api-documentation.md +411 -0
  48. package/templates/documentation/.cursorrules/code-comments.md +253 -0
  49. package/templates/documentation/.cursorrules/maintenance.md +260 -0
  50. package/templates/documentation/.cursorrules/overview.md +82 -0
  51. package/templates/documentation/.cursorrules/readme-standards.md +306 -0
  52. package/templates/documentation/CLAUDE.md +120 -0
  53. package/templates/fullstack/.cursorrules/api-contracts.md +331 -0
  54. package/templates/fullstack/.cursorrules/architecture.md +298 -0
  55. package/templates/fullstack/.cursorrules/overview.md +109 -0
  56. package/templates/fullstack/.cursorrules/shared-types.md +348 -0
  57. package/templates/fullstack/.cursorrules/testing.md +386 -0
  58. package/templates/fullstack/CLAUDE.md +349 -0
  59. package/templates/ml-ai/.cursorrules/data-engineering.md +483 -0
  60. package/templates/ml-ai/.cursorrules/deployment.md +601 -0
  61. package/templates/ml-ai/.cursorrules/model-development.md +538 -0
  62. package/templates/ml-ai/.cursorrules/monitoring.md +658 -0
  63. package/templates/ml-ai/.cursorrules/overview.md +131 -0
  64. package/templates/ml-ai/.cursorrules/security.md +637 -0
  65. package/templates/ml-ai/.cursorrules/testing.md +678 -0
  66. package/templates/ml-ai/CLAUDE.md +1136 -0
  67. package/templates/mobile/.cursorrules/navigation.md +246 -0
  68. package/templates/mobile/.cursorrules/offline-first.md +302 -0
  69. package/templates/mobile/.cursorrules/overview.md +71 -0
  70. package/templates/mobile/.cursorrules/performance.md +345 -0
  71. package/templates/mobile/.cursorrules/testing.md +339 -0
  72. package/templates/mobile/CLAUDE.md +233 -0
  73. package/templates/platform-engineering/.cursorrules/ci-cd.md +778 -0
  74. package/templates/platform-engineering/.cursorrules/developer-experience.md +632 -0
  75. package/templates/platform-engineering/.cursorrules/infrastructure-as-code.md +600 -0
  76. package/templates/platform-engineering/.cursorrules/kubernetes.md +710 -0
  77. package/templates/platform-engineering/.cursorrules/observability.md +747 -0
  78. package/templates/platform-engineering/.cursorrules/overview.md +215 -0
  79. package/templates/platform-engineering/.cursorrules/security.md +855 -0
  80. package/templates/platform-engineering/.cursorrules/testing.md +878 -0
  81. package/templates/platform-engineering/CLAUDE.md +850 -0
  82. package/templates/utility-agent/.cursorrules/action-control.md +284 -0
  83. package/templates/utility-agent/.cursorrules/context-management.md +186 -0
  84. package/templates/utility-agent/.cursorrules/hallucination-prevention.md +253 -0
  85. package/templates/utility-agent/.cursorrules/overview.md +78 -0
  86. package/templates/utility-agent/.cursorrules/token-optimization.md +369 -0
  87. package/templates/utility-agent/CLAUDE.md +513 -0
  88. package/templates/web-backend/.cursorrules/api-design.md +255 -0
  89. package/templates/web-backend/.cursorrules/authentication.md +309 -0
  90. package/templates/web-backend/.cursorrules/database-patterns.md +298 -0
  91. package/templates/web-backend/.cursorrules/error-handling.md +366 -0
  92. package/templates/web-backend/.cursorrules/overview.md +69 -0
  93. package/templates/web-backend/.cursorrules/security.md +358 -0
  94. package/templates/web-backend/.cursorrules/testing.md +395 -0
  95. package/templates/web-backend/CLAUDE.md +366 -0
  96. package/templates/web-frontend/.cursorrules/accessibility.md +296 -0
  97. package/templates/web-frontend/.cursorrules/component-patterns.md +204 -0
  98. package/templates/web-frontend/.cursorrules/overview.md +72 -0
  99. package/templates/web-frontend/.cursorrules/performance.md +325 -0
  100. package/templates/web-frontend/.cursorrules/state-management.md +227 -0
  101. package/templates/web-frontend/.cursorrules/styling.md +271 -0
  102. package/templates/web-frontend/.cursorrules/testing.md +311 -0
  103. package/templates/web-frontend/CLAUDE.md +399 -0
`package/templates/data-engineering/.cursorrules/data-quality.md`
@@ -0,0 +1,455 @@
# Data Quality

Patterns for validating and monitoring data quality throughout the pipeline.

## Quality Dimensions

| Dimension | Definition | Example Check |
|-----------|------------|---------------|
| **Completeness** | Required data is present | No null values in required fields |
| **Accuracy** | Data reflects reality | Prices within expected range |
| **Consistency** | Data agrees across systems | Order totals match line items |
| **Timeliness** | Data is fresh enough | Table updated within SLA |
| **Uniqueness** | No unwanted duplicates | Primary keys are unique |
| **Validity** | Data conforms to rules | Email matches regex pattern |

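The checks in the rest of this document treat each dimension as a hard pass/fail gate. Where a softer signal is useful, the same dimensions can be scored as fractions and trended over time. A minimal sketch, assuming a PySpark `DataFrame` and an illustrative key column:

```python
from pyspark.sql import DataFrame, functions as F

def dimension_scores(df: DataFrame, key: str = "order_id") -> dict[str, float]:
    """Score completeness and uniqueness as fractions in [0, 1] (illustrative key column)."""
    total = df.count()
    if total == 0:
        return {"completeness": 0.0, "uniqueness": 0.0}
    non_null = df.filter(F.col(key).isNotNull()).count()
    distinct = df.select(key).distinct().count()
    return {
        "completeness": non_null / total,  # share of rows with the required key
        "uniqueness": distinct / total,    # 1.0 means every row has a unique key
    }
```
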
## Validation Patterns

### Schema Validation

Enforce the expected schema before processing.

```python
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, StringType, DecimalType, DateType

class SchemaError(Exception):
    """Raised when a DataFrame does not match the expected schema."""

expected_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_date", DateType(), nullable=False),
    StructField("total_amount", DecimalType(12, 2), nullable=False),
])

def validate_schema(df: DataFrame) -> DataFrame:
    """Validate DataFrame matches expected schema."""
    actual_fields = {f.name: f for f in df.schema.fields}

    for expected_field in expected_schema.fields:
        if expected_field.name not in actual_fields:
            raise SchemaError(f"Missing column: {expected_field.name}")

        actual = actual_fields[expected_field.name]
        if actual.dataType != expected_field.dataType:
            raise SchemaError(
                f"Type mismatch for {expected_field.name}: "
                f"expected {expected_field.dataType}, got {actual.dataType}"
            )

    return df
```

### Null Checks

```python
from pyspark.sql import DataFrame, functions as F

class DataQualityError(Exception):
    """Raised when a data quality check fails."""

def check_required_columns(df: DataFrame, required: list[str]) -> DataFrame:
    """Ensure required columns have no nulls."""
    for column in required:
        null_count = df.filter(F.col(column).isNull()).count()
        if null_count > 0:
            raise DataQualityError(f"Found {null_count} nulls in required column: {column}")
    return df

# Usage
df = check_required_columns(df, ["order_id", "customer_id", "order_date"])
```

### Uniqueness Checks

```python
def check_uniqueness(df: DataFrame, key_columns: list[str]) -> DataFrame:
    """Ensure no duplicates on key columns."""
    duplicate_count = (
        df.groupBy(key_columns)
        .count()
        .filter("count > 1")
        .count()
    )

    if duplicate_count > 0:
        raise DataQualityError(f"Found {duplicate_count} duplicate keys on {key_columns}")

    return df
```

### Range Checks

```python
def check_ranges(df: DataFrame, ranges: dict[str, tuple]) -> DataFrame:
    """Validate numeric columns are within expected ranges."""
    for column, (min_val, max_val) in ranges.items():
        out_of_range = df.filter(
            (F.col(column) < min_val) | (F.col(column) > max_val)
        ).count()

        if out_of_range > 0:
            raise DataQualityError(
                f"Found {out_of_range} values out of range [{min_val}, {max_val}] in {column}"
            )

    return df

# Usage
df = check_ranges(df, {
    "quantity": (1, 10000),
    "unit_price": (0.01, 100000),
    "discount_pct": (0, 100),
})
```

### Referential Integrity

```python
def check_referential_integrity(
    df: DataFrame,
    foreign_key: str,
    reference_table: str,
    reference_key: str,
) -> DataFrame:
    """Ensure foreign keys exist in reference table."""
    # Assumes an active SparkSession bound to `spark`
    reference_df = spark.table(reference_table).select(reference_key).distinct()

    orphans = (
        df.select(foreign_key)
        .distinct()
        .join(reference_df, df[foreign_key] == reference_df[reference_key], "left_anti")
    )

    orphan_count = orphans.count()
    if orphan_count > 0:
        raise DataQualityError(
            f"Found {orphan_count} orphan keys in {foreign_key} not in {reference_table}"
        )

    return df
```

## Great Expectations Integration

```python
# Uses the legacy `great_expectations.dataset` API; newer GE releases
# organize the same expectation types into suites via the Data Context.
from great_expectations.dataset import SparkDFDataset

def validate_orders(df: DataFrame) -> DataFrame:
    """Apply comprehensive data quality checks."""
    ge_df = SparkDFDataset(df)

    results = ge_df.validate(expectation_suite={
        "expectations": [
            # Completeness
            {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "order_id"}},
            {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "customer_id"}},

            # Uniqueness
            {"expectation_type": "expect_column_values_to_be_unique", "kwargs": {"column": "order_id"}},

            # Validity
            {"expectation_type": "expect_column_values_to_match_regex",
             "kwargs": {"column": "email", "regex": r"^[\w.-]+@[\w.-]+\.\w+$"}},

            # Range
            {"expectation_type": "expect_column_values_to_be_between",
             "kwargs": {"column": "total_amount", "min_value": 0, "max_value": 1000000}},

            # Set membership
            {"expectation_type": "expect_column_values_to_be_in_set",
             "kwargs": {"column": "status", "value_set": ["pending", "confirmed", "shipped", "delivered"]}},
        ]
    })

    if not results.success:
        failed = [r for r in results.results if not r.success]
        raise DataQualityError(f"Validation failed: {failed}")

    return df
```

## DBT Tests

### Built-in Tests

```yaml
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique

      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id

      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered']

      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
```

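Note that `dbt_utils.accepted_range` comes from the `dbt-utils` package, which must be declared in `packages.yml` and installed with `dbt deps`; the suite above then runs with, for example, `dbt test --select orders`.
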
### Custom Tests

```sql
-- tests/assert_order_totals_match_line_items.sql
-- Ensure order totals equal sum of line items
-- Note: orders with no line items yield a NULL difference and are not
-- flagged; wrap calculated_total in COALESCE if they should fail too.

SELECT o.order_id
FROM {{ ref('orders') }} o
LEFT JOIN (
    SELECT order_id, SUM(quantity * unit_price) as calculated_total
    FROM {{ ref('order_items') }}
    GROUP BY order_id
) li ON o.order_id = li.order_id
WHERE ABS(o.total_amount - li.calculated_total) > 0.01
```

## Data Freshness Monitoring

### Freshness Checks

```python
from dataclasses import dataclass

# `Alert` is assumed to be a simple dataclass (severity, table, message)
# shared by the monitoring code; `spark` is an active SparkSession.

@dataclass
class FreshnessConfig:
    table: str
    timestamp_column: str
    max_delay_hours: int
    severity: str  # "critical" | "warning"

FRESHNESS_CHECKS = [
    FreshnessConfig("curated.orders", "order_date", 2, "critical"),
    FreshnessConfig("curated.inventory", "updated_at", 1, "critical"),
    FreshnessConfig("marts.daily_sales", "report_date", 24, "warning"),
]

def check_freshness() -> list[Alert]:
    """Monitor data freshness and alert on SLA breaches."""
    alerts = []

    for config in FRESHNESS_CHECKS:
        result = spark.sql(f"""
            SELECT
                MAX({config.timestamp_column}) as max_ts,
                TIMESTAMPDIFF(HOUR, MAX({config.timestamp_column}), CURRENT_TIMESTAMP) as delay_hours
            FROM {config.table}
        """).collect()[0]

        if result["delay_hours"] > config.max_delay_hours:
            alerts.append(Alert(
                severity=config.severity,
                table=config.table,
                message=f"Data is {result['delay_hours']}h stale (SLA: {config.max_delay_hours}h)",
            ))

    return alerts
```

### DBT Source Freshness

```yaml
# models/sources.yml
version: 2

sources:
  - name: raw
    database: raw_db
    tables:
      - name: orders
        freshness:
          warn_after: {count: 2, period: hour}
          error_after: {count: 6, period: hour}
        loaded_at_field: _loaded_at
```

+ ## Anomaly Detection
284
+
285
+ ### Volume Anomalies
286
+
287
+ ```python
288
+ def detect_volume_anomaly(
289
+ table: str,
290
+ partition_column: str,
291
+ lookback_days: int = 30,
292
+ threshold_std: float = 3.0,
293
+ ) -> Optional[Alert]:
294
+ """Detect unusual record counts."""
295
+ stats = spark.sql(f"""
296
+ WITH daily_counts AS (
297
+ SELECT {partition_column}, COUNT(*) as cnt
298
+ FROM {table}
299
+ WHERE {partition_column} >= CURRENT_DATE - INTERVAL {lookback_days} DAYS
300
+ GROUP BY {partition_column}
301
+ )
302
+ SELECT
303
+ AVG(cnt) as mean_count,
304
+ STDDEV(cnt) as std_count,
305
+ (SELECT cnt FROM daily_counts WHERE {partition_column} = CURRENT_DATE) as today_count
306
+ FROM daily_counts
307
+ WHERE {partition_column} < CURRENT_DATE
308
+ """).collect()[0]
309
+
310
+ z_score = abs(stats["today_count"] - stats["mean_count"]) / stats["std_count"]
311
+
312
+ if z_score > threshold_std:
313
+ return Alert(
314
+ severity="warning",
315
+ message=f"Volume anomaly: {stats['today_count']} records "
316
+ f"(expected {stats['mean_count']:.0f} ± {stats['std_count']:.0f})"
317
+ )
318
+ return None
319
+ ```
320
+
321
+ ### Distribution Drift
322
+
323
+ ```python
324
+ def detect_distribution_drift(
325
+ table: str,
326
+ column: str,
327
+ baseline_table: str,
328
+ threshold: float = 0.1,
329
+ ) -> Optional[Alert]:
330
+ """Detect changes in value distribution using KL divergence."""
331
+ current_dist = spark.sql(f"""
332
+ SELECT {column}, COUNT(*) / SUM(COUNT(*)) OVER() as pct
333
+ FROM {table}
334
+ WHERE _loaded_at >= CURRENT_DATE
335
+ GROUP BY {column}
336
+ """).toPandas().set_index(column)["pct"]
337
+
338
+ baseline_dist = spark.sql(f"""
339
+ SELECT {column}, COUNT(*) / SUM(COUNT(*)) OVER() as pct
340
+ FROM {baseline_table}
341
+ GROUP BY {column}
342
+ """).toPandas().set_index(column)["pct"]
343
+
344
+ # Calculate KL divergence
345
+ kl_divergence = sum(
346
+ current_dist.get(k, 0.001) * np.log(current_dist.get(k, 0.001) / baseline_dist.get(k, 0.001))
347
+ for k in baseline_dist.index
348
+ )
349
+
350
+ if kl_divergence > threshold:
351
+ return Alert(
352
+ severity="warning",
353
+ message=f"Distribution drift detected in {column}: KL={kl_divergence:.3f}"
354
+ )
355
+ return None
356
+ ```
357
+
358
+ ## Data Quality Metrics
359
+
360
+ ### Track Quality Over Time
361
+
362
+ ```python
363
+ def record_quality_metrics(
364
+ table: str,
365
+ df: DataFrame,
366
+ checks: dict[str, int], # check_name -> failure_count
367
+ ) -> None:
368
+ """Record quality metrics for trending."""
369
+ total_rows = df.count()
370
+
371
+ metrics_df = spark.createDataFrame([{
372
+ "table": table,
373
+ "check_timestamp": datetime.utcnow(),
374
+ "total_rows": total_rows,
375
+ **{f"{name}_failures": count for name, count in checks.items()},
376
+ **{f"{name}_pct": count / total_rows * 100 for name, count in checks.items()},
377
+ }])
378
+
379
+ metrics_df.write.mode("append").saveAsTable("metrics.data_quality")
380
+ ```
381
+
382
+ ### Quality Dashboard Query
383
+
384
+ ```sql
385
+ -- Quality trends over time
386
+ SELECT
387
+ table,
388
+ DATE(check_timestamp) as check_date,
389
+ AVG(null_failures_pct) as avg_null_pct,
390
+ AVG(duplicate_failures_pct) as avg_dup_pct,
391
+ AVG(range_failures_pct) as avg_range_pct
392
+ FROM metrics.data_quality
393
+ WHERE check_timestamp >= CURRENT_DATE - 30
394
+ GROUP BY table, DATE(check_timestamp)
395
+ ORDER BY check_date DESC;
396
+ ```
397
+
398
+ ## Best Practices
399
+
400
+ ### Fail Fast on Critical Issues
401
+
402
+ ```python
403
+ def validate_with_severity(df: DataFrame) -> DataFrame:
404
+ """Apply checks with different severity levels."""
405
+
406
+ # Critical: Pipeline must fail
407
+ critical_checks = [
408
+ ("order_id not null", df.filter("order_id IS NULL").count() == 0),
409
+ ("no duplicates", df.groupBy("order_id").count().filter("count > 1").count() == 0),
410
+ ]
411
+
412
+ for name, passed in critical_checks:
413
+ if not passed:
414
+ raise DataQualityError(f"Critical check failed: {name}")
415
+
416
+ # Warning: Log but continue
417
+ warning_checks = [
418
+ ("email valid", df.filter("email NOT RLIKE '^[\\w.-]+@[\\w.-]+\\.\\w+$'").count()),
419
+ ("future dates", df.filter("order_date > CURRENT_DATE").count()),
420
+ ]
421
+
422
+ for name, failure_count in warning_checks:
423
+ if failure_count > 0:
424
+ logger.warning(f"Quality warning - {name}: {failure_count} failures")
425
+ metrics.increment(f"quality.warnings.{name}", failure_count)
426
+
427
+ return df
428
+ ```
429
+
430
+ ### Quarantine Bad Records
431
+
432
+ ```python
433
+ def process_with_quarantine(df: DataFrame) -> DataFrame:
434
+ """Separate good and bad records."""
435
+
436
+ good_records = df.filter("""
437
+ order_id IS NOT NULL
438
+ AND customer_id IS NOT NULL
439
+ AND total_amount >= 0
440
+ """)
441
+
442
+ bad_records = df.subtract(good_records)
443
+
444
+ if bad_records.count() > 0:
445
+ # Write to quarantine for investigation
446
+ (bad_records
447
+ .withColumn("_quarantine_reason", F.lit("validation_failed"))
448
+ .withColumn("_quarantine_timestamp", F.current_timestamp())
449
+ .write.mode("append")
450
+ .saveAsTable("quarantine.orders"))
451
+
452
+ logger.warning(f"Quarantined {bad_records.count()} records")
453
+
454
+ return good_records
455
+ ```
`package/templates/data-engineering/.cursorrules/overview.md`
@@ -0,0 +1,85 @@
# Data Engineering

Guidelines for building robust, scalable data platforms and pipelines.

## Scope

This ruleset applies to:

- Batch data pipelines (ETL/ELT)
- Stream processing applications
- Data warehouses and lakehouses
- Data platform infrastructure
- Analytics engineering (DBT, SQL models)
- Data quality and observability systems

## Core Technologies

Data engineering typically involves:

- **Orchestration**: Airflow, Dagster, Prefect, Temporal
- **Batch Processing**: Spark, DBT, Pandas, Polars, SQL
- **Stream Processing**: Kafka, Flink, Spark Streaming
- **Storage**: Delta Lake, Iceberg, Parquet, cloud object storage
- **Warehouses**: Snowflake, BigQuery, Redshift, Databricks
- **Quality**: Great Expectations, Soda, DBT Tests

## Key Principles

### 1. Idempotency Is Non-Negotiable

Every pipeline must produce identical results when re-run with the same inputs. This enables safe retries, backfills, and debugging.

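One hedged sketch of the idea, assuming a Delta Lake table partitioned by `run_date` (table and column names are illustrative): each run deterministically overwrites exactly its own partition, so re-running a day neither duplicates rows nor touches other days.

```python
from pyspark.sql import DataFrame

def write_partition(df: DataFrame, run_date: str) -> None:
    """Idempotent write: replaces exactly one date partition on every run."""
    (df.write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", f"run_date = '{run_date}'")  # scoped overwrite
        .saveAsTable("curated.orders"))
```
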
33
+ ### 2. Data Quality Is a Feature
34
+
35
+ Data quality isn't an afterthought—it's a core feature. Validate early, monitor continuously, and alert proactively.
36
+
37
+ ### 3. Schema Is a Contract
38
+
39
+ Schema changes can break downstream consumers. Treat schemas as versioned contracts that require coordination.
40
+
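A minimal sketch of enforcing that contract, assuming the common convention that a change is backward compatible only if it adds nullable columns and never removes or retypes existing ones (the function name is illustrative):

```python
from pyspark.sql.types import StructType

def is_backward_compatible(old: StructType, new: StructType) -> bool:
    """New schema may only add nullable columns; existing ones keep name and type."""
    new_fields = {f.name: f for f in new.fields}
    old_names = {f.name for f in old.fields}

    for old_field in old.fields:
        replacement = new_fields.get(old_field.name)
        if replacement is None or replacement.dataType != old_field.dataType:
            return False  # a removed or retyped column breaks downstream consumers

    # Columns added by the new version must be nullable
    return all(f.nullable for name, f in new_fields.items() if name not in old_names)
```
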
### 4. Observability Over Debugging

Instrument everything in production. If you can't observe it, you can't operate it. Production data issues should be diagnosable from emitted metrics and logs, not reconstructed through ad-hoc queries after the fact.

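As one illustration of what "instrument everything" can mean in practice, a pipeline step might emit a structured log record per run (the field names here are assumptions, not a fixed schema):

```python
import json
import logging
import time

logger = logging.getLogger("pipeline")

def run_step(name: str, step_fn, df):
    """Run one pipeline step and emit a structured log line with run metrics."""
    start = time.monotonic()
    result = step_fn(df)
    logger.info(json.dumps({
        "step": name,
        "rows_out": result.count(),  # row count after the step
        "duration_s": round(time.monotonic() - start, 2),
    }))
    return result
```
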
### 5. Cost-Aware Engineering

Compute and storage have real costs. Understand cost implications of design decisions. Optimize deliberately, not prematurely.

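One concrete instance of the principle: filters on a table's partition column let the engine prune unread data, often the single biggest cost lever in a warehouse. A sketch, assuming an active SparkSession `spark` and an `orders` table partitioned by `order_date`:

```python
# Prunes to a single partition: only files for 2024-06-01 are read
fast = spark.sql(
    "SELECT COUNT(*) FROM orders WHERE order_date = DATE '2024-06-01'"
)

# Defeats pruning on many engines: wrapping the partition column in a
# function hides the literal predicate and forces a much wider scan
slow = spark.sql(
    "SELECT COUNT(*) FROM orders WHERE YEAR(order_date) = 2024"
)
```
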
## Project Structure

```
data-platform/
├── pipelines/              # Pipeline definitions
│   ├── ingestion/          # Source → Raw layer
│   ├── transformation/     # Raw → Curated layer
│   └── serving/            # Curated → Consumption layer
├── models/                 # DBT or SQL models
│   ├── staging/            # 1:1 source mappings
│   ├── intermediate/       # Business logic transforms
│   └── marts/              # Consumption-ready tables
├── schemas/                # Schema definitions
├── quality/                # Data quality checks
├── tests/                  # Pipeline tests
└── docs/                   # Documentation
```

## Data Architecture Layers

| Layer | Also Known As | Purpose | Freshness |
|-------|---------------|---------|-----------|
| **Raw/Bronze** | Landing, Staging | Exact copy of source | Minutes to hours |
| **Curated/Silver** | Cleaned, Conformed | Validated, typed, deduplicated | Hours |
| **Marts/Gold** | Aggregates, Features | Business-ready datasets | Hours to daily |

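To make the layer boundaries concrete, a minimal sketch of a Raw→Curated promotion, with illustrative table names: it types, deduplicates, and validates on the way into Silver.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def bronze_to_silver() -> None:
    """Promote raw orders to curated: cast types, keep latest version per key."""
    raw = spark.table("raw.orders")  # assumes an active SparkSession `spark`

    latest = Window.partitionBy("order_id").orderBy(F.col("_loaded_at").desc())
    curated = (
        raw.withColumn("total_amount", F.col("total_amount").cast("decimal(12,2)"))
        .withColumn("_rn", F.row_number().over(latest))
        .filter("_rn = 1")  # deduplicate: latest record per order_id
        .drop("_rn")
        .filter("order_id IS NOT NULL")
    )

    curated.write.mode("overwrite").saveAsTable("curated.orders")
```
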
## Definition of Done

A data pipeline is complete when:

- [ ] Produces correct output for all test cases
- [ ] Idempotency verified (re-run produces same result)
- [ ] Data quality checks implemented
- [ ] Schema documented and versioned
- [ ] Monitoring and alerting configured
- [ ] Runbook/playbook documented
- [ ] Cost estimate understood