agentic-team-templates 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +280 -0
- package/bin/cli.js +5 -0
- package/package.json +47 -0
- package/src/index.js +521 -0
- package/templates/_shared/code-quality.md +162 -0
- package/templates/_shared/communication.md +114 -0
- package/templates/_shared/core-principles.md +62 -0
- package/templates/_shared/git-workflow.md +165 -0
- package/templates/_shared/security-fundamentals.md +173 -0
- package/templates/blockchain/.cursorrules/defi-patterns.md +520 -0
- package/templates/blockchain/.cursorrules/gas-optimization.md +339 -0
- package/templates/blockchain/.cursorrules/overview.md +130 -0
- package/templates/blockchain/.cursorrules/security.md +318 -0
- package/templates/blockchain/.cursorrules/smart-contracts.md +364 -0
- package/templates/blockchain/.cursorrules/testing.md +415 -0
- package/templates/blockchain/.cursorrules/web3-integration.md +538 -0
- package/templates/blockchain/CLAUDE.md +389 -0
- package/templates/cli-tools/.cursorrules/architecture.md +412 -0
- package/templates/cli-tools/.cursorrules/arguments.md +406 -0
- package/templates/cli-tools/.cursorrules/distribution.md +546 -0
- package/templates/cli-tools/.cursorrules/error-handling.md +455 -0
- package/templates/cli-tools/.cursorrules/overview.md +136 -0
- package/templates/cli-tools/.cursorrules/testing.md +537 -0
- package/templates/cli-tools/.cursorrules/user-experience.md +545 -0
- package/templates/cli-tools/CLAUDE.md +356 -0
- package/templates/data-engineering/.cursorrules/data-modeling.md +367 -0
- package/templates/data-engineering/.cursorrules/data-quality.md +455 -0
- package/templates/data-engineering/.cursorrules/overview.md +85 -0
- package/templates/data-engineering/.cursorrules/performance.md +339 -0
- package/templates/data-engineering/.cursorrules/pipeline-design.md +280 -0
- package/templates/data-engineering/.cursorrules/security.md +460 -0
- package/templates/data-engineering/.cursorrules/testing.md +452 -0
- package/templates/data-engineering/CLAUDE.md +974 -0
- package/templates/devops-sre/.cursorrules/capacity-planning.md +653 -0
- package/templates/devops-sre/.cursorrules/change-management.md +584 -0
- package/templates/devops-sre/.cursorrules/chaos-engineering.md +651 -0
- package/templates/devops-sre/.cursorrules/disaster-recovery.md +641 -0
- package/templates/devops-sre/.cursorrules/incident-management.md +565 -0
- package/templates/devops-sre/.cursorrules/observability.md +714 -0
- package/templates/devops-sre/.cursorrules/overview.md +230 -0
- package/templates/devops-sre/.cursorrules/postmortems.md +588 -0
- package/templates/devops-sre/.cursorrules/runbooks.md +760 -0
- package/templates/devops-sre/.cursorrules/slo-sli.md +617 -0
- package/templates/devops-sre/.cursorrules/toil-reduction.md +567 -0
- package/templates/devops-sre/CLAUDE.md +1007 -0
- package/templates/documentation/.cursorrules/adr.md +277 -0
- package/templates/documentation/.cursorrules/api-documentation.md +411 -0
- package/templates/documentation/.cursorrules/code-comments.md +253 -0
- package/templates/documentation/.cursorrules/maintenance.md +260 -0
- package/templates/documentation/.cursorrules/overview.md +82 -0
- package/templates/documentation/.cursorrules/readme-standards.md +306 -0
- package/templates/documentation/CLAUDE.md +120 -0
- package/templates/fullstack/.cursorrules/api-contracts.md +331 -0
- package/templates/fullstack/.cursorrules/architecture.md +298 -0
- package/templates/fullstack/.cursorrules/overview.md +109 -0
- package/templates/fullstack/.cursorrules/shared-types.md +348 -0
- package/templates/fullstack/.cursorrules/testing.md +386 -0
- package/templates/fullstack/CLAUDE.md +349 -0
- package/templates/ml-ai/.cursorrules/data-engineering.md +483 -0
- package/templates/ml-ai/.cursorrules/deployment.md +601 -0
- package/templates/ml-ai/.cursorrules/model-development.md +538 -0
- package/templates/ml-ai/.cursorrules/monitoring.md +658 -0
- package/templates/ml-ai/.cursorrules/overview.md +131 -0
- package/templates/ml-ai/.cursorrules/security.md +637 -0
- package/templates/ml-ai/.cursorrules/testing.md +678 -0
- package/templates/ml-ai/CLAUDE.md +1136 -0
- package/templates/mobile/.cursorrules/navigation.md +246 -0
- package/templates/mobile/.cursorrules/offline-first.md +302 -0
- package/templates/mobile/.cursorrules/overview.md +71 -0
- package/templates/mobile/.cursorrules/performance.md +345 -0
- package/templates/mobile/.cursorrules/testing.md +339 -0
- package/templates/mobile/CLAUDE.md +233 -0
- package/templates/platform-engineering/.cursorrules/ci-cd.md +778 -0
- package/templates/platform-engineering/.cursorrules/developer-experience.md +632 -0
- package/templates/platform-engineering/.cursorrules/infrastructure-as-code.md +600 -0
- package/templates/platform-engineering/.cursorrules/kubernetes.md +710 -0
- package/templates/platform-engineering/.cursorrules/observability.md +747 -0
- package/templates/platform-engineering/.cursorrules/overview.md +215 -0
- package/templates/platform-engineering/.cursorrules/security.md +855 -0
- package/templates/platform-engineering/.cursorrules/testing.md +878 -0
- package/templates/platform-engineering/CLAUDE.md +850 -0
- package/templates/utility-agent/.cursorrules/action-control.md +284 -0
- package/templates/utility-agent/.cursorrules/context-management.md +186 -0
- package/templates/utility-agent/.cursorrules/hallucination-prevention.md +253 -0
- package/templates/utility-agent/.cursorrules/overview.md +78 -0
- package/templates/utility-agent/.cursorrules/token-optimization.md +369 -0
- package/templates/utility-agent/CLAUDE.md +513 -0
- package/templates/web-backend/.cursorrules/api-design.md +255 -0
- package/templates/web-backend/.cursorrules/authentication.md +309 -0
- package/templates/web-backend/.cursorrules/database-patterns.md +298 -0
- package/templates/web-backend/.cursorrules/error-handling.md +366 -0
- package/templates/web-backend/.cursorrules/overview.md +69 -0
- package/templates/web-backend/.cursorrules/security.md +358 -0
- package/templates/web-backend/.cursorrules/testing.md +395 -0
- package/templates/web-backend/CLAUDE.md +366 -0
- package/templates/web-frontend/.cursorrules/accessibility.md +296 -0
- package/templates/web-frontend/.cursorrules/component-patterns.md +204 -0
- package/templates/web-frontend/.cursorrules/overview.md +72 -0
- package/templates/web-frontend/.cursorrules/performance.md +325 -0
- package/templates/web-frontend/.cursorrules/state-management.md +227 -0
- package/templates/web-frontend/.cursorrules/styling.md +271 -0
- package/templates/web-frontend/.cursorrules/testing.md +311 -0
- package/templates/web-frontend/CLAUDE.md +399 -0
package/templates/data-engineering/.cursorrules/data-quality.md
@@ -0,0 +1,455 @@
# Data Quality

Patterns for validating, monitoring, and ensuring data quality.

## Quality Dimensions

| Dimension | Definition | Example Check |
|-----------|------------|---------------|
| **Completeness** | Required data is present | No null values in required fields |
| **Accuracy** | Data reflects reality | Prices within expected range |
| **Consistency** | Data agrees across systems | Order totals match line items |
| **Timeliness** | Data is fresh enough | Table updated within SLA |
| **Uniqueness** | No unwanted duplicates | Primary keys are unique |
| **Validity** | Data conforms to rules | Email matches regex pattern |

## Validation Patterns

### Schema Validation

Enforce expected schema before processing.
```python
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, StringType, DecimalType, DateType

expected_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_date", DateType(), nullable=False),
    StructField("total_amount", DecimalType(12, 2), nullable=False),
])

def validate_schema(df: DataFrame) -> DataFrame:
    """Validate DataFrame matches expected schema."""
    actual_fields = {f.name: f for f in df.schema.fields}

    for expected_field in expected_schema.fields:
        if expected_field.name not in actual_fields:
            raise SchemaError(f"Missing column: {expected_field.name}")

        actual = actual_fields[expected_field.name]
        if actual.dataType != expected_field.dataType:
            raise SchemaError(
                f"Type mismatch for {expected_field.name}: "
                f"expected {expected_field.dataType}, got {actual.dataType}"
            )

    return df
```

### Null Checks

```python
def check_required_columns(df: DataFrame, required: list[str]) -> DataFrame:
    """Ensure required columns have no nulls."""
    for column in required:
        null_count = df.filter(F.col(column).isNull()).count()
        if null_count > 0:
            raise DataQualityError(f"Found {null_count} nulls in required column: {column}")
    return df

# Usage
df = check_required_columns(df, ["order_id", "customer_id", "order_date"])
```

### Uniqueness Checks

```python
def check_uniqueness(df: DataFrame, key_columns: list[str]) -> DataFrame:
    """Ensure no duplicates on key columns."""
    duplicate_count = (
        df.groupBy(key_columns)
        .count()
        .filter("count > 1")
        .count()
    )

    if duplicate_count > 0:
        raise DataQualityError(f"Found {duplicate_count} duplicate keys on {key_columns}")

    return df
```

### Range Checks

```python
def check_ranges(df: DataFrame, ranges: dict[str, tuple]) -> DataFrame:
    """Validate numeric columns are within expected ranges."""
    for column, (min_val, max_val) in ranges.items():
        out_of_range = df.filter(
            (F.col(column) < min_val) | (F.col(column) > max_val)
        ).count()

        if out_of_range > 0:
            raise DataQualityError(
                f"Found {out_of_range} values out of range [{min_val}, {max_val}] in {column}"
            )

    return df

# Usage
df = check_ranges(df, {
    "quantity": (1, 10000),
    "unit_price": (0.01, 100000),
    "discount_pct": (0, 100),
})
```

### Referential Integrity

```python
def check_referential_integrity(
    df: DataFrame,
    foreign_key: str,
    reference_table: str,
    reference_key: str,
) -> DataFrame:
    """Ensure foreign keys exist in reference table."""
    reference_df = spark.table(reference_table).select(reference_key).distinct()

    orphans = (
        df.select(foreign_key)
        .distinct()
        .join(reference_df, df[foreign_key] == reference_df[reference_key], "left_anti")
    )

    orphan_count = orphans.count()
    if orphan_count > 0:
        raise DataQualityError(
            f"Found {orphan_count} orphan keys in {foreign_key} not in {reference_table}"
        )

    return df
```
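
These checks compose naturally into a single validation step per table. A minimal sketch of one possible composition, reusing the helpers above (the function name and chosen thresholds are illustrative):

```python
def validate_order_batch(df: DataFrame) -> DataFrame:
    """Run schema, null, uniqueness, and range checks; any failure raises."""
    df = validate_schema(df)
    df = check_required_columns(df, ["order_id", "customer_id", "order_date"])
    df = check_uniqueness(df, ["order_id"])
    df = check_ranges(df, {"total_amount": (0, 1_000_000)})
    return df
```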

## Great Expectations Integration

```python
from great_expectations.dataset import SparkDFDataset

def validate_orders(df: DataFrame) -> DataFrame:
    """Apply comprehensive data quality checks."""
    ge_df = SparkDFDataset(df)

    results = ge_df.validate(expectation_suite={
        "expectations": [
            # Completeness
            {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "order_id"}},
            {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "customer_id"}},

            # Uniqueness
            {"expectation_type": "expect_column_values_to_be_unique", "kwargs": {"column": "order_id"}},

            # Validity
            {"expectation_type": "expect_column_values_to_match_regex",
             "kwargs": {"column": "email", "regex": r"^[\w.-]+@[\w.-]+\.\w+$"}},

            # Range
            {"expectation_type": "expect_column_values_to_be_between",
             "kwargs": {"column": "total_amount", "min_value": 0, "max_value": 1000000}},

            # Set membership
            {"expectation_type": "expect_column_values_to_be_in_set",
             "kwargs": {"column": "status", "value_set": ["pending", "confirmed", "shipped", "delivered"]}},
        ]
    })

    if not results.success:
        failed = [r for r in results.results if not r.success]
        raise DataQualityError(f"Validation failed: {failed}")

    return df
```

## DBT Tests

### Built-in Tests

```yaml
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique

      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id

      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered']

      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
```

### Custom Tests

```sql
-- tests/assert_order_totals_match_line_items.sql
-- Ensure order totals equal sum of line items

SELECT o.order_id
FROM {{ ref('orders') }} o
LEFT JOIN (
    SELECT order_id, SUM(quantity * unit_price) as calculated_total
    FROM {{ ref('order_items') }}
    GROUP BY order_id
) li ON o.order_id = li.order_id
WHERE ABS(o.total_amount - li.calculated_total) > 0.01
```

## Data Freshness Monitoring

### Freshness Checks

```python
@dataclass
class FreshnessConfig:
    table: str
    timestamp_column: str
    max_delay_hours: int
    severity: str  # "critical" | "warning"

FRESHNESS_CHECKS = [
    FreshnessConfig("curated.orders", "order_date", 2, "critical"),
    FreshnessConfig("curated.inventory", "updated_at", 1, "critical"),
    FreshnessConfig("marts.daily_sales", "report_date", 24, "warning"),
]

def check_freshness() -> list[Alert]:
    """Monitor data freshness and alert on SLA breaches."""
    alerts = []

    for config in FRESHNESS_CHECKS:
        result = spark.sql(f"""
            SELECT
                MAX({config.timestamp_column}) as max_ts,
                TIMESTAMPDIFF(HOUR, MAX({config.timestamp_column}), CURRENT_TIMESTAMP) as delay_hours
            FROM {config.table}
        """).collect()[0]

        if result["delay_hours"] > config.max_delay_hours:
            alerts.append(Alert(
                severity=config.severity,
                table=config.table,
                message=f"Data is {result['delay_hours']}h stale (SLA: {config.max_delay_hours}h)",
            ))

    return alerts
```

### DBT Source Freshness

```yaml
# models/sources.yml
version: 2

sources:
  - name: raw
    database: raw_db
    tables:
      - name: orders
        freshness:
          warn_after: {count: 2, period: hour}
          error_after: {count: 6, period: hour}
        loaded_at_field: _loaded_at
```
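
Once configured, source freshness is typically checked with `dbt source freshness` on a schedule, and the results can feed the same alerting path as the Python checks above.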

## Anomaly Detection

### Volume Anomalies

```python
def detect_volume_anomaly(
    table: str,
    partition_column: str,
    lookback_days: int = 30,
    threshold_std: float = 3.0,
) -> Optional[Alert]:
    """Detect unusual record counts."""
    stats = spark.sql(f"""
        WITH daily_counts AS (
            SELECT {partition_column}, COUNT(*) as cnt
            FROM {table}
            WHERE {partition_column} >= CURRENT_DATE - INTERVAL {lookback_days} DAYS
            GROUP BY {partition_column}
        )
        SELECT
            AVG(cnt) as mean_count,
            STDDEV(cnt) as std_count,
            (SELECT cnt FROM daily_counts WHERE {partition_column} = CURRENT_DATE) as today_count
        FROM daily_counts
        WHERE {partition_column} < CURRENT_DATE
    """).collect()[0]

    z_score = abs(stats["today_count"] - stats["mean_count"]) / stats["std_count"]

    if z_score > threshold_std:
        return Alert(
            severity="warning",
            message=f"Volume anomaly: {stats['today_count']} records "
                    f"(expected {stats['mean_count']:.0f} ± {stats['std_count']:.0f})"
        )
    return None
```

### Distribution Drift

```python
def detect_distribution_drift(
    table: str,
    column: str,
    baseline_table: str,
    threshold: float = 0.1,
) -> Optional[Alert]:
    """Detect changes in value distribution using KL divergence."""
    current_dist = spark.sql(f"""
        SELECT {column}, COUNT(*) / SUM(COUNT(*)) OVER() as pct
        FROM {table}
        WHERE _loaded_at >= CURRENT_DATE
        GROUP BY {column}
    """).toPandas().set_index(column)["pct"]

    baseline_dist = spark.sql(f"""
        SELECT {column}, COUNT(*) / SUM(COUNT(*)) OVER() as pct
        FROM {baseline_table}
        GROUP BY {column}
    """).toPandas().set_index(column)["pct"]

    # Calculate KL divergence
    kl_divergence = sum(
        current_dist.get(k, 0.001) * np.log(current_dist.get(k, 0.001) / baseline_dist.get(k, 0.001))
        for k in baseline_dist.index
    )

    if kl_divergence > threshold:
        return Alert(
            severity="warning",
            message=f"Distribution drift detected in {column}: KL={kl_divergence:.3f}"
        )
    return None
```

## Data Quality Metrics

### Track Quality Over Time

```python
def record_quality_metrics(
    table: str,
    df: DataFrame,
    checks: dict[str, int],  # check_name -> failure_count
) -> None:
    """Record quality metrics for trending."""
    total_rows = df.count()

    metrics_df = spark.createDataFrame([{
        "table": table,
        "check_timestamp": datetime.utcnow(),
        "total_rows": total_rows,
        **{f"{name}_failures": count for name, count in checks.items()},
        **{f"{name}_pct": count / total_rows * 100 for name, count in checks.items()},
    }])

    metrics_df.write.mode("append").saveAsTable("metrics.data_quality")
```

### Quality Dashboard Query

```sql
-- Quality trends over time
SELECT
    table,
    DATE(check_timestamp) as check_date,
    AVG(null_failures_pct) as avg_null_pct,
    AVG(duplicate_failures_pct) as avg_dup_pct,
    AVG(range_failures_pct) as avg_range_pct
FROM metrics.data_quality
WHERE check_timestamp >= CURRENT_DATE - 30
GROUP BY table, DATE(check_timestamp)
ORDER BY check_date DESC;
```

## Best Practices

### Fail Fast on Critical Issues

```python
def validate_with_severity(df: DataFrame) -> DataFrame:
    """Apply checks with different severity levels."""

    # Critical: Pipeline must fail
    critical_checks = [
        ("order_id not null", df.filter("order_id IS NULL").count() == 0),
        ("no duplicates", df.groupBy("order_id").count().filter("count > 1").count() == 0),
    ]

    for name, passed in critical_checks:
        if not passed:
            raise DataQualityError(f"Critical check failed: {name}")

    # Warning: Log but continue
    warning_checks = [
        ("email valid", df.filter("email NOT RLIKE '^[\\w.-]+@[\\w.-]+\\.\\w+$'").count()),
        ("future dates", df.filter("order_date > CURRENT_DATE").count()),
    ]

    for name, failure_count in warning_checks:
        if failure_count > 0:
            logger.warning(f"Quality warning - {name}: {failure_count} failures")
            metrics.increment(f"quality.warnings.{name}", failure_count)

    return df
```

### Quarantine Bad Records

```python
def process_with_quarantine(df: DataFrame) -> DataFrame:
    """Separate good and bad records."""

    good_records = df.filter("""
        order_id IS NOT NULL
        AND customer_id IS NOT NULL
        AND total_amount >= 0
    """)

    bad_records = df.subtract(good_records)

    if bad_records.count() > 0:
        # Write to quarantine for investigation
        (bad_records
            .withColumn("_quarantine_reason", F.lit("validation_failed"))
            .withColumn("_quarantine_timestamp", F.current_timestamp())
            .write.mode("append")
            .saveAsTable("quarantine.orders"))

        logger.warning(f"Quarantined {bad_records.count()} records")

    return good_records
```

package/templates/data-engineering/.cursorrules/overview.md
@@ -0,0 +1,85 @@
# Data Engineering

Guidelines for building robust, scalable data platforms and pipelines.

## Scope

This ruleset applies to:

- Batch data pipelines (ETL/ELT)
- Stream processing applications
- Data warehouses and lakehouses
- Data platform infrastructure
- Analytics engineering (DBT, SQL models)
- Data quality and observability systems

## Core Technologies

Data engineering typically involves:

- **Orchestration**: Airflow, Dagster, Prefect, Temporal
- **Batch Processing**: Spark, DBT, Pandas, Polars, SQL
- **Stream Processing**: Kafka, Flink, Spark Streaming
- **Storage**: Delta Lake, Iceberg, Parquet, cloud object storage
- **Warehouses**: Snowflake, BigQuery, Redshift, Databricks
- **Quality**: Great Expectations, Soda, DBT Tests

## Key Principles

### 1. Idempotency Is Non-Negotiable

Every pipeline must produce identical results when re-run with the same inputs. This enables safe retries, backfills, and debugging.
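
A common way to get there in batch jobs is to write to a deterministic partition and overwrite it on every run. A minimal PySpark sketch, with illustrative table and column names:

```python
from pyspark.sql import DataFrame, functions as F

def load_orders(df: DataFrame, run_date: str) -> None:
    """Idempotent load: re-running for the same run_date replaces the partition."""
    (df.withColumn("ds", F.lit(run_date))
       .write
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")  # only the touched partition is replaced
       .partitionBy("ds")
       .saveAsTable("curated.orders"))
```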

### 2. Data Quality Is a Feature

Data quality isn't an afterthought; it's a core feature. Validate early, monitor continuously, and alert proactively.

### 3. Schema Is a Contract

Schema changes can break downstream consumers. Treat schemas as versioned contracts that require coordination.
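
One way to operationalize the contract is to allow only additive changes. A sketch, assuming schemas are expressed as PySpark `StructType`s and a custom `SchemaError` exception as in the data-quality rules:

```python
from pyspark.sql.types import StructType

def assert_backward_compatible(contract: StructType, proposed: StructType) -> None:
    """Fail if a proposed schema drops or retypes columns in the published contract."""
    proposed_fields = {f.name: f.dataType for f in proposed.fields}
    for field in contract.fields:
        if field.name not in proposed_fields:
            raise SchemaError(f"Breaking change: column {field.name} removed")
        if proposed_fields[field.name] != field.dataType:
            raise SchemaError(f"Breaking change: column {field.name} retyped")
    # Columns present only in `proposed` are additive and allowed.
```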

### 4. Observability Over Debugging

Instrument everything in production. If you can't observe it, you can't operate it. Never debug production data issues by running ad-hoc queries.
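
At a minimum, every run should emit row counts, duration, and outcome to a metrics store. A minimal sketch; the `metrics` client and the wrapper itself are illustrative:

```python
import time

def run_with_metrics(pipeline_name: str, run_fn) -> int:
    """Wrap a pipeline run so outcome and duration are always recorded."""
    start = time.monotonic()
    try:
        rows_written = run_fn()
        metrics.gauge(f"{pipeline_name}.rows_written", rows_written)
        metrics.increment(f"{pipeline_name}.success")
        return rows_written
    except Exception:
        metrics.increment(f"{pipeline_name}.failure")
        raise
    finally:
        metrics.gauge(f"{pipeline_name}.duration_seconds", time.monotonic() - start)
```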

### 5. Cost-Aware Engineering

Compute and storage have real costs. Understand cost implications of design decisions. Optimize deliberately, not prematurely.

## Project Structure

```
data-platform/
├── pipelines/            # Pipeline definitions
│   ├── ingestion/        # Source → Raw layer
│   ├── transformation/   # Raw → Curated layer
│   └── serving/          # Curated → Consumption layer
├── models/               # DBT or SQL models
│   ├── staging/          # 1:1 source mappings
│   ├── intermediate/     # Business logic transforms
│   └── marts/            # Consumption-ready tables
├── schemas/              # Schema definitions
├── quality/              # Data quality checks
├── tests/                # Pipeline tests
└── docs/                 # Documentation
```

## Data Architecture Layers

| Layer | Also Known As | Purpose | Freshness |
|-------|---------------|---------|-----------|
| **Raw/Bronze** | Landing, Staging | Exact copy of source | Minutes-Hours |
| **Curated/Silver** | Cleaned, Conformed | Validated, typed, deduplicated | Hours |
| **Marts/Gold** | Aggregates, Features | Business-ready datasets | Hours-Daily |

## Definition of Done

A data pipeline is complete when:

- [ ] Produces correct output for all test cases
- [ ] Idempotency verified (re-run produces same result)
- [ ] Data quality checks implemented
- [ ] Schema documented and versioned
- [ ] Monitoring and alerting configured
- [ ] Runbook/playbook documented
- [ ] Cost estimate understood