@zigrivers/scaffold 3.8.0 → 3.9.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +75 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +167 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
@@ -0,0 +1,406 @@
---
name: data-pipeline-testing
description: Unit tests for transforms, integration tests, data quality tests, and performance tests for data pipelines
topics: [data-pipeline, testing, unit-tests, integration-tests, data-quality-tests, performance-tests, tdd, pytest]
---

Data pipeline testing requires a layered strategy: fast unit tests for transformation logic, integration tests against containerized infrastructure, data quality tests that run inline in production, and performance tests to catch throughput regressions. The most common testing gap in pipeline projects is that DAG files and orchestration logic are tested but transformation functions (the business logic) are not. Transformation functions must be the most heavily tested component because the correctness of results depends entirely on them.

## Summary

Test transformation functions as pure functions with unit tests: no infrastructure, no side effects, fast execution. Test the pipeline end to end against containerized Kafka, Postgres, and object storage. Run data quality tests inline as pipeline gates using Great Expectations or dbt tests. Test performance against a realistic data volume to catch throughput regressions before they reach production. Target 90%+ coverage on `src/transforms/`.

## Deep Guidance

### Unit Tests for Transforms

Transformation functions should be pure functions: they take data in and return data out. This makes them straightforward to unit test:

```python
# src/transforms/payments/normalize.py
# convert_to_usd is not shown in this excerpt; it is assumed to be defined
# or imported elsewhere in this module.
def normalize_transaction_amounts(records: list[dict]) -> list[dict]:
    """Normalize all transaction amounts to USD using embedded exchange rates."""
    result = []
    for record in records:
        normalized = {**record}  # shallow copy so the input record is not mutated
        normalized["amount_usd"] = convert_to_usd(
            record["amount"],
            record.get("currency", "USD"),
        )
        result.append(normalized)
    return result
```

```python
# tests/unit/transforms/payments/test_normalize.py
import pytest
from src.transforms.payments.normalize import normalize_transaction_amounts


class TestNormalizeTransactionAmounts:
    def test_usd_passthrough(self):
        """USD transactions are not converted."""
        records = [{"amount": 100.0, "currency": "USD"}]
        result = normalize_transaction_amounts(records)
        assert result[0]["amount_usd"] == 100.0

    def test_eur_conversion(self):
        """EUR amounts are converted to USD."""
        records = [{"amount": 100.0, "currency": "EUR"}]
        result = normalize_transaction_amounts(records)
        assert result[0]["amount_usd"] > 0
        assert result[0]["amount_usd"] != 100.0  # was converted

    def test_missing_currency_defaults_to_usd(self):
        """Records without currency field default to USD."""
        records = [{"amount": 50.0}]
        result = normalize_transaction_amounts(records)
        assert result[0]["amount_usd"] == 50.0

    def test_original_record_unchanged(self):
        """Transform does not mutate input records."""
        original = {"amount": 100.0, "currency": "EUR"}
        records = [original]
        normalize_transaction_amounts(records)
        assert original == {"amount": 100.0, "currency": "EUR"}

    def test_empty_input_returns_empty(self):
        """Empty list returns empty list."""
        assert normalize_transaction_amounts([]) == []

    def test_preserves_all_original_fields(self):
        """Additional fields on input records are preserved in output."""
        records = [{"amount": 100.0, "currency": "USD", "transaction_id": "txn_abc"}]
        result = normalize_transaction_amounts(records)
        assert result[0]["transaction_id"] == "txn_abc"

    @pytest.mark.parametrize("amount,currency,expected_range", [
        (0.0, "USD", (0.0, 0.0)),
        (1_000_000.0, "USD", (1_000_000.0, 1_000_000.0)),
        (-1.0, "USD", (-1.0, -1.0)),  # negative amounts should be preserved for refunds
    ])
    def test_boundary_amounts(self, amount, currency, expected_range):
        """Test boundary amount values."""
        records = [{"amount": amount, "currency": currency}]
        result = normalize_transaction_amounts(records)
        low, high = expected_range
        assert low <= result[0]["amount_usd"] <= high
```

**Transform test patterns**

Test these scenarios for every transform function:
1. Happy path: valid input produces expected output
2. Empty input: empty list returns empty list
3. Null/missing optional fields: handled gracefully with documented defaults
4. Null/missing required fields: raises expected exception
5. Boundary values: zero, negative, very large numbers
6. Input immutability: input records are not mutated
7. Edge cases specific to the business domain
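
Scenario 4 in the list above (missing required fields) is not exercised by the `normalize_transaction_amounts` tests shown earlier. The following is a minimal sketch of such a test, assuming a hypothetical `require_fields` validation helper and a hypothetical `REQUIRED_FIELDS` tuple; neither is part of the original module. The tests use plain `assert` statements so the block has no dependencies; with pytest available, `pytest.raises(ValueError, match="amount")` is the more idiomatic way to express the first test.

```python
# Hypothetical required schema, for illustration only.
REQUIRED_FIELDS = ("transaction_id", "amount")


def require_fields(record: dict, required: tuple = REQUIRED_FIELDS) -> dict:
    """Return the record unchanged, or raise if a required field is missing or None."""
    missing = [name for name in required if record.get(name) is None]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return record


def test_missing_required_field_raises():
    """A record without `amount` is rejected with a descriptive error."""
    try:
        require_fields({"transaction_id": "txn_abc"})
    except ValueError as exc:
        assert "amount" in str(exc)
    else:
        raise AssertionError("expected ValueError for missing amount")


def test_complete_record_passes_through():
    """A complete record is returned as-is (same object, not a copy)."""
    record = {"transaction_id": "txn_abc", "amount": 10.0}
    assert require_fields(record) is record
```

Raising on a missing required field keeps the failure visible at the transform boundary instead of letting a `KeyError` surface somewhere deeper in the pipeline.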

**Fixture-based tests**

For complex transformations, use fixture files:

```python
# tests/unit/transforms/payments/test_aggregate.py
import json
from pathlib import Path

import pytest

from src.transforms.payments.aggregate import compute_daily_revenue


def load_fixture(name: str):
    """Load a JSON fixture; returns a list or a dict depending on the file."""
    path = Path(__file__).parent.parent.parent / "fixtures" / name
    return json.loads(path.read_text())


def test_daily_revenue_aggregation():
    """Daily revenue correctly sums transactions by currency."""
    transactions = load_fixture("payments/raw/valid_transactions.json")
    expected = load_fixture("payments/expected/aggregated_revenue.json")

    result = compute_daily_revenue(transactions, date="2024-01-15")

    assert result["date"] == "2024-01-15"
    assert result["total_usd"] == pytest.approx(expected["total_usd"], rel=0.001)
    assert result["transaction_count"] == expected["transaction_count"]
```

### Unit Tests for Sources and Sinks

Sources and sinks interact with external systems. Test them using mocks or test doubles:

```python
# tests/unit/sources/test_stripe_reader.py
from datetime import datetime, timezone
from unittest.mock import MagicMock, patch

import stripe

from src.sources.stripe.charges import StripeChargesReader


class TestStripeChargesReader:
    @patch("src.sources.stripe.charges.stripe.Charge.list")
    def test_reads_charges_for_time_window(self, mock_list):
        """Reader requests charges within the specified time window."""
        mock_list.return_value = MagicMock(
            auto_paging_iter=lambda: iter([
                {"id": "ch_001", "amount": 1000, "currency": "usd", "created": 1705329825},
            ])
        )

        reader = StripeChargesReader(api_key="sk_test_fake")
        start = datetime(2024, 1, 15, 0, 0, tzinfo=timezone.utc)
        end = datetime(2024, 1, 15, 1, 0, tzinfo=timezone.utc)

        records = list(reader.read(start, end))

        mock_list.assert_called_once_with(
            created={"gte": int(start.timestamp()), "lt": int(end.timestamp())},
            limit=100,
        )
        assert len(records) == 1
        assert records[0]["id"] == "ch_001"

    @patch("src.sources.stripe.charges.stripe.Charge.list")
    def test_handles_rate_limit_with_retry(self, mock_list):
        """Reader retries on rate limit errors."""
        mock_list.side_effect = [
            stripe.error.RateLimitError("Too many requests"),
            MagicMock(auto_paging_iter=lambda: iter([])),
        ]

        reader = StripeChargesReader(api_key="sk_test_fake", max_retries=2)
        now = datetime.now(timezone.utc)  # timezone-aware, matching the first test
        list(reader.read(now, now))  # consume the iterator to trigger the API calls

        assert mock_list.call_count == 2  # retried once
```

### Integration Tests

Integration tests run the full pipeline end-to-end against real (containerized) infrastructure:

```python
# tests/integration/test_payments_pipeline.py
# Helpers such as is_responsive, seed_postgres, read_from_minio and the
# test_transactions fixture are not shown here; they are assumed to live in conftest.py.
import pytest
import docker
from datetime import datetime
from src.pipelines.payments import PaymentsIngestionPipeline

# Mark every test in this module so that `pytest -m integration` selects it.
pytestmark = pytest.mark.integration


@pytest.fixture(scope="session")
def docker_services(docker_ip, docker_services):
    """Start required Docker services for integration tests."""
    # Overrides the plugin-provided fixture of the same name; the available
    # method names depend on the pytest Docker plugin in use.
    docker_services.start("kafka")
    docker_services.start("postgres")
    docker_services.wait_until_responsive(
        timeout=60.0,
        pause=0.1,
        check=lambda: is_responsive("localhost", 9092),
    )
    return docker_services


@pytest.fixture
def pipeline(docker_services):
    """Create pipeline instance connected to test containers."""
    return PaymentsIngestionPipeline(
        source_config={"host": "localhost", "port": 5432, "db": "test_db"},
        sink_config={"bucket": "test-silver", "endpoint": "http://localhost:9000"},
    )


def test_pipeline_processes_valid_transactions(pipeline, test_transactions):
    """Integration test: pipeline reads, transforms, and writes transactions."""
    # Seed source data
    seed_postgres(test_transactions)

    # Run pipeline for test window
    result = pipeline.run(
        start=datetime(2024, 1, 15, 0, 0),
        end=datetime(2024, 1, 15, 1, 0),
    )

    # Verify output
    output = read_from_minio("test-silver", "transactions/2024-01-15/00/")
    assert len(output) == len(test_transactions)
    assert all("amount_usd" in record for record in output)
    assert result.records_written == len(test_transactions)
    assert result.dlq_records == 0


def test_pipeline_routes_invalid_records_to_dlq(pipeline):
    """Integration test: invalid records go to DLQ, valid records proceed."""
    mixed_records = [
        {"transaction_id": "txn_001", "amount": 100.0, "currency": "USD", ...},
        {"transaction_id": "txn_002", "amount": "N/A", "currency": "INVALID", ...},  # bad
        {"transaction_id": "txn_003", "amount": 50.0, "currency": "EUR", ...},
    ]
    seed_postgres(mixed_records)

    result = pipeline.run(
        start=datetime(2024, 1, 15, 0, 0),
        end=datetime(2024, 1, 15, 1, 0),
    )

    assert result.records_written == 2  # valid records
    assert result.dlq_records == 1  # bad record in DLQ
```
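
The `docker_services` fixture above calls an `is_responsive` helper that the excerpt does not define. One plausible shape for it, offered as an assumption and not as the package's actual helper, is a plain TCP connect check using only the standard library:

```python
import socket


def is_responsive(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

An open port only shows that something is listening. A broker or database may accept connections before it is ready to serve requests, so a stricter check (for example, running a trivial query) may be needed for slow-starting services.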

**docker-compose test configuration**

```yaml
# docker/docker-compose.test.yml
version: "3.9"
services:
  zookeeper:  # required by KAFKA_ZOOKEEPER_CONNECT below
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1  # single-broker test cluster
      KAFKA_LOG_RETENTION_MS: 60000  # short retention for tests
  postgres:
    image: postgres:15
    ports: ["5432:5432"]
    environment:
      POSTGRES_PASSWORD: test
      POSTGRES_DB: test_db
  minio:
    image: minio/minio:latest
    ports: ["9000:9000"]
    command: server /data
    environment:
      MINIO_ROOT_USER: test
      MINIO_ROOT_PASSWORD: testtest
```

### Data Quality Tests

Data quality tests run inline in the pipeline and also as a standalone test suite:

```python
# tests/quality/test_silver_transactions_quality.py
import pytest
import great_expectations as gx
# Project helpers, assumed to be provided by tests/fixtures.
from tests.fixtures import load_expectation_suite, load_fixture_as_dataframe


def test_silver_transaction_completeness():
    """All required fields are present and non-null."""
    df = load_fixture_as_dataframe("payments/silver/valid_transactions.parquet")
    suite = load_expectation_suite("silver_transactions")
    # How a validator is constructed varies between Great Expectations versions;
    # adapt this line to the version pinned in the project.
    validator = gx.Validator(batch=df, expectation_suite=suite)

    results = validator.validate()
    failed = [r for r in results.results if not r.success]

    assert not failed, "Quality checks failed:\n" + "\n".join(str(f) for f in failed)


def test_no_duplicate_transaction_ids():
    """Transaction IDs are unique in silver layer."""
    df = load_fixture_as_dataframe("payments/silver/valid_transactions.parquet")
    duplicate_count = df["transaction_id"].duplicated().sum()
    assert duplicate_count == 0, f"Found {duplicate_count} duplicate transaction IDs"


def test_amount_within_bounds():
    """All amounts are non-negative and below maximum threshold."""
    df = load_fixture_as_dataframe("payments/silver/valid_transactions.parquet")
    assert (df["amount"] >= 0).all(), "Negative amounts found"
    assert (df["amount"] <= 1_000_000).all(), "Amounts exceeding maximum found"
```
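
The tests above cover the standalone suite. The inline side, where quality checks act as a pipeline gate, could look roughly like the sketch below. It uses plain Python instead of Great Expectations or dbt, and every name in it (`GateResult`, `check_record`, `quality_gate`, the `max_failure_rate` threshold, and the specific rules) is a hypothetical illustration, not the package's API.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: list = field(default_factory=list)  # records allowed to continue
    dlq: list = field(default_factory=list)     # records routed to the dead-letter queue


def check_record(record: dict) -> list:
    """Return the rule violations for one record; an empty list means valid."""
    errors = []
    if not record.get("transaction_id"):
        errors.append("missing transaction_id")
    amount = record.get("amount_usd")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        errors.append("amount_usd is not numeric")
    elif abs(amount) > 1_000_000:  # negative values (refunds) are allowed
        errors.append("amount_usd out of bounds")
    return errors


def quality_gate(records: list, max_failure_rate: float = 0.05) -> GateResult:
    """Split records into passed/DLQ and fail the run if too many are invalid."""
    result = GateResult()
    for record in records:
        errors = check_record(record)
        if errors:
            result.dlq.append({**record, "_errors": errors})
        else:
            result.passed.append(record)
    if records and len(result.dlq) / len(records) > max_failure_rate:
        raise RuntimeError(
            f"quality gate failed: {len(result.dlq)} of {len(records)} records invalid"
        )
    return result
```

Routing individual bad records to the DLQ while failing the whole run above a threshold separates isolated bad rows from a systemic upstream problem.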

### Performance Tests

Performance tests catch throughput regressions before they reach production:

```python
# tests/performance/test_transform_performance.py
import time
import pytest
from src.transforms.payments.normalize import normalize_transaction_amounts
from tests.fixtures import generate_transactions


@pytest.mark.performance
class TestNormalizePerformance:
    RECORDS = 100_000
    MAX_SECONDS = 10.0  # must normalize 100K records in under 10 seconds

    def test_throughput(self):
        """normalize_transaction_amounts processes 100K records in under 10 seconds."""
        records = generate_transactions(self.RECORDS)

        start = time.perf_counter()
        result = normalize_transaction_amounts(records)
        elapsed = time.perf_counter() - start

        assert len(result) == self.RECORDS
        assert elapsed < self.MAX_SECONDS, (
            f"Performance regression: normalized {self.RECORDS} records in {elapsed:.2f}s "
            f"(max: {self.MAX_SECONDS}s)"
        )

    def test_memory_usage(self):
        """Transform does not exceed 500MB memory for 100K records."""
        import tracemalloc
        tracemalloc.start()

        records = generate_transactions(self.RECORDS)
        normalize_transaction_amounts(records)

        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()

        peak_mb = peak / 1024 / 1024
        assert peak_mb < 500, f"Memory usage {peak_mb:.1f}MB exceeds 500MB limit"
```

**Performance test benchmarks**

Run benchmarks with pytest-benchmark to track performance over time:

```python
from src.transforms.payments.normalize import normalize_transaction_amounts
from tests.fixtures import generate_transactions


def test_normalize_benchmark(benchmark):
    """Benchmark normalization transform for regression tracking."""
    records = generate_transactions(10_000)
    result = benchmark(normalize_transaction_amounts, records)
    assert len(result) == 10_000
```

### Test Coverage and CI Configuration

```toml
# pyproject.toml
[tool.pytest.ini_options]
testpaths = ["tests"]
markers = [
    "unit: fast unit tests",
    "integration: requires Docker services",
    "performance: long-running performance tests",
    "quality: data quality assertion tests",
]
addopts = "--strict-markers"

[tool.coverage.run]
source = ["src"]
branch = true
omit = ["src/dags/*"]  # DAG files excluded; coverage focused on transforms

[tool.coverage.report]
fail_under = 90
show_missing = true
```

```yaml
# .github/workflows/test.yml excerpt
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - run: pytest tests/unit -v --cov=src/transforms --cov-fail-under=90

  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - run: docker compose -f docker/docker-compose.test.yml up -d
      - run: pytest tests/integration -v -m integration
      - run: docker compose -f docker/docker-compose.test.yml down
        if: always()  # tear down containers even when tests fail
```

### Test Data Management Rules

1. Never use production data in tests; generate synthetic data or use anonymized fixtures
2. Test fixtures must be deterministic: any randomness must come from an explicitly seeded generator
3. Integration test databases must be isolated per test run; use unique schema or database names
4. Performance tests must always run on the same hardware class (the one CI uses) so that benchmarks are comparable across runs
5. DLQ assertions are mandatory: every integration test must assert the expected DLQ record count, not just the success path
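
Rules 1 and 2 can be combined in a seeded synthetic-data generator. The performance tests import `generate_transactions` from `tests.fixtures`, but its implementation is not shown, so the version below is only a guess at what it might look like; the `seed` parameter, the field names, and the value ranges are all assumptions.

```python
import random


def generate_transactions(count: int, seed: int = 42) -> list:
    """Generate `count` synthetic transactions, reproducible for a given seed."""
    rng = random.Random(seed)  # local generator: no dependence on global random state
    currencies = ("USD", "EUR", "GBP")
    return [
        {
            "transaction_id": f"txn_{i:08d}",
            "amount": round(rng.uniform(0.01, 5_000.0), 2),
            "currency": rng.choice(currencies),
        }
        for i in range(count)
    ]
```

Using a dedicated `random.Random(seed)` instance, instead of calling `random.seed()` globally, keeps the fixture deterministic even when other tests in the same process consume random numbers.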
@@ -0,0 +1,172 @@
---
name: ml-architecture
description: Training/serving architecture split, feature stores, model registry, online vs offline inference patterns, and ML system design decisions
topics: [ml, architecture, feature-store, model-registry, inference, serving, training]
---

ML systems have a fundamental architectural split that traditional software does not: the training system and the serving system are different codebases running on different infrastructure, yet they must agree on the exact same data transformations. When they disagree, the result is training-serving skew, the most common source of silent production bugs in ML. Designing the architecture to prevent skew (through shared feature stores, shared preprocessing libraries, and strict interface contracts) is the most important ML architecture decision.

## Summary

ML architecture separates training (batch, data-intensive, experimental) from serving (latency-sensitive, stateless, reliable). Feature stores eliminate training-serving skew by providing consistent feature computation for both. Model registries provide the governance layer: versioning, lineage, deployment gates, and rollback. Choose online vs. offline inference based on latency requirements and data freshness needs. Document every architectural decision as an ADR.

## Deep Guidance

### Training/Serving Architecture Split

The training system and serving system have fundamentally different requirements:

| Dimension | Training | Serving |
|-----------|----------|---------|
| Throughput | High (process TB of data) | High (handle thousands of RPS) |
| Latency | Not time-critical | P99 < 200ms |
| Consistency | Best-effort | Exact reproducibility |
| Infrastructure | Spot instances, GPU clusters | Reserved instances, autoscaling |
| State | Stateful (checkpoint, resume) | Stateless (each request independent) |
| Failures | Retry / resume from checkpoint | Circuit breaker, fallback |

The primary risk of this split is **training-serving skew**: the model was trained with feature X computed one way, but production computes feature X a slightly different way, leading to silent accuracy degradation.

**Prevention strategies**:
1. **Shared feature library**: Both training and serving import the same `src/features/` module for all feature computation. Never duplicate feature logic.
2. **Feature store**: Centralised store that computes features once and serves them to both training (historical) and serving (real-time) paths.
3. **Schema validation**: Validate model inputs at serving time against the schema seen during training.
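
Strategy 3 can be sketched as a small check that runs before every prediction. The schema contents, the `EXPECTED_SCHEMA` name, and the `validate_features` function are hypothetical; in practice the schema would be exported by the training run and not written by hand.

```python
# Hypothetical schema, as it might be exported alongside a trained model.
EXPECTED_SCHEMA = {
    "amount_usd": float,
    "merchant_category": str,
    "account_age_days": int,
}


def validate_features(features: dict, schema: dict = EXPECTED_SCHEMA) -> list:
    """Return a list of schema violations; an empty list means the input is valid."""
    problems = []
    for name, expected_type in schema.items():
        if name not in features:
            problems.append(f"missing feature: {name}")
        elif not isinstance(features[name], expected_type):
            actual = type(features[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    # Features the model never saw during training are also reported.
    for name in sorted(set(features) - set(schema)):
        problems.append(f"unexpected feature: {name}")
    return problems
```

The check is deliberately strict: an `int` supplied where training saw a `float` is reported, because implicit type differences are one of the ways skew creeps in. Whether to reject the request or fall back to a default prediction on violation is a separate design decision.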

### Feature Store Architecture

A feature store is a data infrastructure component that stores and serves pre-computed features:

```
                    ┌──────────────────┐
Data Sources ──────►│ Feature Pipeline │
                    └────────┬─────────┘
                             │ compute features
                    ┌────────▼─────────┐
                    │  Feature Store   │
                    │  ┌───────────┐   │
                    │  │  Offline  │   │──► Training Data
                    │  │ (S3/GCS)  │   │
                    │  ├───────────┤   │
                    │  │  Online   │   │──► Serving (low-latency)
                    │  │  (Redis)  │   │
                    │  └───────────┘   │
                    └──────────────────┘
```

**Offline store**: Historical features for training. Backed by object storage (S3, GCS) or a columnar database (BigQuery, Snowflake). Supports point-in-time correct feature retrieval (no data leakage).

**Online store**: Low-latency feature serving for real-time inference. Backed by Redis, DynamoDB, or Cassandra. Stores only the latest feature values.

**Implementations**: Feast (open source), Tecton, Vertex AI Feature Store, SageMaker Feature Store.

A feature store is justified when:
- Multiple models use the same features (DRY for features)
- Training-serving skew is causing production issues
- Feature computation is expensive and should be shared
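
The point-in-time correct retrieval mentioned for the offline store means that a training example stamped at time `t` may only see feature values written at or before `t`. A minimal in-memory sketch of that lookup follows; the `point_in_time_value` function and the integer timestamps are illustrative assumptions, and real feature stores implement this as a join over large tables.

```python
from bisect import bisect_right


def point_in_time_value(history: list, as_of):
    """Return the latest feature value written at or before `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp ascending.
    Returns None when no value existed yet at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)  # count of entries with ts <= as_of
    return history[idx - 1][1] if idx else None


# Feature values for one entity, keyed by write time (hypothetical data).
history = [(1, 10.0), (5, 12.5), (9, 20.0)]
value_at_label_time = point_in_time_value(history, as_of=6)  # uses the write at t=5
```

Looking up the latest value overall (here, the one written at `t=9`) for a label observed at `t=6` would leak future information into training, which is the data leakage the offline store is meant to prevent.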

### Model Registry

The model registry is the governance layer between training and production. Every model that has ever been trained should have a registry record:

```
Training Run ──► Model Artifacts ──► Registry ──► Deployment
                 (weights, config)   (metadata,   (staging,
                                      lineage)     production)
```

**Registry metadata per model version**:
- Model name and semantic version (`fraud-detector-v2.3.1`)
- Training run ID and commit SHA (full lineage)
- Training metrics (AUC, F1, etc.)
- Evaluation metrics on holdout sets
- Training dataset version
- Model schema (input/output feature names and types)
- Serving requirements (runtime, memory, GPU)
- Promotion history (who promoted, when, why)

**Lifecycle stages**:
```
None → Staging → Production → Archived
```

- Validation gates control promotion: automated tests (accuracy thresholds, latency budgets) plus optional human approval
- Production always has at least one previous version for rollback
- Archive on a schedule (keep 6 months of production versions)
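
A validation gate of the kind described above can be expressed as a function that returns the reasons a candidate may not be promoted. This is a sketch under stated assumptions: `CandidateModel`, `promotion_blockers`, and the default thresholds are hypothetical and do not correspond to the API of MLflow or any other registry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CandidateModel:
    name: str
    version: str
    holdout_auc: float
    p99_latency_ms: float


def promotion_blockers(
    candidate: CandidateModel,
    min_auc: float = 0.90,              # hypothetical accuracy threshold
    max_p99_latency_ms: float = 200.0,  # hypothetical latency budget
    require_approval: bool = False,
    approved_by: Optional[str] = None,
) -> list:
    """Return the reasons the candidate cannot move to Production (empty if none)."""
    blockers = []
    if candidate.holdout_auc < min_auc:
        blockers.append(f"holdout AUC {candidate.holdout_auc:.3f} below {min_auc:.3f}")
    if candidate.p99_latency_ms > max_p99_latency_ms:
        blockers.append(
            f"P99 latency {candidate.p99_latency_ms:.0f}ms exceeds {max_p99_latency_ms:.0f}ms"
        )
    if require_approval and not approved_by:
        blockers.append("human approval required")
    return blockers
```

Returning every blocker, instead of stopping at the first failure, gives the promotion history a complete record of why a version was held back.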

**Implementations**: MLflow Model Registry, Weights & Biases Model Registry, SageMaker Model Registry.

### Online vs. Offline Inference

**Online inference** (real-time, synchronous):
- Model returns predictions in response to a request, typically within 100–500ms
- Examples: fraud scoring at checkout, recommendation on page load, search ranking
- Infrastructure: Model server (TorchServe, Triton, BentoML), autoscaled behind a load balancer
- Key considerations: latency budget, model size (must fit in serving memory), cold start time

**Offline inference** (batch, asynchronous):
- Model scores a large dataset on a schedule, stores predictions for later retrieval
- Examples: churn prediction for all users (run nightly), content pre-scoring for a recommendation cache
- Infrastructure: Spark job, Airflow DAG, Ray cluster, or simple Python script on a large VM
- Key considerations: throughput (records/second), data pipeline integration, prediction freshness

**Near-real-time / stream inference**:
- Model scores events from a stream (Kafka, Kinesis) with seconds-to-minutes latency
- Examples: anomaly detection on clickstream, session-level personalisation
- Infrastructure: Kafka consumer + model inference worker, Flink ML, or Spark Structured Streaming
- Key considerations: exactly-once semantics, ordering guarantees, backpressure handling

**Decision matrix**:

| Use Case | Latency Requirement | Data Freshness | Approach |
|----------|---------------------|----------------|----------|
| Checkout fraud | < 500ms | Real-time | Online inference |
| Churn prediction | N/A | Daily | Offline batch |
| Email personalisation | N/A (send time) | Daily | Offline batch |
| Feed ranking | < 200ms | Real-time | Online inference |
| Anomaly detection | < 30 seconds | Streaming | Stream inference |
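
The matrix above can be read as a rule keyed on the latency requirement. The sketch below encodes that reading; the function name and the one-second boundary between online and stream inference are assumptions chosen to match the table's rows, not thresholds stated in the source.

```python
from typing import Optional


def choose_inference_approach(max_latency_seconds: Optional[float]) -> str:
    """Map a latency requirement to an inference approach.

    `None` means there is no request-time latency requirement.
    """
    if max_latency_seconds is None:
        return "offline batch"
    if max_latency_seconds < 1.0:  # assumed boundary: sub-second needs a model server
        return "online inference"
    return "stream inference"
```

Data freshness also matters in practice: a use case with no latency requirement but a need for fresh features may still justify stream inference, so this rule is a starting point and not a substitute for the full matrix.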

### ML System Design Patterns

**Lambda Architecture for ML**:
- Batch layer: nightly model retraining or batch scoring
- Speed layer: real-time model updates or online scoring
- Serving layer: unified API serving precomputed (batch) + real-time predictions
- Complexity cost is high, so evaluate whether the freshness gain justifies it

**Two-Stage Retrieval and Ranking** (recommendation systems):
- Candidate generation stage: fast approximate nearest-neighbour retrieval (ANN index), commonly over embeddings from a two-tower (user/item) model
- Ranking stage: expensive full model scoring of top-K candidates
- Separates recall (retrieve thousands of candidates quickly) from precision (rank them accurately)

**Shadow Mode Deployment**:
- New model runs in parallel with production model, receiving real traffic
- New model's predictions are not served but are logged and evaluated
- Safe way to validate new models with real data before full deployment
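
Shadow mode, as described in the bullets above, can be sketched as a wrapper around the two models. The `predict_with_shadow` function and its `log` parameter are hypothetical names, and models are represented as plain callables for brevity.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("shadow")


def predict_with_shadow(
    features: Any,
    production_model: Callable[[Any], Any],
    shadow_model: Callable[[Any], Any],
    log: Callable[..., None] = logger.info,
) -> Any:
    """Serve the production prediction; run the shadow model only for logging."""
    served = production_model(features)
    try:
        shadowed = shadow_model(features)
        log("shadow_comparison features=%r served=%r shadow=%r", features, served, shadowed)
    except Exception:
        # A failing shadow model must never affect the served response.
        logger.exception("shadow model failed; production response unaffected")
    return served
```

This version calls the shadow model synchronously, which adds its latency to every request. A production setup would more likely mirror traffic asynchronously (for example, through a queue) so that the shadow model cannot slow down serving.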

### Architecture Decision Record Template for ML

```markdown
# ADR-ML-001: Online vs. Offline Inference for Recommendation

## Status
Accepted, 2024-03-15

## Context
Product recommends items to users on homepage. 5M daily active users.
Data team produces updated user embeddings daily. Item catalog updates hourly.

## Decision
Use offline batch inference for recommendation.
- Nightly batch job scores all user-item pairs for top 100 candidates
- Recommendations stored in Redis with TTL of 26 hours
- API reads from Redis, with zero model inference at request time

## Consequences
- P99 homepage latency drops from 450ms to 20ms (Redis lookup vs. model inference)
- Recommendations are up to 26 hours stale, which is acceptable given content update frequency
- Requires Redis cluster (operational cost)
- Model update cycle is daily, so it cannot react to same-session behaviour

## Alternatives Rejected
- Online inference: latency budget exceeded (model P99 = 380ms)
- Streaming inference: engineering complexity not justified given daily embedding update cadence
```