@zigrivers/scaffold 3.8.0 → 3.9.1

Files changed (70)
  1. package/README.md +73 -8
  2. package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
  3. package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
  4. package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
  5. package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
  6. package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
  7. package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
  8. package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
  9. package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
  10. package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
  11. package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
  12. package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
  13. package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
  14. package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
  15. package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
  16. package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
  17. package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
  18. package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
  19. package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
  20. package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
  21. package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
  22. package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
  23. package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
  24. package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
  25. package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
  26. package/content/knowledge/ml/ml-architecture.md +172 -0
  27. package/content/knowledge/ml/ml-conventions.md +209 -0
  28. package/content/knowledge/ml/ml-dev-environment.md +299 -0
  29. package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
  30. package/content/knowledge/ml/ml-model-evaluation.md +256 -0
  31. package/content/knowledge/ml/ml-observability.md +253 -0
  32. package/content/knowledge/ml/ml-project-structure.md +216 -0
  33. package/content/knowledge/ml/ml-requirements.md +138 -0
  34. package/content/knowledge/ml/ml-security.md +188 -0
  35. package/content/knowledge/ml/ml-serving-patterns.md +243 -0
  36. package/content/knowledge/ml/ml-testing.md +301 -0
  37. package/content/knowledge/ml/ml-training-patterns.md +269 -0
  38. package/content/methodology/browser-extension-overlay.yml +82 -0
  39. package/content/methodology/data-pipeline-overlay.yml +70 -0
  40. package/content/methodology/ml-overlay.yml +70 -0
  41. package/dist/cli/commands/init.d.ts +13 -0
  42. package/dist/cli/commands/init.d.ts.map +1 -1
  43. package/dist/cli/commands/init.js +122 -2
  44. package/dist/cli/commands/init.js.map +1 -1
  45. package/dist/cli/commands/init.test.js +120 -0
  46. package/dist/cli/commands/init.test.js.map +1 -1
  47. package/dist/config/schema.d.ts +864 -48
  48. package/dist/config/schema.d.ts.map +1 -1
  49. package/dist/config/schema.js +53 -0
  50. package/dist/config/schema.js.map +1 -1
  51. package/dist/config/schema.test.js +166 -3
  52. package/dist/config/schema.test.js.map +1 -1
  53. package/dist/core/assembly/overlay-loader.test.js +33 -0
  54. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  55. package/dist/e2e/project-type-overlays.test.d.ts +2 -2
  56. package/dist/e2e/project-type-overlays.test.js +499 -33
  57. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  58. package/dist/types/config.d.ts +10 -1
  59. package/dist/types/config.d.ts.map +1 -1
  60. package/dist/wizard/questions.d.ts +17 -1
  61. package/dist/wizard/questions.d.ts.map +1 -1
  62. package/dist/wizard/questions.js +75 -1
  63. package/dist/wizard/questions.js.map +1 -1
  64. package/dist/wizard/questions.test.js +167 -0
  65. package/dist/wizard/questions.test.js.map +1 -1
  66. package/dist/wizard/wizard.d.ts +13 -0
  67. package/dist/wizard/wizard.d.ts.map +1 -1
  68. package/dist/wizard/wizard.js +17 -1
  69. package/dist/wizard/wizard.js.map +1 -1
  70. package/package.json +1 -1
@@ -0,0 +1,406 @@
1
+ ---
2
+ name: data-pipeline-testing
3
+ description: Unit tests for transforms, integration tests, data quality tests, and performance tests for data pipelines
4
+ topics: [data-pipeline, testing, unit-tests, integration-tests, data-quality-tests, performance-tests, tdd, pytest]
5
+ ---
6
+
7
+ Data pipeline testing requires a layered strategy: fast unit tests for transformation logic, integration tests against containerized infrastructure, data quality tests that run inline in production, and performance tests to catch throughput regressions. The most common testing gap in pipeline projects is that DAG files and orchestration logic are tested but transformation functions — the business logic — are not. Transformation functions must be the most heavily tested component because correctness of results depends entirely on them.
8
+
9
+ ## Summary
10
+
11
+ Test transformation functions as pure functions with unit tests: no infrastructure, no side effects, fast execution. Test pipeline end-to-end integration against containerized Kafka, Postgres, and object storage. Run data quality tests inline as pipeline gates using Great Expectations or dbt tests. Test performance against a realistic data volume to catch throughput regressions before they reach production. Target 90%+ coverage on `src/transforms/`.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Unit Tests for Transforms
16
+
17
+ Transformation functions are pure functions — they take data in, return data out. This makes them trivially unit-testable:
18
+
+ ```python
+ # src/transforms/payments/normalize.py
+ def normalize_transaction_amounts(records: list[dict]) -> list[dict]:
+     """Normalize all transaction amounts to USD using embedded exchange rates."""
+     result = []
+     for record in records:
+         normalized = {**record}
+         normalized["amount_usd"] = convert_to_usd(
+             record["amount"],
+             record.get("currency", "USD"),
+         )
+         result.append(normalized)
+     return result
+ ```
33
+
34
+ ```python
35
+ # tests/unit/transforms/payments/test_normalize.py
36
+ import pytest
37
+ from src.transforms.payments.normalize import normalize_transaction_amounts
38
+
39
+ class TestNormalizeTransactionAmounts:
40
+ def test_usd_passthrough(self):
41
+ """USD transactions are not converted."""
42
+ records = [{"amount": 100.0, "currency": "USD"}]
43
+ result = normalize_transaction_amounts(records)
44
+ assert result[0]["amount_usd"] == 100.0
45
+
46
+ def test_eur_conversion(self):
47
+ """EUR amounts are converted to USD."""
48
+ records = [{"amount": 100.0, "currency": "EUR"}]
49
+ result = normalize_transaction_amounts(records)
50
+ assert result[0]["amount_usd"] > 0
51
+ assert result[0]["amount_usd"] != 100.0 # was converted
52
+
53
+ def test_missing_currency_defaults_to_usd(self):
54
+ """Records without currency field default to USD."""
55
+ records = [{"amount": 50.0}]
56
+ result = normalize_transaction_amounts(records)
57
+ assert result[0]["amount_usd"] == 50.0
58
+
59
+ def test_original_record_unchanged(self):
60
+ """Transform does not mutate input records."""
61
+ original = {"amount": 100.0, "currency": "EUR"}
62
+ records = [original]
63
+ normalize_transaction_amounts(records)
64
+ assert original == {"amount": 100.0, "currency": "EUR"}
65
+
66
+ def test_empty_input_returns_empty(self):
67
+ """Empty list returns empty list."""
68
+ assert normalize_transaction_amounts([]) == []
69
+
70
+ def test_preserves_all_original_fields(self):
71
+ """Additional fields on input records are preserved in output."""
72
+ records = [{"amount": 100.0, "currency": "USD", "transaction_id": "txn_abc"}]
73
+ result = normalize_transaction_amounts(records)
74
+ assert result[0]["transaction_id"] == "txn_abc"
75
+
76
+ @pytest.mark.parametrize("amount,currency,expected_range", [
77
+ (0.0, "USD", (0.0, 0.0)),
78
+ (1_000_000.0, "USD", (1_000_000.0, 1_000_000.0)),
79
+ (-1.0, "USD", (-1.0, -1.0)), # negative amounts should be preserved for refunds
80
+ ])
81
+ def test_boundary_amounts(self, amount, currency, expected_range):
82
+ """Test boundary amount values."""
83
+ records = [{"amount": amount, "currency": currency}]
84
+ result = normalize_transaction_amounts(records)
85
+ low, high = expected_range
86
+ assert low <= result[0]["amount_usd"] <= high
87
+ ```
88
+
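The transform under test calls `convert_to_usd`, which this file does not show. A minimal sketch of what such a helper might look like, assuming a hypothetical embedded rate table (`USD_RATES`, filled with placeholder values that are not real exchange rates) and a hypothetical `UnknownCurrencyError`:

```python
# Hypothetical helper: the source references convert_to_usd but does not define it.
# The rates below are placeholders for illustration, not real exchange rates.
USD_RATES: dict[str, float] = {"USD": 1.0, "EUR": 1.10, "GBP": 1.25}


class UnknownCurrencyError(ValueError):
    """Raised when no rate is configured for a currency code."""


def convert_to_usd(amount: float, currency: str) -> float:
    """Convert an amount to USD using the embedded rate table."""
    try:
        rate = USD_RATES[currency.upper()]
    except KeyError:
        raise UnknownCurrencyError(f"no exchange rate for {currency!r}") from None
    return amount * rate


def normalize_transaction_amounts(records: list[dict]) -> list[dict]:
    """Same shape as the transform above: copy each record and add amount_usd."""
    return [
        {**r, "amount_usd": convert_to_usd(r["amount"], r.get("currency", "USD"))}
        for r in records
    ]
```

Raising on an unknown currency, rather than silently passing the amount through, gives the test suite an explicit failure mode to assert against.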
89
+ **Transform test patterns**
90
+
91
+ Test these scenarios for every transform function:
92
+ 1. Happy path: valid input produces expected output
93
+ 2. Empty input: empty list returns empty list
94
+ 3. Null/missing optional fields: handled gracefully with documented defaults
95
+ 4. Null/missing required fields: raises expected exception
96
+ 5. Boundary values: zero, negative, very large numbers
97
+ 6. Input immutability: input records are not mutated
98
+ 7. Edge cases specific to the business domain
99
+
100
+ **Fixture-based tests**
101
+
102
+ For complex transformations, use fixture files:
103
+
+ ```python
+ # tests/unit/transforms/payments/test_aggregate.py
+ import json
+ from pathlib import Path
+
+ import pytest  # needed for pytest.approx below
+
+ from src.transforms.payments.aggregate import compute_daily_revenue
+
+ def load_fixture(name: str) -> list[dict]:
+     path = Path(__file__).parent.parent.parent / "fixtures" / name
+     return json.loads(path.read_text())
+
+ def test_daily_revenue_aggregation():
+     """Daily revenue correctly sums transactions by currency."""
+     transactions = load_fixture("payments/raw/valid_transactions.json")
+     expected = load_fixture("payments/expected/aggregated_revenue.json")
+
+     result = compute_daily_revenue(transactions, date="2024-01-15")
+
+     assert result["date"] == "2024-01-15"
+     assert result["total_usd"] == pytest.approx(expected["total_usd"], rel=0.001)
+     assert result["transaction_count"] == expected["transaction_count"]
+ ```
125
+
126
+ ### Unit Tests for Sources and Sinks
127
+
128
+ Sources and sinks interact with external systems. Test them using mocks or test doubles:
129
+
+ ```python
+ # tests/unit/sources/test_stripe_reader.py
+ from datetime import datetime, timezone
+ from unittest.mock import MagicMock, patch
+
+ import stripe  # needed for stripe.error.RateLimitError below
+
+ from src.sources.stripe.charges import StripeChargesReader
+
+ class TestStripeChargesReader:
+     @patch("src.sources.stripe.charges.stripe.Charge.list")
+     def test_reads_charges_for_time_window(self, mock_list):
+         """Reader requests charges within the specified time window."""
+         mock_list.return_value = MagicMock(
+             auto_paging_iter=lambda: iter([
+                 {"id": "ch_001", "amount": 1000, "currency": "usd", "created": 1705329825},
+             ])
+         )
+
+         reader = StripeChargesReader(api_key="sk_test_fake")
+         start = datetime(2024, 1, 15, 0, 0, tzinfo=timezone.utc)
+         end = datetime(2024, 1, 15, 1, 0, tzinfo=timezone.utc)
+
+         records = list(reader.read(start, end))
+
+         mock_list.assert_called_once_with(
+             created={"gte": int(start.timestamp()), "lt": int(end.timestamp())},
+             limit=100,
+         )
+         assert len(records) == 1
+         assert records[0]["id"] == "ch_001"
+
+     @patch("src.sources.stripe.charges.stripe.Charge.list")
+     def test_handles_rate_limit_with_retry(self, mock_list):
+         """Reader retries on rate limit errors."""
+         mock_list.side_effect = [
+             stripe.error.RateLimitError("Too many requests"),
+             MagicMock(auto_paging_iter=lambda: iter([])),
+         ]
+
+         reader = StripeChargesReader(api_key="sk_test_fake", max_retries=2)
+         now = datetime.now(timezone.utc)  # timezone-aware, matching the test above
+         records = list(reader.read(now, now))
+
+         assert mock_list.call_count == 2  # retried once
+ ```
172
+
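The second test above expects the reader to call the API again after a rate-limit error. A minimal sketch of retry logic that would satisfy that kind of test, assuming a hypothetical `RateLimitError` and `call_with_retry` helper (neither is part of the Stripe SDK nor shown in the source); the sleep function is injected so tests run without real delays:

```python
import time
from typing import Callable, TypeVar
from unittest.mock import MagicMock

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for an API client's rate-limit exception."""


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable when max_retries >= 0")


# Test double: fails once, then succeeds, mirroring mock_list.side_effect above.
flaky = MagicMock(side_effect=[RateLimitError("Too many requests"), ["ch_001"]])
delays: list[float] = []
result = call_with_retry(flaky, max_retries=2, sleep=delays.append)
```

Passing `delays.append` as the sleep function records the backoff schedule, so a test can assert on it without waiting.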
173
+ ### Integration Tests
174
+
175
+ Integration tests run the full pipeline end-to-end against real (containerized) infrastructure:
176
+
+ ```python
+ # tests/integration/test_payments_pipeline.py
+ # is_responsive, seed_postgres, read_from_minio, and the test_transactions
+ # fixture are project helpers that are not shown in this excerpt.
+ import pytest
+ import docker
+ from datetime import datetime
+ from src.pipelines.payments import PaymentsIngestionPipeline
+
+ @pytest.fixture(scope="session")
+ def docker_services(docker_ip, docker_services):
+     """Start required Docker services for integration tests."""
+     docker_services.start("kafka")
+     docker_services.start("postgres")
+     docker_services.wait_until_responsive(
+         timeout=60.0,
+         pause=0.1,
+         check=lambda: is_responsive("localhost", 9092),
+     )
+     return docker_services
+
+ @pytest.fixture
+ def pipeline(docker_services):
+     """Create pipeline instance connected to test containers."""
+     return PaymentsIngestionPipeline(
+         source_config={"host": "localhost", "port": 5432, "db": "test_db"},
+         sink_config={"bucket": "test-silver", "endpoint": "http://localhost:9000"},
+     )
+
+ def test_pipeline_processes_valid_transactions(pipeline, test_transactions):
+     """Integration test: pipeline reads, transforms, and writes transactions."""
+     # Seed source data
+     seed_postgres(test_transactions)
+
+     # Run pipeline for test window
+     result = pipeline.run(
+         start=datetime(2024, 1, 15, 0, 0),
+         end=datetime(2024, 1, 15, 1, 0),
+     )
+
+     # Verify output
+     output = read_from_minio("test-silver", "transactions/2024-01-15/00/")
+     assert len(output) == len(test_transactions)
+     assert all("amount_usd" in record for record in output)
+     assert result.records_written == len(test_transactions)
+     assert result.dlq_records == 0
+
+ def test_pipeline_routes_invalid_records_to_dlq(pipeline):
+     """Integration test: invalid records go to DLQ, valid records proceed."""
+     mixed_records = [
+         {"transaction_id": "txn_001", "amount": 100.0, "currency": "USD", ...},
+         {"transaction_id": "txn_002", "amount": "N/A", "currency": "INVALID", ...},  # bad
+         {"transaction_id": "txn_003", "amount": 50.0, "currency": "EUR", ...},
+     ]
+     seed_postgres(mixed_records)
+
+     result = pipeline.run(
+         start=datetime(2024, 1, 15, 0, 0),
+         end=datetime(2024, 1, 15, 1, 0),
+     )
+
+     assert result.records_written == 2  # valid records
+     assert result.dlq_records == 1  # bad record in DLQ
+ ```
239
+
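The `docker_services` fixture above polls `is_responsive("localhost", 9092)`, which the excerpt does not define. One possible sketch, assuming a plain TCP connect is an acceptable readiness signal (it only shows the port is open, not that the broker is fully ready to serve requests):

```python
import socket


def is_responsive(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For services that accept connections before they are usable, a protocol-level check (for example, a trivial query against Postgres) is likely a more reliable signal than an open port.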
240
+ **docker-compose test configuration**
241
+
+ ```yaml
+ # docker/docker-compose.test.yml
+ version: "3.9"
+ services:
+   zookeeper:  # required by KAFKA_ZOOKEEPER_CONNECT below
+     image: confluentinc/cp-zookeeper:7.5.0
+     environment:
+       ZOOKEEPER_CLIENT_PORT: 2181
+   kafka:
+     image: confluentinc/cp-kafka:7.5.0
+     depends_on: [zookeeper]
+     ports: ["9092:9092"]
+     environment:
+       KAFKA_BROKER_ID: 1
+       KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
+       KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
+       KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1  # single-broker test cluster
+       KAFKA_LOG_RETENTION_MS: 60000  # short retention for tests
+   postgres:
+     image: postgres:15
+     ports: ["5432:5432"]
+     environment:
+       POSTGRES_PASSWORD: test
+       POSTGRES_DB: test_db
+   minio:
+     image: minio/minio:latest
+     ports: ["9000:9000"]
+     command: server /data
+     environment:
+       MINIO_ROOT_USER: test
+       MINIO_ROOT_PASSWORD: testtest
+ ```
268
+
269
+ ### Data Quality Tests
270
+
271
+ Data quality tests run inline in the pipeline and also as a standalone test suite:
272
+
+ ```python
+ # tests/quality/test_silver_transactions_quality.py
+ import pytest
+ import great_expectations as gx
+ # Project helpers, assumed to live in tests/fixtures alongside load_fixture.
+ from tests.fixtures import load_expectation_suite, load_fixture_as_dataframe
+
+ def test_silver_transaction_completeness():
+     """All required fields are present and non-null."""
+     df = load_fixture_as_dataframe("payments/silver/valid_transactions.parquet")
+     suite = load_expectation_suite("silver_transactions")
+     # Validator construction differs between Great Expectations releases;
+     # adjust this call to the installed version.
+     validator = gx.Validator(batch=df, expectation_suite=suite)
+
+     results = validator.validate()
+     failed = [r for r in results.results if not r.success]
+
+     assert not failed, "Quality checks failed:\n" + "\n".join(str(f) for f in failed)
+
+ def test_no_duplicate_transaction_ids():
+     """Transaction IDs are unique in silver layer."""
+     df = load_fixture_as_dataframe("payments/silver/valid_transactions.parquet")
+     duplicate_count = df["transaction_id"].duplicated().sum()
+     assert duplicate_count == 0, f"Found {duplicate_count} duplicate transaction IDs"
+
+ def test_amount_within_bounds():
+     """All amounts are non-negative and below maximum threshold."""
+     df = load_fixture_as_dataframe("payments/silver/valid_transactions.parquet")
+     assert (df["amount"] >= 0).all(), "Negative amounts found"
+     assert (df["amount"] <= 1_000_000).all(), "Amounts exceeding maximum found"
+ ```
302
+
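The tests above cover the standalone suite; the same checks are also meant to run inline as a pipeline gate. A minimal sketch of such a gate in plain Python, assuming hypothetical names (`check_record`, `quality_gate`, `GateResult`) and a simplified rule set (required fields, numeric amount, unique `transaction_id`). A real pipeline would more likely delegate the rules to Great Expectations or dbt tests, as the summary suggests:

```python
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_FIELDS = ("transaction_id", "amount", "currency")  # assumed schema


@dataclass
class GateResult:
    passed: list[dict] = field(default_factory=list)
    dlq: list[tuple[dict, str]] = field(default_factory=list)  # (record, reason)


def check_record(record: dict, seen_ids: set[str]) -> Optional[str]:
    """Return a failure reason, or None if the record passes every check."""
    for name in REQUIRED_FIELDS:
        if record.get(name) is None:
            return f"missing required field: {name}"
    amount = record["amount"]
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return "amount is not numeric"
    if record["transaction_id"] in seen_ids:
        return "duplicate transaction_id"
    return None


def quality_gate(records: list[dict]) -> GateResult:
    """Split records into those that pass and those routed to the DLQ."""
    result = GateResult()
    seen_ids: set[str] = set()
    for record in records:
        reason = check_record(record, seen_ids)
        if reason is None:
            seen_ids.add(record["transaction_id"])
            result.passed.append(record)
        else:
            result.dlq.append((record, reason))
    return result
```

Keeping the failure reason next to each rejected record makes DLQ assertions in integration tests specific: a test can check why a record was rejected, not only how many were.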
303
+ ### Performance Tests
304
+
305
+ Performance tests catch throughput regressions before they reach production:
306
+
+ ```python
+ # tests/performance/test_transform_performance.py
+ import time
+ import pytest
+ from src.transforms.payments.normalize import normalize_transaction_amounts
+ from tests.fixtures import generate_transactions
+
+ @pytest.mark.performance
+ class TestNormalizePerformance:
+     RECORDS = 100_000
+     MAX_SECONDS = 10.0  # must normalize 100K records in under 10 seconds
+
+     def test_throughput(self):
+         """normalize_transaction_amounts processes 100K records in under 10 seconds."""
+         records = generate_transactions(self.RECORDS)
+
+         start = time.perf_counter()
+         result = normalize_transaction_amounts(records)
+         elapsed = time.perf_counter() - start
+
+         assert len(result) == self.RECORDS
+         assert elapsed < self.MAX_SECONDS, (
+             f"Performance regression: normalized {self.RECORDS} records in {elapsed:.2f}s "
+             f"(max: {self.MAX_SECONDS}s)"
+         )
+
+     def test_memory_usage(self):
+         """Transform does not exceed 500MB memory for 100K records."""
+         import tracemalloc
+         tracemalloc.start()
+
+         records = generate_transactions(self.RECORDS)
+         normalize_transaction_amounts(records)
+
+         current, peak = tracemalloc.get_traced_memory()
+         tracemalloc.stop()
+
+         peak_mb = peak / 1024 / 1024
+         assert peak_mb < 500, f"Memory usage {peak_mb:.1f}MB exceeds 500MB limit"
+ ```
347
+
348
+ **Performance test benchmarks**
349
+
350
+ Run benchmarks with pytest-benchmark to track performance over time:
351
+
+ ```python
+ from src.transforms.payments.normalize import normalize_transaction_amounts
+ from tests.fixtures import generate_transactions
+
+ def test_normalize_benchmark(benchmark):
+     """Benchmark normalization transform for regression tracking."""
+     records = generate_transactions(10_000)
+     result = benchmark(normalize_transaction_amounts, records)
+     assert len(result) == 10_000
+ ```
359
+
360
+ ### Test Coverage and CI Configuration
361
+
362
+ ```toml
363
+ # pyproject.toml
364
+ [tool.pytest.ini_options]
365
+ testpaths = ["tests"]
366
+ markers = [
367
+ "unit: fast unit tests",
368
+ "integration: requires Docker services",
369
+ "performance: long-running performance tests",
370
+ "quality: data quality assertion tests",
371
+ ]
372
+ addopts = "--strict-markers"
373
+
374
+ [tool.coverage.run]
375
+ source = ["src"]
376
+ branch = true
377
+ omit = ["src/dags/*"] # DAG files excluded; coverage focused on transforms
378
+
379
+ [tool.coverage.report]
380
+ fail_under = 90
381
+ show_missing = true
382
+ ```
383
+
+ ```yaml
+ # .github/workflows/test.yml excerpt
+ jobs:
+   unit-tests:
+     runs-on: ubuntu-latest
+     steps:
+       - run: pytest tests/unit -v --cov=src/transforms --cov-fail-under=90
+
+   integration-tests:
+     runs-on: ubuntu-latest
+     steps:
+       - run: docker compose -f docker/docker-compose.test.yml up -d
+       - run: pytest tests/integration -v -m integration
+       - if: always()  # tear down containers even when the tests fail
+         run: docker compose -f docker/docker-compose.test.yml down
+ ```
399
+
400
+ ### Test Data Management Rules
401
+
402
+ 1. Never use production data in tests — generate synthetic data or use anonymized fixtures
403
+ 2. Test fixtures must be deterministic: any randomly generated data must come from an explicitly seeded generator
404
+ 3. Integration test databases must be isolated per test run — use unique schema or database names
405
+ 4. Performance tests must always run on the same hardware class (the one CI uses), because timings from different machines are not comparable
406
+ 5. DLQ assertions are mandatory: every integration test must assert the expected DLQ record count, not just the success path
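Rules 1 and 2 can be met together by a seeded synthetic generator. The performance tests above import `generate_transactions` from `tests.fixtures` without showing it; the following is one possible sketch, assuming a `seed` parameter and a currency list that are not in the source:

```python
import random

CURRENCIES = ("USD", "EUR", "GBP")  # assumed set of currencies


def generate_transactions(count: int, seed: int = 42) -> list[dict]:
    """Generate synthetic transactions; the same seed yields the same data."""
    rng = random.Random(seed)  # local generator: global random state is untouched
    return [
        {
            "transaction_id": f"txn_{i:08d}",
            "amount": round(rng.uniform(0.0, 10_000.0), 2),
            "currency": rng.choice(CURRENCIES),
        }
        for i in range(count)
    ]
```

Using a local `random.Random` instance rather than the module-level functions keeps the fixture reproducible even when other tests in the same process consume random numbers.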
@@ -0,0 +1,172 @@
1
+ ---
2
+ name: ml-architecture
3
+ description: Training/serving architecture split, feature stores, model registry, online vs offline inference patterns, and ML system design decisions
4
+ topics: [ml, architecture, feature-store, model-registry, inference, serving, training]
5
+ ---
6
+
7
+ ML systems have a fundamental architectural split that traditional software does not: the training system and the serving system are different codebases running on different infrastructure, yet they must apply exactly the same data transformations. When the two diverge, the result is training-serving skew, the most common source of silent production bugs in ML. Designing the architecture to prevent skew (through shared feature stores, shared preprocessing libraries, and strict interface contracts) is the most important ML architecture decision.
8
+
9
+ ## Summary
10
+
11
+ ML architecture separates training (batch, data-intensive, experimental) from serving (latency-sensitive, stateless, reliable). Feature stores eliminate training-serving skew by providing consistent feature computation for both. Model registries provide the governance layer: versioning, lineage, deployment gates, and rollback. Choose online vs. offline inference based on latency requirements and data freshness needs. Document every architectural decision as an ADR.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Training/Serving Architecture Split
16
+
17
+ The training system and serving system have fundamentally different requirements:
18
+
19
+ | Dimension | Training | Serving |
20
+ |-----------|----------|---------|
21
+ | Throughput | High (process TB of data) | High (handle thousands of RPS) |
22
+ | Latency | Not time-critical | P99 < 200ms |
23
+ | Consistency | Best-effort | Exact reproducibility |
24
+ | Infrastructure | Spot instances, GPU clusters | Reserved instances, autoscaling |
25
+ | State | Stateful (checkpoint, resume) | Stateless (each request independent) |
26
+ | Failures | Retry / resume from checkpoint | Circuit breaker, fallback |
27
+
28
+ The primary risk of this split is **training-serving skew**: the model was trained with feature X computed one way, but production computes feature X a slightly different way, leading to silent accuracy degradation.
29
+
30
+ **Prevention strategies**:
31
+ 1. **Shared feature library**: Both training and serving import the same `src/features/` module for all feature computation. Never duplicate feature logic.
32
+ 2. **Feature store**: Centralised store that computes features once and serves them to both training (historical) and serving (real-time) paths.
33
+ 3. **Schema validation**: Validate model inputs at serving time against the schema seen during training.
34
+
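The third strategy, schema validation at serving time, can be sketched as follows. The schema contents, `validate_features`, and `SchemaMismatchError` are hypothetical names for illustration; in practice the schema would likely be loaded from the model registry entry for the deployed version:

```python
from typing import Any

# Hypothetical schema captured at training time: feature name -> expected type.
TRAINING_SCHEMA: dict[str, type] = {
    "amount_usd": float,
    "merchant_category": str,
    "txn_count_24h": int,
}


class SchemaMismatchError(ValueError):
    """Raised when a serving request does not match the training schema."""


def validate_features(features: dict[str, Any], schema: dict[str, type]) -> None:
    """Reject requests with missing, unexpected, or wrongly typed features."""
    missing = sorted(schema.keys() - features.keys())
    unexpected = sorted(features.keys() - schema.keys())
    # Exact type match on purpose: an int arriving where training saw a float
    # can indicate that serving uses a different preprocessing path.
    wrong_type = sorted(
        name
        for name, expected in schema.items()
        if name in features and type(features[name]) is not expected
    )
    if missing or unexpected or wrong_type:
        raise SchemaMismatchError(
            f"missing={missing} unexpected={unexpected} wrong_type={wrong_type}"
        )
```

Failing loudly at the serving boundary turns silent accuracy degradation into an explicit error that monitoring can count.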
35
+ ### Feature Store Architecture
36
+
37
+ A feature store is a data infrastructure component that stores and serves pre-computed features:
38
+
+ ```
+                     ┌──────────────────┐
+ Data Sources ──────►│ Feature Pipeline │
+                     └────────┬─────────┘
+                              │ compute features
+                     ┌────────▼─────────┐
+                     │  Feature Store   │
+                     │  ┌───────────┐   │
+                     │  │  Offline  │   │──► Training Data
+                     │  │  (S3/GCS) │   │
+                     │  ├───────────┤   │
+                     │  │  Online   │   │──► Serving (low-latency)
+                     │  │  (Redis)  │   │
+                     │  └───────────┘   │
+                     └──────────────────┘
+ ```
54
+
55
+ **Offline store**: Historical features for training. Backed by object storage (S3, GCS) or a columnar database (BigQuery, Snowflake). Supports point-in-time correct feature retrieval (no data leakage).
56
+
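Point-in-time correctness means each training row only sees feature values that were known at the label's timestamp. A minimal sketch of that lookup for a single entity, assuming an in-memory, time-sorted history and hypothetical function names; a feature store performs the equivalent join at scale:

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical feature history for one entity: (event_timestamp, value), sorted by time.
FeatureHistory = list[tuple[int, float]]


def feature_as_of(history: FeatureHistory, as_of: int) -> Optional[float]:
    """Return the latest feature value with timestamp <= as_of, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)  # number of entries at or before as_of
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(labels: list[tuple[int, int]], history: FeatureHistory) -> list[dict]:
    """Join each (label_timestamp, label) with the feature value known at that time."""
    return [
        {"label_ts": ts, "feature": feature_as_of(history, ts), "label": label}
        for ts, label in labels
    ]
```

Joining on the latest value regardless of timestamp would leak future information into training, which is the data leakage the offline store is designed to prevent.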
57
+ **Online store**: Low-latency feature serving for real-time inference. Backed by Redis, DynamoDB, or Cassandra. Stores only the latest feature values.
58
+
59
+ **Implementations**: Feast (open source), Tecton, Vertex AI Feature Store, SageMaker Feature Store.
60
+
61
+ A feature store is justified when:
62
+ - Multiple models use the same features (DRY for features)
63
+ - Training-serving skew is causing production issues
64
+ - Feature computation is expensive and should be shared
65
+
66
+ ### Model Registry
67
+
68
+ The model registry is the governance layer between training and production. Every model that has ever been trained should have a registry record:
69
+
+ ```
+ Training Run ──► Model Artifacts ──► Registry ──► Deployment
+                  (weights, config)   (metadata,   (staging,
+                                       lineage)     production)
+ ```
75
+
76
+ **Registry metadata per model version**:
77
+ - Model name and semantic version (`fraud-detector-v2.3.1`)
78
+ - Training run ID and commit SHA (full lineage)
79
+ - Training metrics (AUC, F1, etc.)
80
+ - Evaluation metrics on holdout sets
81
+ - Training dataset version
82
+ - Model schema (input/output feature names and types)
83
+ - Serving requirements (runtime, memory, GPU)
84
+ - Promotion history (who promoted, when, why)
85
+
86
+ **Lifecycle stages**:
87
+ ```
88
+ None → Staging → Production → Archived
89
+ ```
90
+
91
+ - Validation gates control promotion: automated tests (accuracy thresholds, latency budgets) plus optional human approval
92
+ - Production always has at least one previous version for rollback
93
+ - Archive on a schedule (keep 6 months of production versions)
94
+
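The lifecycle and its validation gates can be sketched as a small state machine. All names here (`Stage`, `ModelVersion`, `promote`) are hypothetical and do not correspond to any registry's API; the sketch also assumes every gated metric is higher-is-better, so a latency budget would need the comparison inverted:

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    NONE = "None"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


# Allowed transitions, following None -> Staging -> Production -> Archived.
NEXT_STAGE = {
    Stage.NONE: Stage.STAGING,
    Stage.STAGING: Stage.PRODUCTION,
    Stage.PRODUCTION: Stage.ARCHIVED,
}


@dataclass
class ModelVersion:
    name: str
    version: str
    metrics: dict[str, float]
    stage: Stage = Stage.NONE
    history: list[str] = field(default_factory=list)


def promote(model: ModelVersion, thresholds: dict[str, float], approved_by: str) -> None:
    """Advance one stage if every gated metric meets its threshold."""
    if model.stage not in NEXT_STAGE:
        raise ValueError(f"{model.stage.value} is a terminal stage")
    target = NEXT_STAGE[model.stage]
    if target is not Stage.ARCHIVED:  # archiving needs no quality gate
        failing = sorted(
            name
            for name, minimum in thresholds.items()
            if model.metrics.get(name, float("-inf")) < minimum
        )
        if failing:
            raise ValueError(f"validation gate failed: {failing}")
    # Record who promoted and between which stages (promotion history).
    model.history.append(f"{model.stage.value} -> {target.value} by {approved_by}")
    model.stage = target
```

A missing metric is treated as a gate failure, so a model cannot be promoted simply because an evaluation step was skipped.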
95
+ **Implementations**: MLflow Model Registry, Weights & Biases Model Registry, SageMaker Model Registry.
96
+
97
+ ### Online vs. Offline Inference
98
+
99
+ **Online inference** (real-time, synchronous):
100
+ - Model returns predictions in response to a request, typically within 100–500ms
101
+ - Examples: fraud scoring at checkout, recommendation on page load, search ranking
102
+ - Infrastructure: Model server (TorchServe, Triton, BentoML), autoscaled behind a load balancer
103
+ - Key considerations: latency budget, model size (must fit in serving memory), cold start time
104
+
105
+ **Offline inference** (batch, asynchronous):
106
+ - Model scores a large dataset on a schedule, stores predictions for later retrieval
107
+ - Examples: churn prediction for all users (run nightly), content pre-scoring for a recommendation cache
108
+ - Infrastructure: Spark job, Airflow DAG, Ray cluster, or simple Python script on a large VM
109
+ - Key considerations: throughput (records/second), data pipeline integration, prediction freshness
110
+
111
+ **Near-real-time / stream inference**:
112
+ - Model scores events from a stream (Kafka, Kinesis) with seconds-to-minutes latency
113
+ - Examples: anomaly detection on clickstream, session-level personalisation
114
+ - Infrastructure: Kafka consumer + model inference worker, Flink ML, or Spark Structured Streaming
115
+ - Key considerations: exactly-once semantics, ordering guarantees, backpressure handling
116
+
117
+ **Decision matrix**:
118
+
119
+ | Use Case | Latency Requirement | Data Freshness | Approach |
120
+ |----------|---------------------|----------------|----------|
121
+ | Checkout fraud | < 500ms | Real-time | Online inference |
122
+ | Churn prediction | N/A | Daily | Offline batch |
123
+ | Email personalisation | N/A (send time) | Daily | Offline batch |
124
+ | Feed ranking | < 200ms | Real-time | Online inference |
125
+ | Anomaly detection | < 30 seconds | Streaming | Stream inference |
126
+
127
+ ### ML System Design Patterns
128
+
129
+ **Lambda Architecture for ML**:
130
+ - Batch layer: nightly model retraining or batch scoring
131
+ - Speed layer: real-time model updates or online scoring
132
+ - Serving layer: unified API serving precomputed (batch) + real-time predictions
133
+ - Complexity cost is high — evaluate whether the freshness gain justifies it
134
+
135
+ **Two-Tower Architecture** (recommendation systems):
136
+ - Candidate generation tower: fast approximate nearest-neighbour retrieval (ANN index)
137
+ - Ranking tower: expensive full model scoring of top-K candidates
138
+ - Separates recall (retrieve thousands of candidates quickly) from precision (rank them accurately)
139
+
140
+ **Shadow Mode Deployment**:
141
+ - New model runs in parallel with production model, receiving real traffic
142
+ - New model's predictions are not served but are logged and evaluated
143
+ - Safe way to validate new models with real data before full deployment
144
+
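Shadow mode can be sketched as a thin wrapper around the two models. The function and parameter names are hypothetical, and the sketch calls the shadow model synchronously for simplicity; a production setup would likely run it asynchronously so that it adds no latency to the served request:

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("shadow")
Model = Callable[[dict], float]


def predict_with_shadow(
    features: dict,
    production: Model,
    shadow: Model,
    comparison_log: list[dict],
) -> float:
    """Serve the production prediction; record the shadow prediction for offline comparison."""
    served = production(features)
    shadow_score: Optional[float]
    try:
        shadow_score = shadow(features)
    except Exception:  # a failing shadow model must never affect the served response
        logger.exception("shadow model failed")
        shadow_score = None
    comparison_log.append(
        {"features": features, "production": served, "shadow": shadow_score}
    )
    return served
```

The caller only ever receives the production score; the logged pairs are what an offline job would evaluate before promoting the shadow model.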
145
+ ### Architecture Decision Record Template for ML
146
+
147
+ ```markdown
148
+ # ADR-ML-001: Online vs. Offline Inference for Recommendation
149
+
150
+ ## Status
151
+ Accepted — 2024-03-15
152
+
153
+ ## Context
154
+ Product recommends items to users on homepage. 5M daily active users.
155
+ Data team produces updated user embeddings daily. Item catalog updates hourly.
156
+
157
+ ## Decision
158
+ Use offline batch inference for recommendation.
159
+ - Nightly batch job scores all user-item pairs for top 100 candidates
160
+ - Recommendations stored in Redis with TTL of 26 hours
161
+ - API reads from Redis — zero model inference at request time
162
+
163
+ ## Consequences
164
+ - P99 homepage latency drops from 450ms to 20ms (Redis lookup vs. model inference)
165
+ - Recommendations are up to 26 hours stale — acceptable given content update frequency
166
+ - Requires Redis cluster (operational cost)
167
+ - Model update cycle is daily — cannot react to same-session behaviour
168
+
169
+ ## Alternatives Rejected
170
+ - Online inference: latency budget exceeded (model P99 = 380ms)
171
+ - Streaming inference: engineering complexity not justified given daily embedding update cadence
172
+ ```