@zigrivers/scaffold 3.8.0 → 3.9.1

Files changed (70)
  1. package/README.md +73 -8
  2. package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
  3. package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
  4. package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
  5. package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
  6. package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
  7. package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
  8. package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
  9. package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
  10. package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
  11. package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
  12. package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
  13. package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
  14. package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
  15. package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
  16. package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
  17. package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
  18. package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
  19. package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
  20. package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
  21. package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
  22. package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
  23. package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
  24. package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
  25. package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
  26. package/content/knowledge/ml/ml-architecture.md +172 -0
  27. package/content/knowledge/ml/ml-conventions.md +209 -0
  28. package/content/knowledge/ml/ml-dev-environment.md +299 -0
  29. package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
  30. package/content/knowledge/ml/ml-model-evaluation.md +256 -0
  31. package/content/knowledge/ml/ml-observability.md +253 -0
  32. package/content/knowledge/ml/ml-project-structure.md +216 -0
  33. package/content/knowledge/ml/ml-requirements.md +138 -0
  34. package/content/knowledge/ml/ml-security.md +188 -0
  35. package/content/knowledge/ml/ml-serving-patterns.md +243 -0
  36. package/content/knowledge/ml/ml-testing.md +301 -0
  37. package/content/knowledge/ml/ml-training-patterns.md +269 -0
  38. package/content/methodology/browser-extension-overlay.yml +82 -0
  39. package/content/methodology/data-pipeline-overlay.yml +70 -0
  40. package/content/methodology/ml-overlay.yml +70 -0
  41. package/dist/cli/commands/init.d.ts +13 -0
  42. package/dist/cli/commands/init.d.ts.map +1 -1
  43. package/dist/cli/commands/init.js +122 -2
  44. package/dist/cli/commands/init.js.map +1 -1
  45. package/dist/cli/commands/init.test.js +120 -0
  46. package/dist/cli/commands/init.test.js.map +1 -1
  47. package/dist/config/schema.d.ts +864 -48
  48. package/dist/config/schema.d.ts.map +1 -1
  49. package/dist/config/schema.js +53 -0
  50. package/dist/config/schema.js.map +1 -1
  51. package/dist/config/schema.test.js +166 -3
  52. package/dist/config/schema.test.js.map +1 -1
  53. package/dist/core/assembly/overlay-loader.test.js +33 -0
  54. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  55. package/dist/e2e/project-type-overlays.test.d.ts +2 -2
  56. package/dist/e2e/project-type-overlays.test.js +499 -33
  57. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  58. package/dist/types/config.d.ts +10 -1
  59. package/dist/types/config.d.ts.map +1 -1
  60. package/dist/wizard/questions.d.ts +17 -1
  61. package/dist/wizard/questions.d.ts.map +1 -1
  62. package/dist/wizard/questions.js +75 -1
  63. package/dist/wizard/questions.js.map +1 -1
  64. package/dist/wizard/questions.test.js +167 -0
  65. package/dist/wizard/questions.test.js.map +1 -1
  66. package/dist/wizard/wizard.d.ts +13 -0
  67. package/dist/wizard/wizard.d.ts.map +1 -1
  68. package/dist/wizard/wizard.js +17 -1
  69. package/dist/wizard/wizard.js.map +1 -1
  70. package/package.json +1 -1
@@ -0,0 +1,145 @@
1
+ ---
2
+ name: data-pipeline-requirements
3
+ description: SLA requirements (latency, throughput, freshness), data quality budgets, and regulatory compliance (PII, GDPR) for data pipelines
4
+ topics: [data-pipeline, requirements, sla, latency, throughput, freshness, data-quality, pii, gdpr, compliance]
5
+ ---
6
+
7
+ Data pipeline requirements must be defined in measurable terms before any architecture decision is made. Vague requirements like "fast" or "reliable" translate into untestable systems. Every pipeline needs explicit SLAs for latency, throughput, and data freshness — and a defined budget for data quality violations and regulatory compliance obligations.
8
+
9
+ ## Summary
10
+
11
+ Define pipeline SLAs in concrete numbers: end-to-end latency targets (e.g., p99 < 5 minutes), throughput capacity (e.g., 500K events/sec peak), and data freshness windows (e.g., reports no older than 1 hour). Set explicit data quality budgets — allowable error rates, null tolerances, and schema drift thresholds. Identify all PII fields and their regulatory classification (GDPR, CCPA, HIPAA) before building any pipeline that touches them.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Latency SLA Definition
16
+
17
+ Latency in data pipelines is measured end-to-end: from event occurrence to data availability in the serving layer. Define latency at percentiles, not averages:
18
+
19
+ - **p50 (median)**: Typical processing time under normal load
20
+ - **p95**: The target most operations should hit
21
+ - **p99**: The tail latency SLA — what is acceptable for 1-in-100 operations
22
+ - **p99.9**: Tail behavior in worst-case bursts (1-in-1,000 operations)
23
+
24
+ Example SLA table:
25
+
26
+ | Pipeline Type | p50 Target | p99 Target | Max Acceptable Lag |
27
+ |---------------|------------|------------|-------------------|
28
+ | Fraud detection | < 500ms | < 2s | 10s |
29
+ | Operational reporting | < 5min | < 15min | 1hr |
30
+ | Analytics warehouse | < 1hr | < 4hr | 24hr |
31
+ | Compliance audit logs | < 15min | < 1hr | 4hr |
32
+
33
+ Latency SLAs drive architecture selection. A p99 < 1 second requirement rules out batch processing and mandates streaming. A p99 < 1 hour permits micro-batch or even hourly cron jobs. Match the architecture to the SLA before writing any code.
34
+
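A minimal sketch of checking observed latencies against percentile targets, assuming latencies are collected as a list of seconds. The function names and the nearest-rank percentile method are illustrative choices, not part of the package.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency_sla(samples: list[float], targets: dict[int, float]) -> dict[int, tuple[float, float, bool]]:
    """Map each percentile to (observed, target, within_target)."""
    report = {}
    for pct, target in targets.items():
        observed = percentile(samples, pct)
        report[pct] = (observed, target, observed <= target)
    return report


# Hypothetical sample: 100 end-to-end latencies of 1..100 seconds.
latencies = [float(i) for i in range(1, 101)]
report = check_latency_sla(latencies, {50: 60.0, 99: 95.0})
```

Here the p50 target is met while the p99 target is not, which is the kind of gap an average would hide.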
35
+ ### Throughput and Capacity Planning
36
+
37
+ Throughput requirements must account for peak load, not average load. Data pipelines routinely experience 10x to 100x spikes at business events (end of month, flash sales, marketing campaigns):
38
+
39
+ - **Steady-state throughput**: Average events or records per second under normal operation
40
+ - **Peak throughput**: Maximum burst the system must handle without degradation
41
+ - **Sustained peak duration**: How long the peak must be sustained before scaling kicks in
42
+ - **Data volume growth rate**: Year-over-year growth trajectory for capacity planning
43
+
44
+ Document throughput in multiple units because different layers care about different metrics:
45
+ - Ingestion layer: events/second, bytes/second
46
+ - Processing layer: records/second, transformations/second
47
+ - Storage layer: GB/day ingested, TB total, compression ratio expected
48
+
49
+ Capacity plan to handle 3x current peak to avoid emergency scaling. If current peak is 100K events/sec, design for 300K.
50
+
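The 3x headroom rule can be written as a small calculation. This is a sketch only: compounding the year-over-year growth rate into the design target is an assumption added for illustration, and the function name is hypothetical.

```python
def required_capacity(
    current_peak_eps: float,
    headroom: float = 3.0,
    annual_growth: float = 0.0,
    years: int = 0,
) -> float:
    """Design capacity in events/sec: current peak times headroom, compounded by growth."""
    return current_peak_eps * headroom * (1 + annual_growth) ** years


# The example from the text: 100K events/sec peak with 3x headroom.
design_target = required_capacity(100_000)

# Assumed scenario: the same peak, growing 50% per year over a two-year horizon.
two_year_target = required_capacity(100_000, annual_growth=0.5, years=2)
```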
51
+ ### Data Freshness Requirements
52
+
53
+ Data freshness (or staleness tolerance) defines how old data can be before it loses business value. Freshness requirements vary dramatically by use case:
54
+
55
+ - **Real-time operational decisions** (fraud, recommendations): seconds to minutes of acceptable staleness
56
+ - **Operational dashboards** (business KPIs, support queues): minutes to one hour
57
+ - **Analytical reporting** (daily/weekly reports): hours to 24 hours
58
+ - **Historical analysis and ML training**: days are acceptable
59
+
60
+ Freshness SLAs drive pipeline scheduling and trigger mechanisms. A freshness requirement of 5 minutes means the pipeline must complete within 5 minutes of new data arriving — which rules out hourly batch jobs. Document the business justification for each freshness requirement; tighter SLAs cost more to build and operate.
61
+
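A freshness requirement can be checked by comparing the newest event timestamp in the serving layer with the current time. The sketch below assumes timezone-aware timestamps; the function name and sample values are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def meets_freshness_sla(
    latest_event_time: datetime,
    max_lag: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True when the newest available record is no older than max_lag."""
    now = now or datetime.now(timezone.utc)
    return now - latest_event_time <= max_lag


# Hypothetical observation: the newest record is 7 minutes old.
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
latest = datetime(2024, 1, 15, 11, 53, tzinfo=timezone.utc)

within_5_min = meets_freshness_sla(latest, timedelta(minutes=5), now)
within_10_min = meets_freshness_sla(latest, timedelta(minutes=10), now)
```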
62
+ ### Data Quality Budgets
63
+
64
+ Every pipeline has a data quality budget — the maximum allowable error rate before downstream systems or business decisions are materially impacted. Define explicit thresholds:
65
+
66
+ **Completeness budget**: Maximum acceptable null rate or missing record rate per field
67
+ - Critical fields (revenue, user ID): 0% nulls tolerated
68
+ - Enrichment fields (device type, referrer): up to 5% nulls acceptable
69
+ - Optional metadata fields: up to 20% nulls acceptable before alerting
70
+
71
+ **Accuracy budget**: Maximum acceptable rate of values outside valid ranges or formats
72
+ - Financial amounts: 0% out-of-range values
73
+ - Timestamps: < 0.1% future-dated or epoch-zero values
74
+ - Categorical fields: < 1% unrecognized enum values
75
+
76
+ **Timeliness budget**: Maximum acceptable late-arrival rate for event-driven pipelines
77
+ - Streaming pipelines: define maximum allowable event delay before treating as late
78
+ - Batch pipelines: define maximum acceptable row count variance versus expected volume
79
+
80
+ **Consistency budget**: Maximum acceptable divergence between pipeline outputs and source of truth
81
+ - Define reconciliation checkpoints and acceptable variance thresholds
82
+ - Example: daily record count must match source system within 0.01%
83
+
84
+ Document quality budgets as tested assertions, not prose. Every quality threshold should have a corresponding data quality test (Great Expectations check, dbt test, or equivalent) that fails the pipeline when the threshold is breached.
85
+
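As a dependency-free illustration of a completeness budget expressed as an assertion, the sketch below computes per-field null rates and reports the fields that exceed their budget. In practice this check would usually live in Great Expectations or dbt; the function names and sample rows here are hypothetical.

```python
def null_rate(rows: list[dict], field: str) -> float:
    """Fraction of rows where the field is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def completeness_violations(rows: list[dict], budgets: dict[str, float]) -> dict[str, float]:
    """Return {field: observed_null_rate} for every field over its budget."""
    violations = {}
    for field, budget in budgets.items():
        rate = null_rate(rows, field)
        if rate > budget:
            violations[field] = rate
    return violations


rows = [
    {"user_id": "u1", "device_type": "ios"},
    {"user_id": "u2", "device_type": None},
    {"user_id": "u3", "device_type": "web"},
    {"user_id": "u4"},  # device_type missing entirely
]
violations = completeness_violations(rows, {"user_id": 0.0, "device_type": 0.05})
```

A pipeline step could then fail with `assert not violations, violations` so that a breached budget stops the run.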
86
+ ### Regulatory Compliance Requirements
87
+
88
+ Identify every regulatory framework that applies before writing the first line of pipeline code. Retrofitting compliance is expensive and error-prone.
89
+
90
+ **PII (Personally Identifiable Information) inventory**
91
+
92
+ Catalog every field that constitutes PII under applicable regulations:
93
+ - Direct identifiers: name, email, phone, SSN, passport number, IP address
94
+ - Quasi-identifiers: ZIP code, birthdate, job title (can identify individuals in combination)
95
+ - Sensitive categories (GDPR Article 9): health data, biometric data, racial/ethnic origin, political opinions
96
+
97
+ For each PII field, document:
98
+ - Which regulation governs it (GDPR, CCPA, HIPAA, FERPA, etc.)
99
+ - Retention limit (e.g., GDPR: must be deletable on erasure request)
100
+ - Processing lawful basis (consent, legitimate interest, contractual necessity)
101
+ - Cross-border transfer restrictions (GDPR: EU personal data cannot flow to countries without an adequacy decision unless SCCs or another approved transfer mechanism is in place)
102
+
103
+ **GDPR-specific pipeline requirements**
104
+ - Right to erasure: pipeline must support deletion of a user's data from all downstream stores, including derived tables and ML features
105
+ - Data minimization: collect only what is necessary; don't pipeline fields not needed for the stated purpose
106
+ - Purpose limitation: data collected for one purpose cannot be reused for an incompatible purpose in a new pipeline
107
+ - Records of processing activities: maintain documented inventory of all pipelines that process personal data
108
+
109
+ **Data residency requirements**
110
+ - Some regulations require data to stay within geographic boundaries (EU data in EU, Australian data in Australia)
111
+ - Pipeline infrastructure (Kafka brokers, object storage, processing clusters) must be provisioned in compliant regions
112
+ - Cross-region replication must be explicitly reviewed against residency rules
113
+
114
+ **Retention and deletion**
115
+ - Define retention periods for every data store the pipeline writes to
116
+ - Implement automated purge jobs tied to retention schedules
117
+ - Test deletion propagation: deleting from source must eventually delete from all derived stores
118
+ - Maintain deletion audit log: timestamp, scope, and confirmation of deletion completion
119
+
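Deletion propagation can be tested by asserting that an erased user no longer appears in any store the pipeline writes to. The sketch below models each store as a set of user IDs, which is a simplification; the store names and the helper are hypothetical.

```python
def stores_still_holding(user_id: str, stores: dict[str, set[str]]) -> list[str]:
    """Names of the stores that still contain the user's data, sorted for stable output."""
    return sorted(name for name, ids in stores.items() if user_id in ids)


# Hypothetical snapshot taken after an erasure request for user "u1".
stores = {
    "bronze_events": {"u2"},
    "silver_users": {"u1", "u2"},
    "ml_features": {"u1"},
}
remaining = stores_still_holding("u1", stores)
```

An empty result is the pass condition; anything else names the stores where the purge job has not yet run or has failed.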
120
+ ### Documenting Requirements as Testable Contracts
121
+
122
+ All requirements must be expressed as testable conditions, not aspirational statements:
123
+
124
+ ```yaml
+ # requirements.yaml: machine-readable SLA contract
+ pipeline: user_events
+ sla:
+   latency_p99_seconds: 300
+   throughput_peak_events_per_second: 50000
+   freshness_max_lag_minutes: 10
+ quality:
+   completeness:
+     user_id: 0.0        # 0% nulls
+     event_type: 0.0
+     timestamp: 0.0
+     country_code: 0.05  # 5% nulls acceptable
+   accuracy:
+     timestamp_future_rate: 0.001
+ compliance:
+   pii_fields: [user_id, email, ip_address]
+   retention_days: 365
+   gdpr_erasure_supported: true
+ ```
144
+
145
+ Requirements documented this way feed directly into monitoring dashboards, alerting rules, and data quality test suites. When a requirement changes, the contract file changes — and all downstream tests automatically re-evaluate against the new threshold.
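
One way such a contract could drive a test, sketched with the contract held as an in-memory dict rather than parsed from the YAML file. The `evaluate_contract` helper and the observed-metrics shape are assumptions for illustration, and only upper-bound thresholds are covered (throughput is a lower bound and is left out).

```python
# Hypothetical in-memory copy of part of the contract; a real setup would parse requirements.yaml.
CONTRACT = {
    "sla": {"latency_p99_seconds": 300, "freshness_max_lag_minutes": 10},
    "quality": {"completeness": {"user_id": 0.0, "country_code": 0.05}},
}


def evaluate_contract(contract: dict, observed: dict) -> list[str]:
    """Return the names of contract thresholds that the observed metrics exceed."""
    breaches = []
    for metric, limit in contract["sla"].items():
        if observed["sla"][metric] > limit:
            breaches.append(f"sla.{metric}")
    for field, max_null_rate in contract["quality"]["completeness"].items():
        if observed["null_rates"][field] > max_null_rate:
            breaches.append(f"quality.completeness.{field}")
    return breaches


# Hypothetical metrics from one pipeline run.
observed = {
    "sla": {"latency_p99_seconds": 280, "freshness_max_lag_minutes": 12},
    "null_rates": {"user_id": 0.0, "country_code": 0.08},
}
breaches = evaluate_contract(CONTRACT, observed)
```

Because the thresholds live in the contract, tightening `freshness_max_lag_minutes` changes the test outcome without touching the test code.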
@@ -0,0 +1,295 @@
1
+ ---
2
+ name: data-pipeline-schema-management
3
+ description: Schema registry, evolution patterns, breaking change detection, and contract testing for data pipeline schemas
4
+ topics: [data-pipeline, schema-management, schema-registry, schema-evolution, breaking-changes, contract-testing, avro, protobuf]
5
+ ---
6
+
7
+ Schema management is the discipline of controlling how the structure of data changes over time while maintaining compatibility between producers and consumers. Without schema governance, a single field rename in a source system cascades into dozens of broken downstream pipelines, dashboards, and ML models. Schema registry, evolution compatibility rules, breaking change detection, and consumer contract tests are the four pillars that make schema changes safe.
8
+
9
+ ## Summary
10
+
11
+ Register all schemas in a centralized schema registry (Confluent Schema Registry, AWS Glue Schema Registry, or equivalent). Enforce backward compatibility as the minimum standard: consumers on the new schema version must be able to read data written with the previous version. Classify schema changes as breaking or non-breaking before deployment. Run consumer contract tests in CI to catch breaking changes before they reach production. Use schema versioning (`_v2`) for irreversible breaking changes.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Schema Registry
16
+
17
+ A schema registry is a centralized service that stores schema versions and enforces compatibility rules when a new version is registered. Producers and consumers then look up schemas by ID at serialize and deserialize time:
18
+
19
+ **Confluent Schema Registry setup**
20
+
21
+ ```python
+ from confluent_kafka.schema_registry import Schema, SchemaRegistryClient
+ from confluent_kafka.schema_registry.avro import AvroSerializer, AvroDeserializer
+
+ schema_registry_client = SchemaRegistryClient({"url": "http://localhost:8081"})
+
+ # Define Avro schema (a logicalType annotates the type, so it is nested inside "type")
+ TRANSACTION_SCHEMA_STR = """
+ {
+   "type": "record",
+   "name": "Transaction",
+   "namespace": "com.example.payments",
+   "fields": [
+     {"name": "transaction_id", "type": "string"},
+     {"name": "amount", "type": "double"},
+     {"name": "currency", "type": "string"},
+     {"name": "user_id", "type": "string"},
+     {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
+   ]
+ }
+ """
+
+ # Register schema (idempotent: registering the same schema returns the same ID)
+ schema_id = schema_registry_client.register_schema(
+     subject_name="payments_transaction_created-value",
+     schema=Schema(TRANSACTION_SCHEMA_STR, schema_type="AVRO"),
+ )
+
+ # Serialize with schema ID embedded in message
+ avro_serializer = AvroSerializer(
+     schema_registry_client=schema_registry_client,
+     schema_str=TRANSACTION_SCHEMA_STR,
+ )
+ ```
55
+
56
+ **Schema subjects**
57
+
58
+ Schema Registry organizes schemas by subject. The standard naming convention for Kafka topics:
59
+ - Value schema: `{topic-name}-value` (e.g., `payments_transaction_created-value`)
60
+ - Key schema: `{topic-name}-key` (e.g., `payments_transaction_created-key`)
61
+
62
+ For non-Kafka schemas (database tables, API contracts): use `{domain}.{entity}.{version}` as the subject name.
63
+
64
+ ### Compatibility Modes
65
+
66
+ Schema Registry enforces compatibility between schema versions. Choose the strictest compatibility mode that your deployment workflow supports:
67
+
68
+ **BACKWARD (recommended default)**
69
+ - New schema can read data written by old schema
70
+ - Upgrade consumers to the new schema before producers start writing it
71
+ - Allowed changes: add optional fields (with defaults), remove fields
72
+
73
+ ```json
74
+ // v1 schema
75
+ {"fields": [{"name": "id"}, {"name": "amount"}]}
76
+
77
+ // v2 schema (BACKWARD compatible: added optional field with default)
78
+ {"fields": [{"name": "id"}, {"name": "amount"}, {"name": "currency", "default": "USD"}]}
79
+ ```
80
+
81
+ **FORWARD**
82
+ - Old schema can read data written by new schema
83
+ - Upgrade producers first; consumers on the old schema can still read the new data
84
+ - Allowed changes: remove optional fields, add fields without defaults
85
+
86
+ **FULL**
87
+ - Both backward and forward compatible
88
+ - Most restrictive: only add optional fields with defaults, or remove optional fields
89
+ - Recommended when upgrade order cannot be guaranteed
90
+
91
+ **NONE**
92
+ - No compatibility enforcement
93
+ - Any change is allowed
94
+ - Only appropriate for development environments
95
+
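The field-level rules above can be sketched as a simplified checker. This is not the registry's algorithm: it looks only at added and removed record fields and their defaults, ignoring type changes, aliases, and nested records. The function names are hypothetical.

```python
def is_backward_compatible(old_fields: list[dict], new_fields: list[dict]) -> bool:
    """New schema reads old data: every field the old schema lacks must have a default."""
    old_names = {field["name"] for field in old_fields}
    return all("default" in field for field in new_fields if field["name"] not in old_names)


def is_forward_compatible(old_fields: list[dict], new_fields: list[dict]) -> bool:
    """Old schema reads new data: the same rule with the reader and writer roles swapped."""
    return is_backward_compatible(new_fields, old_fields)


v1 = [{"name": "id"}, {"name": "amount"}]
v2_optional = v1 + [{"name": "currency", "default": "USD"}]  # add field with default
v2_required = v1 + [{"name": "region"}]                      # add field without default
v2_removed = [{"name": "id"}]                                # remove a field that had no default

results = {
    "optional_add": (is_backward_compatible(v1, v2_optional), is_forward_compatible(v1, v2_optional)),
    "required_add": (is_backward_compatible(v1, v2_required), is_forward_compatible(v1, v2_required)),
    "removal": (is_backward_compatible(v1, v2_removed), is_forward_compatible(v1, v2_removed)),
}
```

Only the optional addition passes both checks, which is why FULL mode restricts changes to optional fields.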
96
+ **Setting compatibility per subject**
97
+
98
+ ```python
99
+ schema_registry_client.set_compatibility(
100
+ subject_name="payments_transaction_created-value",
101
+ level="BACKWARD",
102
+ )
103
+ ```
104
+
105
+ ### Evolution Patterns
106
+
107
+ **Adding a field (backward compatible)**
108
+
109
+ Always provide a default value for new fields in Avro schemas. The default lets consumers on the new schema deserialize older messages that lack the field; consumers still on the old schema simply ignore the new field during schema resolution:
110
+
111
+ ```json
112
+ // Safe: new field with default
113
+ {"name": "merchant_category", "type": ["null", "string"], "default": null}
114
+ ```
115
+
116
+ Do NOT add a required field without a default — this breaks BACKWARD compatibility.
117
+
118
+ **Removing a field**
119
+
120
+ The registry accepts field removal under BACKWARD mode, but it is safe in practice only if consumers that read the field are updated first. Coordinate the removal:
121
+ 1. Deprecate the field: add a comment/annotation that it will be removed in 30 days
122
+ 2. Update all consumers to stop reading the field
123
+ 3. Verify no consumers reference the field (run contract tests)
124
+ 4. Remove the field from the schema
125
+
126
+ **Renaming a field (breaking change)**
127
+
128
+ Renaming is always a breaking change. The canonical approach:
129
+ 1. Add the new field name alongside the old field (dual-write period)
130
+ 2. Update all consumers to read from the new field
131
+ 3. Verify all consumers migrated
132
+ 4. Remove the old field
133
+
134
+ For Avro, use `aliases` to support rename during transition:
135
+ ```json
136
+ {"name": "merchant_id", "aliases": ["vendor_id"], "type": "string"}
137
+ ```
138
+
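The dual-write period can be sketched from both sides: the producer populates both names, and a migrated consumer prefers the new name while tolerating old messages. The field names follow the `merchant_id` / `vendor_id` example; the helper functions are hypothetical.

```python
def build_event(merchant_id: str) -> dict:
    """Producer during the transition: write the value under both field names."""
    return {"merchant_id": merchant_id, "vendor_id": merchant_id}


def read_merchant_id(event: dict) -> str:
    """Migrated consumer: prefer the new field, fall back to the old one."""
    if "merchant_id" in event:
        return event["merchant_id"]
    if "vendor_id" in event:
        return event["vendor_id"]
    raise KeyError("event has neither merchant_id nor vendor_id")


new_style = read_merchant_id(build_event("m_42"))
old_style = read_merchant_id({"vendor_id": "m_7"})  # message written before the rename
```

Once no message in the retention window carries only `vendor_id`, the fallback branch and the old field can be removed.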
139
+ **Changing a field type (breaking change)**
140
+
141
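+ Type changes are usually breaking. Example: changing `amount` from `int` to `double` breaks consumers that still read with `int`, even though Avro's promotion rules let readers on the new `double` schema decode old `int` data.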
142
+
143
+ Options:
144
+ - Create a new versioned schema subject (`payments_transaction_created_v2-value`)
145
+ - Use a union type during transition: `{"type": ["int", "double"]}`
146
+ - Use a new field with the new type alongside the old field
147
+
148
+ ### Breaking Change Detection
149
+
150
+ Detect breaking changes before they reach production using automated tooling:
151
+
152
+ **Schema compatibility check in CI**
153
+
154
+ ```bash
+ #!/bin/bash
+ # scripts/check-schema-compatibility.sh
+ set -euo pipefail
+ shopt -s globstar nullglob  # enable ** recursion; skip the loop when nothing matches
+
+ for schema_file in schemas/**/*.avsc; do
+   subject=$(basename "$schema_file" .avsc)
+   echo "==> Checking compatibility for $subject"
+
+   # Test compatibility against the latest registered version.
+   # The registry expects the schema as a JSON-encoded string, hence `tojson`.
+   result=$(jq -c '{schema: tojson}' "$schema_file" | curl -s -X POST \
+     "http://schema-registry:8081/compatibility/subjects/${subject}-value/versions/latest" \
+     -H "Content-Type: application/vnd.schemaregistry.v1+json" \
+     -d @-)
+
+   if [[ "$(echo "$result" | jq -r '.is_compatible')" != "true" ]]; then
+     echo "ERROR: Breaking change detected in $subject"
+     echo "$result" | jq .
+     exit 1
+   fi
+ done
+ echo "All schemas are compatible"
+ ```
176
+
177
+ **Buf for Protobuf schemas**
178
+
179
+ For Protobuf-based pipelines, use Buf to detect breaking changes:
180
+
181
+ ```bash
+ # buf.yaml
+ version: v1
+ breaking:
+   use:
+     - FILE  # detect file-level breaking changes
+
+ # In CI:
+ buf breaking --against ".git#branch=main"
+ ```
191
+
192
+ **JSONSchema diff tools**
193
+
194
+ ```python
+ # Assumes a diff helper exposing `diff_schemas`, whose results carry an
+ # `is_breaking` flag; adapt the import to the diff tool you actually use.
+ from json_schema_diff import diff_schemas
+
+ def check_json_schema_breaking_changes(old_schema: dict, new_schema: dict) -> list[str]:
+     """Return list of breaking changes between schema versions."""
+     differences = diff_schemas(old_schema, new_schema)
+     breaking = [d for d in differences if d.is_breaking]
+     return [str(d) for d in breaking]
+ ```
203
+
204
+ ### Contract Testing
205
+
206
+ Consumer contract tests verify that a schema change does not break any known consumer. They run in CI for every schema change.
207
+
208
+ **Consumer-driven contract testing with Pact**
209
+
210
+ ```python
+ # Consumer (silver processor): define the contract
+ from pact import Consumer, Provider
+
+ pact = Consumer("silver-processor").has_pact_with(Provider("bronze-events"))
+
+ def test_can_consume_transaction_event():
+     """Consumer contract: silver processor can consume bronze transaction events."""
+     expected_event = {
+         "transaction_id": "txn_abc123",
+         "amount": 99.99,
+         "currency": "USD",
+         "user_id": "usr_001234",
+         "created_at": 1705329825000,
+     }
+
+     (pact
+      .given("a valid transaction event exists")
+      .upon_receiving("a request for transaction events")
+      .with_request("GET", "/events/transaction")
+      .will_respond_with(200, body=expected_event))
+
+     # silver_processor and get_test_event are project-specific helpers defined elsewhere
+     with pact:
+         result = silver_processor.consume_event(get_test_event())
+         assert result["transaction_id"] == "txn_abc123"
+ ```
236
+
237
+ **Schema-based contract tests without Pact**
238
+
239
+ For Kafka-based pipelines, consumer contracts can be expressed as schema validation:
240
+
241
+ ```python
+ # tests/contracts/test_transaction_consumer_contract.py
+ import json
+
+ CONSUMER_REQUIRED_FIELDS = {
+     "silver_processor": ["transaction_id", "amount", "currency", "user_id", "created_at"],
+     "fraud_detector": ["transaction_id", "amount", "user_id", "merchant_id"],
+     "revenue_aggregator": ["transaction_id", "amount", "currency", "created_at"],
+ }
+
+ def test_schema_satisfies_all_consumer_contracts():
+     """Verify that the current schema provides all fields required by registered consumers."""
+     with open("schemas/payments_transaction_created.avsc") as schema_file:
+         schema = json.load(schema_file)
+     schema_fields = {field["name"] for field in schema["fields"]}
+
+     for consumer, required_fields in CONSUMER_REQUIRED_FIELDS.items():
+         missing = set(required_fields) - schema_fields
+         assert not missing, (
+             f"Schema breaking change: consumer '{consumer}' requires fields {missing} "
+             f"which are not in the current schema"
+         )
+ ```
263
+
264
+ ### Table Schema Management in the Warehouse
265
+
266
+ For SQL-based warehouses (BigQuery, Snowflake, Redshift), manage table schemas with migration files:
267
+
268
+ ```
+ schemas/
+ ├── migrations/
+ │   ├── 001_create_silver_transactions.sql
+ │   ├── 002_add_merchant_category.sql
+ │   └── 003_add_processing_fee.sql
+ └── current/
+     └── silver_transactions.sql   # current expected schema
+ ```
277
+
278
+ **Migration file pattern**
279
+
280
+ ```sql
281
+ -- schemas/migrations/002_add_merchant_category.sql
282
+ -- Migration: add merchant_category field to silver_transactions
283
+ -- Author: data-team
284
+ -- Date: 2024-01-15
285
+ -- Breaking: No (new nullable column, no default; backfill is a separate step)
286
+ -- Consumer impact: None (additive change)
287
+
288
+ ALTER TABLE silver_transactions
289
+ ADD COLUMN IF NOT EXISTS merchant_category STRING OPTIONS(description="MCC category code");
290
+
291
+ -- Backfill for existing records (run separately after migration)
292
+ -- UPDATE silver_transactions SET merchant_category = 'UNKNOWN' WHERE merchant_category IS NULL;
293
+ ```
294
+
295
+ Automate migration application in CI/CD and verify the current schema matches expectations after each migration.
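
The post-migration verification could look like the sketch below, which compares an expected column map against the columns actually present. How the actual columns are fetched (for example from the warehouse's information schema) is left out; the function name and sample columns are assumptions.

```python
def schema_drift(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare column name -> type maps and report missing, unexpected, and mistyped columns."""
    shared = expected.keys() & actual.keys()
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
        "type_mismatch": sorted(c for c in shared if expected[c].upper() != actual[c].upper()),
    }


# Hypothetical state: migration 002 has not been applied, and amount has the wrong type.
expected = {"transaction_id": "STRING", "amount": "NUMERIC", "merchant_category": "STRING"}
actual = {"transaction_id": "string", "amount": "FLOAT64", "legacy_flag": "BOOL"}
drift = schema_drift(expected, actual)
```

A CI step can fail the deployment whenever any of the three lists is non-empty.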