@zigrivers/scaffold 3.8.0 → 3.9.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +75 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +167 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
@@ -0,0 +1,145 @@
---
name: data-pipeline-requirements
description: SLA requirements (latency, throughput, freshness), data quality budgets, and regulatory compliance (PII, GDPR) for data pipelines
topics: [data-pipeline, requirements, sla, latency, throughput, freshness, data-quality, pii, gdpr, compliance]
---

Data pipeline requirements must be defined in measurable terms before any architecture decision is made. Vague requirements like "fast" or "reliable" translate into untestable systems. Every pipeline needs explicit SLAs for latency, throughput, and data freshness, plus a defined budget for data quality violations and a documented list of its regulatory compliance obligations.

## Summary

Define pipeline SLAs in concrete numbers: end-to-end latency targets (e.g., p99 < 5 minutes), throughput capacity (e.g., 500K events/sec peak), and data freshness windows (e.g., reports no older than 1 hour). Set explicit data quality budgets: allowable error rates, null tolerances, and schema drift thresholds. Identify all PII fields and their regulatory classification (GDPR, CCPA, HIPAA) before building any pipeline that touches them.

## Deep Guidance

### Latency SLA Definition

Latency in data pipelines is measured end-to-end: from event occurrence to data availability in the serving layer. Define latency at percentiles, not averages:

- **p50 (median)**: Typical processing time under normal load
- **p95**: The target most operations should hit
- **p99**: The tail latency SLA, i.e. what is acceptable for 1-in-100 operations
- **p999** (99.9th percentile): What happens in worst-case bursts

Example SLA table:

| Pipeline Type | p50 Target | p99 Target | Max Acceptable Lag |
|---------------|------------|------------|-------------------|
| Fraud detection | < 500ms | < 2s | 10s |
| Operational reporting | < 5min | < 15min | 1hr |
| Analytics warehouse | < 1hr | < 4hr | 24hr |
| Compliance audit logs | < 15min | < 1hr | 4hr |

Latency SLAs drive architecture selection. A p99 < 1 second requirement rules out batch processing and mandates streaming. A p99 < 1 hour permits micro-batch or even hourly cron jobs. Match the architecture to the SLA before writing any code.
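
To make the percentile targets concrete, here is a minimal sketch of checking observed latencies against an SLA. It assumes the nearest-rank percentile definition; the function names and sample values are illustrative, not part of any particular monitoring tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_latency_sla(samples_seconds, targets):
    """Map each percentile to (observed, target, ok), where ok means observed < target."""
    report = {}
    for pct, target in targets.items():
        observed = percentile(samples_seconds, pct)
        report[pct] = (observed, target, observed < target)
    return report


# Hypothetical end-to-end latencies in seconds, checked against the fraud-detection row above.
latencies = [0.2, 0.3, 0.25, 0.4, 0.35, 0.3, 0.28, 0.31, 0.29, 1.9]
report = check_latency_sla(latencies, {50: 0.5, 99: 2.0})
```

With only ten samples the p99 is simply the slowest observation, which is why tail percentiles need a large measurement window to be meaningful.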

### Throughput and Capacity Planning

Throughput requirements must account for peak load, not average load. Data pipelines routinely experience 10x to 100x spikes during business events (end of month, flash sales, marketing campaigns):

- **Steady-state throughput**: Average events or records per second under normal operation
- **Peak throughput**: Maximum burst the system must handle without degradation
- **Sustained peak duration**: How long the peak must be sustained before scaling kicks in
- **Data volume growth rate**: Year-over-year growth trajectory for capacity planning

Document throughput in multiple units because different layers care about different metrics:
- Ingestion layer: events/second, bytes/second
- Processing layer: records/second, transformations/second
- Storage layer: GB/day ingested, TB total, compression ratio expected

Plan capacity for 3x the current peak to avoid emergency scaling. If current peak is 100K events/sec, design for 300K.
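
The headroom rule can be combined with the growth rate from the list above. The following is a small sketch, assuming compound year-over-year growth; the function and parameter names are illustrative.

```python
def design_capacity(current_peak, headroom=3.0, annual_growth=0.0, years=0):
    """Project the peak forward with compound growth, then apply the headroom multiplier."""
    projected_peak = current_peak * (1 + annual_growth) ** years
    return projected_peak * headroom


# The example from the text: 100K events/sec peak, 3x headroom, no growth applied.
today = design_capacity(100_000)

# Same peak, assuming a hypothetical 50% yearly growth over a two-year horizon.
in_two_years = design_capacity(100_000, annual_growth=0.5, years=2)
```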

### Data Freshness Requirements

Data freshness (or staleness tolerance) defines how old data can be before it loses business value. Freshness requirements vary dramatically by use case:

- **Real-time operational decisions** (fraud, recommendations): seconds to minutes of acceptable staleness
- **Operational dashboards** (business KPIs, support queues): minutes to one hour
- **Analytical reporting** (daily/weekly reports): hours to 24 hours
- **Historical analysis and ML training**: days are acceptable

Freshness SLAs drive pipeline scheduling and trigger mechanisms. A freshness requirement of 5 minutes means the pipeline must complete within 5 minutes of new data arriving, which rules out hourly batch jobs. Document the business justification for each freshness requirement; tighter SLAs cost more to build and operate.
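
A freshness SLA can be checked by comparing the newest event timestamp in the serving layer against the current time. This is a minimal sketch; how `latest_event_time` is obtained (for example, a `MAX(event_time)` query) is an assumption left to the caller.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(latest_event_time, now):
    """How far the newest available record trails behind the current time."""
    return now - latest_event_time


def meets_freshness_sla(latest_event_time, max_lag, now=None):
    """True when the lag is within the allowed staleness window."""
    now = now or datetime.now(timezone.utc)
    return freshness_lag(latest_event_time, now) <= max_lag


# Fixed timestamps so the example is deterministic.
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
recent = datetime(2024, 1, 15, 11, 57, tzinfo=timezone.utc)
stale = datetime(2024, 1, 15, 11, 0, tzinfo=timezone.utc)

ok = meets_freshness_sla(recent, timedelta(minutes=5), now=now)
late = meets_freshness_sla(stale, timedelta(minutes=5), now=now)
```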

### Data Quality Budgets

Every pipeline has a data quality budget: the maximum allowable error rate before downstream systems or business decisions are materially impacted. Define explicit thresholds:

**Completeness budget**: Maximum acceptable null rate or missing record rate per field
- Critical fields (revenue, user ID): 0% nulls tolerated
- Enrichment fields (device type, referrer): up to 5% nulls acceptable
- Optional metadata fields: up to 20% nulls acceptable before alerting

**Accuracy budget**: Maximum acceptable rate of values outside valid ranges or formats
- Financial amounts: 0% out-of-range values
- Timestamps: < 0.1% future-dated or epoch-zero values
- Categorical fields: < 1% unrecognized enum values

**Timeliness budget**: Maximum acceptable late-arrival rate for event-driven pipelines
- Streaming pipelines: define maximum allowable event delay before treating as late
- Batch pipelines: define maximum acceptable row count variance versus expected volume

**Consistency budget**: Maximum acceptable divergence between pipeline outputs and source of truth
- Define reconciliation checkpoints and acceptable variance thresholds
- Example: daily record count must match source system within 0.01%

Document quality budgets as tested assertions, not prose. Every quality threshold should have a corresponding data quality test (Great Expectations check, dbt test, or equivalent) that fails the pipeline when the threshold is breached.
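
As an illustration of a budget expressed as a tested assertion, the sketch below checks a completeness budget in plain Python. It stands in for a Great Expectations or dbt test rather than reproducing either tool's API; the field names and rows are made up.

```python
def null_rate(rows, field):
    """Fraction of rows where the field is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def completeness_violations(rows, budgets):
    """Return {field: observed_rate} for every field whose null rate exceeds its budget."""
    violations = {}
    for field, budget in budgets.items():
        rate = null_rate(rows, field)
        if rate > budget:
            violations[field] = rate
    return violations


rows = [
    {"user_id": "u1", "device_type": "ios"},
    {"user_id": "u2", "device_type": None},
    {"user_id": "u3", "device_type": "web"},
    {"user_id": "u4", "device_type": "android"},
]
violations = completeness_violations(rows, {"user_id": 0.0, "device_type": 0.05})
```

A pipeline step could raise an error whenever `violations` is non-empty, which turns the budget into a gate instead of a guideline.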

### Regulatory Compliance Requirements

Identify every regulatory framework that applies before writing the first line of pipeline code. Retrofitting compliance is expensive and error-prone.

**PII (Personally Identifiable Information) inventory**

Catalog every field that constitutes PII under applicable regulations:
- Direct identifiers: name, email, phone, SSN, passport number, IP address
- Quasi-identifiers: ZIP code, birthdate, job title (can identify individuals in combination)
- Sensitive categories (GDPR Article 9): health data, biometric data, racial/ethnic origin, political opinions

For each PII field, document:
- Which regulation governs it (GDPR, CCPA, HIPAA, FERPA, etc.)
- Retention limit (e.g., GDPR: must be deletable on erasure request)
- Processing lawful basis (consent, legitimate interest, contractual necessity)
- Cross-border transfer restrictions (GDPR: EU data cannot flow to non-adequate countries without SCCs or another approved transfer safeguard)

**GDPR-specific pipeline requirements**
- Right to erasure: pipeline must support deletion of a user's data from all downstream stores, including derived tables and ML features
- Data minimization: collect only what is necessary; do not carry fields through the pipeline that are not needed for the stated purpose
- Purpose limitation: data collected for one purpose cannot be reused for an incompatible purpose in a new pipeline
- Records of processing activities: maintain documented inventory of all pipelines that process personal data

**Data residency requirements**
- Some regulations require data to stay within geographic boundaries (EU data in EU, Australian data in Australia)
- Pipeline infrastructure (Kafka brokers, object storage, processing clusters) must be provisioned in compliant regions
- Cross-region replication must be explicitly reviewed against residency rules

**Retention and deletion**
- Define retention periods for every data store the pipeline writes to
- Implement automated purge jobs tied to retention schedules
- Test deletion propagation: deleting from source must eventually delete from all derived stores
- Maintain deletion audit log: timestamp, scope, and confirmation of deletion completion
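
The purge-and-audit idea can be sketched as follows. This is a simplified, in-memory illustration: the `store` dict stands in for a real data store, and the audit entry fields mirror the timestamp, scope, and confirmation items listed above.

```python
from datetime import datetime, timedelta, timezone


def purge_expired(store, retention_days, now):
    """Delete records older than the retention window and return a deletion audit entry."""
    cutoff = now - timedelta(days=retention_days)
    expired = [key for key, record in store.items() if record["created_at"] < cutoff]
    for key in expired:
        del store[key]
    return {
        "timestamp": now.isoformat(),
        "scope": f"created_at < {cutoff.isoformat()}",
        "deleted_count": len(expired),
        # Confirmation: re-check that none of the purged keys remain.
        "confirmed": all(key not in store for key in expired),
    }


utc = timezone.utc
store = {
    "a": {"created_at": datetime(2022, 1, 1, tzinfo=utc)},
    "b": {"created_at": datetime(2024, 1, 1, tzinfo=utc)},
}
audit = purge_expired(store, retention_days=365, now=datetime(2024, 1, 15, tzinfo=utc))
```

A real purge job would also need to propagate the deletion to derived stores, which this sketch does not attempt.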

### Documenting Requirements as Testable Contracts

All requirements must be expressed as testable conditions, not aspirational statements:

```yaml
# requirements.yaml: machine-readable SLA contract
pipeline: user_events
sla:
  latency_p99_seconds: 300
  throughput_peak_events_per_second: 50000
  freshness_max_lag_minutes: 10
quality:
  completeness:
    user_id: 0.0        # 0% nulls
    event_type: 0.0
    timestamp: 0.0
    country_code: 0.05  # 5% nulls acceptable
  accuracy:
    timestamp_future_rate: 0.001
compliance:
  pii_fields: [user_id, email, ip_address]
  retention_days: 365
  gdpr_erasure_supported: true
```

Requirements documented this way feed directly into monitoring dashboards, alerting rules, and data quality test suites. When a requirement changes, the contract file changes, and every downstream test that reads its thresholds from the file re-evaluates against the new value.
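
A minimal sketch of evaluating observed metrics against such a contract is shown below. To stay dependency-free it uses a Python dict that mirrors part of the YAML above; in practice the file would be loaded with a YAML parser. Only upper-bound thresholds are checked here, since peak throughput is a lower bound and needs the opposite comparison. The `observed` structure is an assumption for illustration.

```python
CONTRACT = {
    "sla": {"latency_p99_seconds": 300, "freshness_max_lag_minutes": 10},
    "quality": {"completeness": {"user_id": 0.0, "country_code": 0.05}},
}


def evaluate_contract(contract, observed):
    """Return a list of human-readable failures; every threshold here is an upper bound."""
    failures = []
    for key, limit in contract["sla"].items():
        value = observed["sla"][key]
        if value > limit:
            failures.append(f"sla.{key}: {value} > {limit}")
    for field, budget in contract["quality"]["completeness"].items():
        rate = observed["null_rates"][field]
        if rate > budget:
            failures.append(f"completeness.{field}: {rate} > {budget}")
    return failures


# Hypothetical measurements: latency is fine, freshness lag is over budget.
observed = {
    "sla": {"latency_p99_seconds": 240, "freshness_max_lag_minutes": 12},
    "null_rates": {"user_id": 0.0, "country_code": 0.02},
}
failures = evaluate_contract(CONTRACT, observed)
```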
@@ -0,0 +1,295 @@
---
name: data-pipeline-schema-management
description: Schema registry, evolution patterns, breaking change detection, and contract testing for data pipeline schemas
topics: [data-pipeline, schema-management, schema-registry, schema-evolution, breaking-changes, contract-testing, avro, protobuf]
---

Schema management is the discipline of controlling how the structure of data changes over time while maintaining compatibility between producers and consumers. Without schema governance, a single field rename in a source system cascades into dozens of broken downstream pipelines, dashboards, and ML models. Schema registry, evolution compatibility rules, breaking change detection, and consumer contract tests are the four pillars that make schema changes safe.

## Summary

Register all schemas in a centralized schema registry (Confluent Schema Registry, AWS Glue Schema Registry, or equivalent). Enforce backward compatibility as the minimum standard: consumers using a new schema version must be able to read data written with the previous version. Classify schema changes as breaking or non-breaking before deployment. Run consumer contract tests in CI to catch breaking changes before they reach production. Use schema versioning (`_v2`) for irreversible breaking changes.

## Deep Guidance

### Schema Registry

A schema registry is a centralized service that stores schema versions and enforces compatibility rules when a new version is registered:

**Confluent Schema Registry setup**

```python
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer, AvroDeserializer

schema_registry_client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Define Avro schema (the logical type annotates the underlying long type)
TRANSACTION_SCHEMA_STR = """
{
  "type": "record",
  "name": "Transaction",
  "namespace": "com.example.payments",
  "fields": [
    {"name": "transaction_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
"""

# Register schema (idempotent: registering the same schema again returns the same ID)
schema_id = schema_registry_client.register_schema(
    subject_name="payments_transaction_created-value",
    schema=Schema(TRANSACTION_SCHEMA_STR, schema_type="AVRO"),
)

# Serialize with schema ID embedded in message
avro_serializer = AvroSerializer(
    schema_registry_client=schema_registry_client,
    schema_str=TRANSACTION_SCHEMA_STR,
)
```

**Schema subjects**

Schema Registry organizes schemas by subject. The standard naming convention for Kafka topics:
- Value schema: `{topic-name}-value` (e.g., `payments_transaction_created-value`)
- Key schema: `{topic-name}-key` (e.g., `payments_transaction_created-key`)

For non-Kafka schemas (database tables, API contracts): use `{domain}.{entity}.{version}` as the subject name.
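
The two naming conventions can be captured in small helpers so that subject names are built in one place. This is an illustrative sketch; the function names and the `payments.transaction.v1` example are hypothetical.

```python
def topic_subject(topic, part="value"):
    """Subject name for a Kafka topic's key or value schema: '{topic-name}-{part}'."""
    if part not in ("key", "value"):
        raise ValueError("part must be 'key' or 'value'")
    return f"{topic}-{part}"


def entity_subject(domain, entity, version):
    """Subject name for a non-Kafka schema: '{domain}.{entity}.{version}'."""
    return f"{domain}.{entity}.{version}"


value_subject = topic_subject("payments_transaction_created")
key_subject = topic_subject("payments_transaction_created", part="key")
table_subject = entity_subject("payments", "transaction", "v1")
```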

### Compatibility Modes

Schema Registry enforces compatibility between schema versions. Choose the strictest compatibility mode that your deployment workflow supports:

**BACKWARD (recommended default)**
- New schema can read data written by old schema
- Consumers can be upgraded before producers
- Allowed changes: add optional fields (with defaults), remove fields

```json
// v1 schema
{"fields": [{"name": "id"}, {"name": "amount"}]}

// v2 schema (BACKWARD compatible: added optional field with default)
{"fields": [{"name": "id"}, {"name": "amount"}, {"name": "currency", "default": "USD"}]}
```

**FORWARD**
- Old schema can read data written by new schema
- Producers can be upgraded before consumers
- Allowed changes: remove optional fields, add fields without defaults

**FULL**
- Both backward and forward compatible
- Most restrictive: only add optional fields with defaults, or remove optional fields
- Recommended when upgrade order cannot be guaranteed

**NONE**
- No compatibility enforcement
- Any change is allowed
- Only appropriate for development environments

**Setting compatibility per subject**

```python
schema_registry_client.set_compatibility(
    subject_name="payments_transaction_created-value",
    level="BACKWARD",
)
```
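
The field-level rules behind these modes can be sketched in a few lines. This is a deliberately simplified model, not the registry's actual checker: it looks only at field names and defaults and ignores types, aliases, and nested records.

```python
def can_read(reader_fields, writer_fields):
    """A reader can decode writer data if every reader field was written or has a default."""
    written = {field["name"] for field in writer_fields}
    return all(field["name"] in written or "default" in field for field in reader_fields)


def is_backward_compatible(old_fields, new_fields):
    """BACKWARD: the new schema (reader) can read data written with the old schema."""
    return can_read(new_fields, old_fields)


def is_forward_compatible(old_fields, new_fields):
    """FORWARD: the old schema (reader) can read data written with the new schema."""
    return can_read(old_fields, new_fields)


v1 = [{"name": "id"}, {"name": "amount"}]
v2_with_default = v1 + [{"name": "currency", "default": "USD"}]
v2_no_default = v1 + [{"name": "currency"}]
v2_removed = [{"name": "id"}]  # "amount" removed, and it had no default in v1
```

Under this model, FULL compatibility is simply both checks passing for the same pair of versions.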

### Evolution Patterns

**Adding a field (backward compatible)**

Always provide a default value for new fields in Avro schemas. The default lets consumers on the new schema deserialize messages written before the field existed, while consumers still on the old schema simply ignore the extra field:

```json
// Safe: new field with default
{"name": "merchant_category", "type": ["null", "string"], "default": null}
```

Do NOT add a required field without a default: readers on the new schema cannot decode older messages, which breaks BACKWARD compatibility.

**Removing a field**

Schema Registry accepts a field removal under BACKWARD compatibility, but any consumer code that still reads the field will break, so consumers that use the field must be updated first. Coordinate the removal:
1. Deprecate the field: add a comment/annotation that it will be removed in 30 days
2. Update all consumers to stop reading the field
3. Verify no consumers reference the field (run contract tests)
4. Remove the field from the schema

**Renaming a field (breaking change)**

Renaming is a breaking change for every consumer that reads the field by its old name. The canonical approach:
1. Add the new field name alongside the old field (dual-write period)
2. Update all consumers to read from the new field
3. Verify all consumers migrated
4. Remove the old field

For Avro, use `aliases` to support rename during transition. A reader schema that lists the old name as an alias can still decode data written under that name:
```json
{"name": "merchant_id", "aliases": ["vendor_id"], "type": "string"}
```
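
How an alias lets a reader pick up a renamed field can be illustrated on already-decoded records. The sketch below is a simplification of Avro schema resolution that works on plain dicts, not an Avro library call; `resolve_record` is a hypothetical helper.

```python
def resolve_record(reader_fields, writer_record):
    """Map a writer's record onto the reader's field names, honoring aliases and defaults."""
    resolved = {}
    for field in reader_fields:
        candidates = [field["name"], *field.get("aliases", [])]
        for name in candidates:
            if name in writer_record:
                resolved[field["name"]] = writer_record[name]
                break
        else:
            # No matching name was written: fall back to the default if one exists.
            if "default" not in field:
                raise KeyError(f"no value or default for field {field['name']!r}")
            resolved[field["name"]] = field["default"]
    return resolved


reader_fields = [{"name": "merchant_id", "aliases": ["vendor_id"], "type": "string"}]
old_message = {"vendor_id": "m_42"}    # written before the rename
new_message = {"merchant_id": "m_42"}  # written after the rename
```

Both messages resolve to the same record under the new name, which is what allows producers and consumers to migrate at different times.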

**Changing a field type (breaking change)**

Type changes are generally breaking. Example: after changing `amount` from `int` to `double`, consumers still on the old schema can no longer read new messages. Avro only promotes numeric types in the widening direction, so readers on the new schema can decode old `int` data but not the reverse.

Options:
- Create a new versioned schema subject (`payments_transaction_created_v2-value`)
- Use a union type during transition: `{"type": ["int", "double"]}`
- Use a new field with the new type alongside the old field

### Breaking Change Detection

Detect breaking changes before they reach production using automated tooling:

**Schema compatibility check in CI**

```bash
#!/bin/bash
# scripts/check-schema-compatibility.sh
set -euo pipefail
shopt -s globstar  # enable ** recursive globbing (bash 4+)

for schema_file in schemas/**/*.avsc; do
  subject=$(basename "$schema_file" .avsc)
  echo "==> Checking compatibility for $subject"

  # Test compatibility against the latest registered version.
  # The registry expects the schema as a JSON-encoded string, so jq escapes the file contents.
  result=$(jq -Rsc '{schema: .}' "$schema_file" | curl -s -X POST \
    "http://schema-registry:8081/compatibility/subjects/${subject}-value/versions/latest" \
    -H "Content-Type: application/vnd.schemaregistry.v1+json" \
    -d @-)

  if [[ "$(echo "$result" | jq -r '.is_compatible')" != "true" ]]; then
    echo "ERROR: Breaking change detected in $subject"
    echo "$result" | jq .
    exit 1
  fi
done
echo "All schemas are compatible"
```

**Buf for Protobuf schemas**

For Protobuf-based pipelines, use Buf to detect breaking changes:

```bash
# buf.yaml
version: v1
breaking:
  use:
    - FILE  # detect file-level breaking changes

# In CI:
buf breaking --against ".git#branch=main"
```

**JSONSchema diff tools**

```python
# NOTE: the module name, diff_schemas(), and the is_breaking attribute are illustrative;
# check the API of whichever JSON Schema diff library you adopt.
from json_schema_diff import diff_schemas

def check_json_schema_breaking_changes(old_schema: dict, new_schema: dict) -> list[str]:
    """Return list of breaking changes between schema versions."""
    differences = diff_schemas(old_schema, new_schema)
    breaking = [d for d in differences if d.is_breaking]
    return [str(d) for d in breaking]
```
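
Where no diff library is available, a rough check can be written by hand. The sketch below covers only three cases on top-level `properties` (removed property, changed `type`, newly required property) and treats all three as breaking, which is an assumption that depends on whether producers or consumers are being protected. It is not a complete JSON Schema comparison.

```python
def breaking_changes(old_schema, new_schema):
    """List simple breaking differences between two JSON Schema documents."""
    changes = []
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})

    for name in sorted(old_props.keys() - new_props.keys()):
        changes.append(f"removed property: {name}")

    for name in sorted(old_props.keys() & new_props.keys()):
        if old_props[name].get("type") != new_props[name].get("type"):
            changes.append(f"type changed: {name}")

    newly_required = set(new_schema.get("required", [])) - set(old_schema.get("required", []))
    for name in sorted(newly_required):
        changes.append(f"newly required: {name}")

    return changes


old = {"properties": {"id": {"type": "string"}, "amount": {"type": "number"}}, "required": ["id"]}
new = {"properties": {"id": {"type": "integer"}, "currency": {"type": "string"}}, "required": ["id", "currency"]}
result = breaking_changes(old, new)
```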

### Contract Testing

Consumer contract tests verify that a schema change does not break any known consumer. They run in CI for every schema change.

**Consumer-driven contract testing with Pact**

```python
# Consumer (silver processor): define the contract
from pact import Consumer, Provider

pact = Consumer("silver-processor").has_pact_with(Provider("bronze-events"))

def test_can_consume_transaction_event():
    """Consumer contract: silver processor can consume bronze transaction events."""
    expected_event = {
        "transaction_id": "txn_abc123",
        "amount": 99.99,
        "currency": "USD",
        "user_id": "usr_001234",
        "created_at": 1705329825000,
    }

    (pact
     .given("a valid transaction event exists")
     .upon_receiving("a request for transaction events")
     .with_request("GET", "/events/transaction")
     .will_respond_with(200, body=expected_event))

    with pact:
        # silver_processor and get_test_event() are project-specific helpers defined elsewhere
        result = silver_processor.consume_event(get_test_event())
        assert result["transaction_id"] == "txn_abc123"
```

**Schema-based contract tests without Pact**

For Kafka-based pipelines, consumer contracts can be expressed as schema validation:

```python
# tests/contracts/test_transaction_consumer_contract.py
import json

CONSUMER_REQUIRED_FIELDS = {
    "silver_processor": ["transaction_id", "amount", "currency", "user_id", "created_at"],
    "fraud_detector": ["transaction_id", "amount", "user_id", "merchant_id"],
    "revenue_aggregator": ["transaction_id", "amount", "currency", "created_at"],
}

def test_schema_satisfies_all_consumer_contracts():
    """Verify that the current schema provides all fields required by registered consumers."""
    with open("schemas/payments_transaction_created.avsc") as f:
        schema = json.load(f)
    schema_fields = {field["name"] for field in schema["fields"]}

    for consumer, required_fields in CONSUMER_REQUIRED_FIELDS.items():
        missing = set(required_fields) - schema_fields
        assert not missing, (
            f"Schema breaking change: consumer '{consumer}' requires fields {missing} "
            f"which are not in the current schema"
        )
```

### Table Schema Management in the Warehouse

For SQL-based warehouses (BigQuery, Snowflake, Redshift), manage table schemas with migration files:

```
schemas/
├── migrations/
│   ├── 001_create_silver_transactions.sql
│   ├── 002_add_merchant_category.sql
│   └── 003_add_processing_fee.sql
└── current/
    └── silver_transactions.sql  # current expected schema
```

**Migration file pattern**

```sql
-- schemas/migrations/002_add_merchant_category.sql
-- Migration: add merchant_category field to silver_transactions
-- Author: data-team
-- Date: 2024-01-15
-- Breaking: No (new nullable column)
-- Consumer impact: None (additive change)

ALTER TABLE silver_transactions
  ADD COLUMN IF NOT EXISTS merchant_category STRING OPTIONS(description="MCC category code");

-- Backfill for existing records (run separately after migration)
-- UPDATE silver_transactions SET merchant_category = 'UNKNOWN' WHERE merchant_category IS NULL;
```

Automate migration application in CI/CD and verify the current schema matches expectations after each migration.
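
One check that is cheap to automate in CI is that migration files form an unbroken numeric sequence. The sketch below assumes the `NNN_description.sql` naming shown in the tree above; the function name and error messages are illustrative.

```python
import re

MIGRATION_NAME = re.compile(r"^(\d{3})_[a-z0-9_]+\.sql$")


def check_migration_sequence(filenames):
    """Validate names and ensure prefixes run 001, 002, ... with no gaps or duplicates."""
    numbers = []
    for name in sorted(filenames):
        match = MIGRATION_NAME.match(name)
        if match is None:
            raise ValueError(f"unexpected migration file name: {name}")
        numbers.append(int(match.group(1)))
    expected = list(range(1, len(numbers) + 1))
    if numbers != expected:
        raise ValueError(f"migration numbers {numbers} do not match expected {expected}")
    return numbers


applied_order = check_migration_sequence([
    "002_add_merchant_category.sql",
    "001_create_silver_transactions.sql",
    "003_add_processing_fee.sql",
])
```

In a CI job the file list would come from listing `schemas/migrations/`, and a raised `ValueError` would fail the build.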