@zigrivers/scaffold 3.8.0 → 3.9.1
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +75 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +167 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,326 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-pipeline-security
|
|
3
|
+
description: PII handling, encryption at rest and in transit, access control, and audit logging for data pipelines
|
|
4
|
+
topics: [data-pipeline, security, pii, encryption, access-control, audit-logging, gdpr, data-masking]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Data pipelines are high-value targets for security incidents because they aggregate sensitive data from multiple systems and often run with broad service account permissions. A pipeline that ingests payment data, user PII, and health records and writes to a central warehouse becomes a single point of compromise. Security must be designed into the pipeline from the start — encryption, least-privilege access, PII handling, and audit logging cannot be retrofitted economically.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Encrypt all data at rest and in transit using current standards (TLS 1.2+, AES-256). Apply least-privilege access: each pipeline service account reads only the sources it needs and writes only to its designated destinations. Detect and classify all PII fields at ingestion; mask, tokenize, or drop PII before it reaches analytical layers. Log all data access and transformation events to an immutable audit trail. Test security controls in CI.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Encryption at Rest
|
|
16
|
+
|
|
17
|
+
Every data store the pipeline writes to must encrypt data at rest:
|
|
18
|
+
|
|
19
|
+
**Object storage (S3, GCS, Azure Blob)**
|
|
20
|
+
|
|
21
|
+
```python
|
|
22
|
+
# S3: server-side encryption with AWS KMS
|
|
23
|
+
s3_client = boto3.client("s3")
|
|
24
|
+
s3_client.put_object(
|
|
25
|
+
Bucket="raw-data",
|
|
26
|
+
Key="payments/2024-01-15/transactions.parquet",
|
|
27
|
+
Body=parquet_bytes,
|
|
28
|
+
ServerSideEncryption="aws:kms",
|
|
29
|
+
SSEKMSKeyId="arn:aws:kms:us-east-1:123456789:key/mrk-abc123",
|
|
30
|
+
)
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
Use customer-managed KMS keys (CMKs), not AWS-managed keys, for data subject to regulatory requirements. CMKs allow key rotation control and key deletion for data erasure compliance (GDPR right to erasure via key deletion).
|
|
34
|
+
|
|
35
|
+
**Database encryption**
|
|
36
|
+
|
|
37
|
+
Configure encryption at rest for all databases the pipeline reads from or writes to:
|
|
38
|
+
- PostgreSQL: `pg_tde` extension or filesystem-level encryption (LUKS, dm-crypt)
|
|
39
|
+
- BigQuery: CMEK (customer-managed encryption keys), for example via the `bq` CLI's `--destination_kms_key` flag on load and query jobs
|
|
40
|
+
- Snowflake: Tri-Secret Secure (a composite master key derived from a Snowflake-maintained key and a customer-managed key)
|
|
41
|
+
- Kafka: Apache Kafka does not encrypt log segments itself; encrypt broker volumes with KMS-backed disk encryption, or encrypt message payloads client-side
|
|
42
|
+
|
|
43
|
+
**Parquet/Avro field-level encryption**
|
|
44
|
+
|
|
45
|
+
For sensitive fields, apply field-level encryption in addition to file-level encryption. This allows analytic queries on encrypted files while keeping specific columns (PII) unreadable without the column-level key:
|
|
46
|
+
|
|
47
|
+
```python
|
|
48
|
+
import pyarrow.parquet as pq  # the code below refers to this module as pq
|
|
49
|
+
from pyarrow import compute as pc
|
|
50
|
+
|
|
51
|
+
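# Write with column-level (field-level) encryption using Parquet modular encryption.
# Simplified sketch: in PyArrow, EncryptionConfiguration lives in pyarrow.parquet.encryption,
# and write_table expects FileEncryptionProperties built by a CryptoFactory from this configuration.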
|
|
52
|
+
encryption_config = pq.EncryptionConfiguration(
|
|
53
|
+
footer_key="footer_key_id",
|
|
54
|
+
column_keys={
|
|
55
|
+
"user_pii_key": ["email", "phone", "ssn_hash"],
|
|
56
|
+
"default_key": None, # all other columns use footer key
|
|
57
|
+
},
|
|
58
|
+
)
|
|
59
|
+
pq.write_table(table, "output.parquet", encryption_properties=encryption_config)
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
### Encryption in Transit
|
|
63
|
+
|
|
64
|
+
All data in transit must use TLS 1.2 or higher. Disable older protocols and weak cipher suites:
|
|
65
|
+
|
|
66
|
+
**Kafka TLS configuration**
|
|
67
|
+
|
|
68
|
+
```properties
|
|
69
|
+
# Kafka broker: enforce TLS
|
|
70
|
+
listeners=SSL://:9093
|
|
71
|
+
ssl.keystore.location=/ssl/kafka.server.keystore.jks
|
|
72
|
+
ssl.keystore.password=${KEYSTORE_PASSWORD}
|
|
73
|
+
ssl.truststore.location=/ssl/kafka.server.truststore.jks
|
|
74
|
+
ssl.truststore.password=${TRUSTSTORE_PASSWORD}
|
|
75
|
+
ssl.client.auth=required
|
|
76
|
+
ssl.enabled.protocols=TLSv1.2,TLSv1.3
|
|
77
|
+
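# These are TLS 1.3 suites; TLS 1.2 clients also need a TLS 1.2 suite such as TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
ssl.cipher.suites=TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256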
|
|
78
|
+
|
|
79
|
+
# Producer/consumer: mutual TLS
|
|
80
|
+
security.protocol=SSL
|
|
81
|
+
ssl.keystore.location=/ssl/client.keystore.jks
|
|
82
|
+
ssl.keystore.password=${CLIENT_KEYSTORE_PASSWORD}
|
|
83
|
+
```
|
|
84
|
+
|
|
85
|
+
**API and database connections**
|
|
86
|
+
|
|
87
|
+
```python
|
|
88
|
+
import os

import psycopg2

# PostgreSQL: require TLS and verify the server certificate and hostname
|
|
89
|
+
conn = psycopg2.connect(
|
|
90
|
+
host="db.example.com",
|
|
91
|
+
dbname="source_db",
|
|
92
|
+
user=os.environ["DB_USER"],
|
|
93
|
+
password=os.environ["DB_PASSWORD"],
|
|
94
|
+
sslmode="verify-full", # reject if cert invalid
|
|
95
|
+
sslrootcert="/certs/ca.pem",
|
|
96
|
+
)
|
|
97
|
+
|
|
98
|
+
# HTTP clients: verify certificates, no plain HTTP
|
|
99
|
+
import httpx
|
|
100
|
+
client = httpx.Client(verify=True) # default; never set verify=False in production
|
|
101
|
+
```
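For clients built directly on Python's standard `ssl` module, the TLS 1.2 floor can be set on the context itself. The following is a minimal sketch; the helper name `make_tls_context` is an illustrative assumption, not part of any library.

```python
import ssl
from typing import Optional


def make_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client-side context that verifies certificates and refuses TLS < 1.2."""
    # create_default_context() turns on certificate and hostname verification.
    context = ssl.create_default_context(cafile=ca_file)
    # Refuse to negotiate anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls_context()
```

The resulting context can be passed to any client that accepts an `ssl.SSLContext`. Passing a private CA bundle through `ca_file` plays the same role as `sslrootcert` in the PostgreSQL example.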
|
|
102
|
+
|
|
103
|
+
### Least-Privilege Access Control
|
|
104
|
+
|
|
105
|
+
Each pipeline component should have the minimum permissions necessary to perform its function. Service accounts should never have admin-level permissions.
|
|
106
|
+
|
|
107
|
+
**AWS IAM policy for pipeline service account**
|
|
108
|
+
|
|
109
|
+
```json
|
|
110
|
+
{
|
|
111
|
+
"Version": "2012-10-17",
|
|
112
|
+
"Statement": [
|
|
113
|
+
{
|
|
114
|
+
"Effect": "Allow",
|
|
115
|
+
"Action": ["s3:GetObject", "s3:ListBucket"],
|
|
116
|
+
"Resource": [
|
|
117
|
+
"arn:aws:s3:::raw-data",
|
|
118
|
+
"arn:aws:s3:::raw-data/payments/*"
|
|
119
|
+
]
|
|
120
|
+
},
|
|
121
|
+
{
|
|
122
|
+
"Effect": "Allow",
|
|
123
|
+
"Action": ["s3:PutObject"],
|
|
124
|
+
"Resource": "arn:aws:s3:::silver-data/transactions/*"
|
|
125
|
+
},
|
|
126
|
+
{
|
|
127
|
+
"Effect": "Deny",
|
|
128
|
+
"Action": ["s3:DeleteObject", "s3:DeleteBucket"],
|
|
129
|
+
"Resource": "*"
|
|
130
|
+
}
|
|
131
|
+
]
|
|
132
|
+
}
|
|
133
|
+
```
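Least-privilege policies tend to drift over time, so a CI check that rejects wildcard grants can help. The sketch below is one possible check, not an AWS tool: the function name `find_overbroad_statements` and the two rules it applies are assumptions chosen for illustration.

```python
import json


def find_overbroad_statements(policy: dict) -> list[str]:
    """Describe every Allow statement that uses a wildcard action or resource."""
    problems = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue  # Deny statements may legitimately target "*"
        actions = statement.get("Action", [])
        resources = statement.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(action == "*" or action.endswith(":*") for action in actions):
            problems.append(f"statement {index}: wildcard action")
        if any(resource == "*" for resource in resources):
            problems.append(f"statement {index}: wildcard resource")
    return problems


policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::raw-data/payments/*"},
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    {"Effect": "Deny", "Action": ["s3:DeleteObject"], "Resource": "*"}
  ]
}
""")
problems = find_overbroad_statements(policy)
```

Only the second statement is flagged. The Deny on `"*"` passes because a broad Deny narrows permissions rather than widening them.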
|
|
134
|
+
|
|
135
|
+
**BigQuery column-level access control**
|
|
136
|
+
|
|
137
|
+
```python
|
|
138
|
+
# Grant access to non-PII columns only for analytics users
|
|
139
|
+
from google.cloud import bigquery
|
|
140
|
+
|
|
141
|
+
client = bigquery.Client()
|
|
142
|
+
table_ref = client.get_table("project.silver.transactions")
|
|
143
|
+
|
|
144
|
+
# Column-level security policy: analytics role cannot read PII columns
|
|
145
|
+
table_ref.schema = [
|
|
146
|
+
bigquery.SchemaField("transaction_id", "STRING"),
|
|
147
|
+
bigquery.SchemaField("amount", "FLOAT64"),
|
|
148
|
+
bigquery.SchemaField("currency", "STRING"),
|
|
149
|
+
bigquery.SchemaField("email", "STRING",
|
|
150
|
+
policy_tags=bigquery.PolicyTagList(["projects/project/locations/us/taxonomies/123/policyTags/456"])),
|
|
151
|
+
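]
client.update_table(table_ref, ["schema"])  # persist the change; assigning the attribute alone sends nothing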
|
|
152
|
+
```
|
|
153
|
+
|
|
154
|
+
**Secret management**
|
|
155
|
+
|
|
156
|
+
Never store credentials in pipeline code, DAG files, config files, or environment variables on shared machines:
|
|
157
|
+
|
|
158
|
+
```python
|
|
159
|
+
# Bad: hardcoded in code
|
|
160
|
+
DB_PASSWORD = "s3cr3t"
|
|
161
|
+
|
|
162
|
+
# Bad: plaintext in config file
|
|
163
|
+
# config/prod.yaml: db_password: s3cr3t
|
|
164
|
+
|
|
165
|
+
# Good: fetch from secrets manager at runtime
|
|
166
|
+
import boto3
|
|
167
|
+
|
|
168
|
+
def get_secret(secret_name: str) -> str:
|
|
169
|
+
client = boto3.client("secretsmanager", region_name="us-east-1")
|
|
170
|
+
response = client.get_secret_value(SecretId=secret_name)
|
|
171
|
+
return response["SecretString"]
|
|
172
|
+
|
|
173
|
+
DB_PASSWORD = get_secret("prod/pipeline/db-password")
|
|
174
|
+
```
|
|
175
|
+
|
|
176
|
+
Use short-lived credentials where possible (IAM roles with STS, Workload Identity for GKE) rather than long-lived API keys or passwords.
|
|
177
|
+
|
|
178
|
+
### PII Handling
|
|
179
|
+
|
|
180
|
+
**PII detection at ingestion**
|
|
181
|
+
|
|
182
|
+
Scan incoming records for PII patterns before writing to bronze:
|
|
183
|
+
|
|
184
|
+
```python
|
|
185
|
+
import re
|
|
186
|
+
from dataclasses import dataclass
|
|
187
|
+
|
|
188
|
+
PII_PATTERNS = {
|
|
189
|
+
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
|
|
190
|
+
"ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
|
|
191
|
+
"credit_card": re.compile(r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b"),
|
|
192
|
+
"phone": re.compile(r"\b\+?1?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
|
|
193
|
+
}
|
|
194
|
+
|
|
195
|
+
def detect_pii(record: dict) -> dict[str, list[str]]:
|
|
196
|
+
"""Return mapping of PII type to field names containing that PII."""
|
|
197
|
+
findings = {}
|
|
198
|
+
for field_name, value in record.items():
|
|
199
|
+
if not isinstance(value, str):
|
|
200
|
+
continue
|
|
201
|
+
for pii_type, pattern in PII_PATTERNS.items():
|
|
202
|
+
if pattern.search(value):
|
|
203
|
+
findings.setdefault(pii_type, []).append(field_name)
|
|
204
|
+
return findings
|
|
205
|
+
```
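Pattern matching alone flags any digit run of the right shape as a card number. A Luhn checksum is a common second filter for card-number candidates. The sketch below assumes a helper named `luhn_valid`, which is illustrative and not part of the code above.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # shorter than any card number the pattern above accepts
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


looks_like_card = luhn_valid("4111111111111111")
random_digits = luhn_valid("4111111111111112")
```

Running the checksum only on values that already matched the `credit_card` pattern keeps the extra cost small while cutting false positives from order numbers and similar identifiers.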
|
|
206
|
+
|
|
207
|
+
**PII masking strategies**
|
|
208
|
+
|
|
209
|
+
Choose the masking strategy based on the downstream use case:
|
|
210
|
+
|
|
211
|
+
| Strategy | Description | When to use |
|
|
212
|
+
|----------|-------------|-------------|
|
|
213
|
+
| Tokenization | Replace PII with consistent token (HMAC, UUID) | Join keys across datasets; token is stable |
|
|
214
|
+
| Hashing | One-way hash (SHA-256 + pepper) | Deduplication; cannot reverse |
|
|
215
|
+
| Masking | Replace with masked value (`****@example.com`) | Display in dashboards |
|
|
216
|
+
| Pseudonymization | Replace with synthetic but realistic value | ML training data |
|
|
217
|
+
| Suppression | Remove field entirely | When field has no analytical value |
|
|
218
|
+
|
|
219
|
+
```python
|
|
220
|
+
import hashlib
|
|
221
|
+
import hmac
import os
|
|
222
|
+
|
|
223
|
+
PII_PEPPER = os.environ["PII_PEPPER"] # secret pepper stored in secrets manager
|
|
224
|
+
|
|
225
|
+
def tokenize_pii(value: str) -> str:
|
|
226
|
+
"""Generate a consistent, irreversible token for a PII value."""
|
|
227
|
+
return hmac.new(
|
|
228
|
+
PII_PEPPER.encode(),
|
|
229
|
+
value.lower().strip().encode(),
|
|
230
|
+
hashlib.sha256,
|
|
231
|
+
).hexdigest()
|
|
232
|
+
|
|
233
|
+
def mask_email(email: str) -> str:
|
|
234
|
+
"""Mask email address: j***@example.com"""
|
|
235
|
+
local, domain = email.split("@", 1)
|
|
236
|
+
return f"{local[0]}***@{domain}"
|
|
237
|
+
```
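The strategies in the table can be driven from a per-field policy so that each record is cleaned in one pass. This is a minimal sketch under stated assumptions: the `FIELD_POLICY` mapping, the field names, and the `apply_pii_policy` helper are hypothetical, and the pepper shown is a placeholder.

```python
import hashlib
import hmac

# Hypothetical per-field policy; field names and strategies are illustrative.
FIELD_POLICY = {"email": "tokenize", "phone": "suppress", "name": "mask"}


def apply_pii_policy(record: dict, policy: dict, pepper: bytes) -> dict:
    """Return a copy of `record` with each listed field tokenized, masked, or dropped."""
    cleaned = {}
    for field, value in record.items():
        strategy = policy.get(field)
        if strategy == "suppress":
            continue  # drop the field entirely
        if strategy == "tokenize":
            normalized = str(value).lower().strip().encode()
            cleaned[field] = hmac.new(pepper, normalized, hashlib.sha256).hexdigest()
        elif strategy == "mask":
            cleaned[field] = str(value)[:1] + "***"
        else:
            cleaned[field] = value  # not classified as PII: pass through unchanged
    return cleaned


record = {"email": "Jane@Example.com ", "phone": "555-010-0000", "name": "Jane", "amount": 12.5}
result = apply_pii_policy(record, FIELD_POLICY, pepper=b"demo-pepper-not-for-production")
```

Keeping the policy as data makes it reviewable alongside the schema, and unlisted fields pass through untouched, so newly added PII fields need an explicit policy entry.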
|
|
238
|
+
|
|
239
|
+
**PII in logs**
|
|
240
|
+
|
|
241
|
+
Ensure PII never appears in pipeline logs. Implement a log sanitizer:
|
|
242
|
+
|
|
243
|
+
```python
|
|
244
|
+
import logging
import re
|
|
245
|
+
|
|
246
|
+
SENSITIVE_PATTERNS = [
|
|
247
|
+
(re.compile(r'email=["\']?[\w.@+]+', re.I), 'email=***'),
|
|
248
|
+
(re.compile(r'password=["\']?[\S]+', re.I), 'password=***'),
|
|
249
|
+
(re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '***-**-****'),
|
|
250
|
+
]
|
|
251
|
+
|
|
252
|
+
class PIIRedactingFilter(logging.Filter):
|
|
253
|
+
def filter(self, record: logging.LogRecord) -> bool:
|
|
254
|
+
message = str(record.getMessage())
|
|
255
|
+
for pattern, replacement in SENSITIVE_PATTERNS:
|
|
256
|
+
message = pattern.sub(replacement, message)
|
|
257
|
+
record.msg = message
|
|
258
|
+
record.args = ()
|
|
259
|
+
return True
|
|
260
|
+
```
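Where the filter is attached matters: a filter on a logger only sees records logged through that logger, while a filter on a handler sees every record the handler emits. The sketch below uses a compact stand-in filter (`RedactingFilter`, an illustrative name) and attaches it at the handler level.

```python
import io
import logging
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r'email=["\']?[\w.@+-]+', re.I)


class RedactingFilter(logging.Filter):
    """Compact stand-in for a PII filter: rewrites the formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # applies %-style args first
        message = EMAIL_PATTERN.sub("email=***", message)
        message = SSN_PATTERN.sub("***-**-****", message)
        record.msg, record.args = message, ()
        return True


buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.addFilter(RedactingFilter())  # handler-level: covers child loggers too

logger = logging.getLogger("pipeline.ingest")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output confined to `buffer`
logger.addHandler(handler)

logger.info("loaded row email=%s ssn=%s", "jane@example.com", "123-45-6789")
output = buffer.getvalue()
```

Regex redaction is a safety net rather than a guarantee. It only catches values that match a known pattern, so structured logging that never receives raw PII remains the stronger control.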
|
|
261
|
+
|
|
262
|
+
### Audit Logging
|
|
263
|
+
|
|
264
|
+
Audit logs record who accessed what data, when, and what they did with it. They are required for GDPR, SOC 2, and HIPAA compliance.
|
|
265
|
+
|
|
266
|
+
**Audit event structure**
|
|
267
|
+
|
|
268
|
+
```python
|
|
269
|
+
import dataclasses
from dataclasses import dataclass
|
|
270
|
+
from datetime import datetime
|
|
271
|
+
|
|
272
|
+
@dataclass
|
|
273
|
+
class AuditEvent:
|
|
274
|
+
timestamp: datetime
|
|
275
|
+
pipeline_id: str
|
|
276
|
+
pipeline_version: str
|
|
277
|
+
service_account: str
|
|
278
|
+
action: str # "read", "write", "delete", "transform"
|
|
279
|
+
resource: str # table name, topic, file path
|
|
280
|
+
record_count: int
|
|
281
|
+
pii_fields_accessed: list[str]
|
|
282
|
+
execution_id: str # correlation ID for tracing
|
|
283
|
+
environment: str
|
|
284
|
+
|
|
285
|
+
def emit_audit_event(event: AuditEvent) -> None:
|
|
286
|
+
"""Write audit event to immutable audit log (append-only, no delete permission)."""
|
|
287
|
+
audit_client.log(
|
|
288
|
+
log_name="data-pipeline-audit",
|
|
289
|
+
payload=dataclasses.asdict(event),
|
|
290
|
+
severity="INFO",
|
|
291
|
+
)
|
|
292
|
+
```
|
|
293
|
+
|
|
294
|
+
**Immutable audit log requirements**:
|
|
295
|
+
- Stored in a separate, append-only log store from pipeline data
|
|
296
|
+
- Service accounts that write to audit logs cannot delete from them
|
|
297
|
+
- Retain audit logs for a minimum of 7 years, or for the period the applicable regulation requires
|
|
298
|
+
- Export to SIEM system for anomaly detection
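One way to make tampering detectable, in addition to the storage-level controls listed above, is to chain entries by hash so that each entry commits to the one before it. The class below is an in-memory sketch with hypothetical names (`HashChainedLog`, `GENESIS`), not a replacement for an append-only store.

```python
import hashlib
import json


class HashChainedLog:
    """In-memory append-only log where each entry commits to the previous one."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, payload: dict) -> str:
        previous = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = json.dumps(payload, sort_keys=True)
        digest = hashlib.sha256((previous + body).encode()).hexdigest()
        self._entries.append({"payload": payload, "prev": previous, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute every hash; any edited or removed entry breaks the chain."""
        previous = self.GENESIS
        for entry in self._entries:
            body = json.dumps(entry["payload"], sort_keys=True)
            expected = hashlib.sha256((previous + body).encode()).hexdigest()
            if entry["prev"] != previous or entry["hash"] != expected:
                return False
            previous = entry["hash"]
        return True


log = HashChainedLog()
log.append({"action": "read", "resource": "silver.transactions", "record_count": 120})
log.append({"action": "delete", "resource": "bronze.users", "record_count": 1})
intact = log.verify()

log._entries[0]["payload"]["record_count"] = 0  # simulate tampering
tampered_detected = not log.verify()
```

Publishing the latest hash to a separate system periodically lets an auditor detect truncation as well as edits.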
|
|
299
|
+
|
|
300
|
+
**GDPR erasure audit trail**
|
|
301
|
+
|
|
302
|
+
When processing a right-to-erasure request, emit a deletion audit event for every store purged:
|
|
303
|
+
|
|
304
|
+
```python
|
|
305
|
+
def process_erasure_request(user_id: str, request_id: str) -> ErasureResult:
|
|
306
|
+
"""Process GDPR right-to-erasure request."""
|
|
307
|
+
stores_purged = []
|
|
308
|
+
for store in REGISTERED_PII_STORES:
|
|
309
|
+
count = store.delete_user_records(user_id)
|
|
310
|
+
if count > 0:
|
|
311
|
+
emit_audit_event(AuditEvent(
|
|
312
|
+
action="delete",
|
|
313
|
+
resource=store.name,
|
|
314
|
+
record_count=count,
|
|
315
|
+
pii_fields_accessed=store.pii_fields,
|
|
316
|
+
execution_id=request_id,
|
|
317
|
+
))
|
|
318
|
+
stores_purged.append(store.name)
|
|
319
|
+
|
|
320
|
+
return ErasureResult(
|
|
321
|
+
user_id=user_id,
|
|
322
|
+
request_id=request_id,
|
|
323
|
+
stores_purged=stores_purged,
|
|
324
|
+
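        completed_at=datetime.now(timezone.utc),  # utcnow() is deprecated since Python 3.12; needs `from datetime import datetime, timezone`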
|
|
325
|
+
)
|
|
326
|
+
```
|
|
@@ -0,0 +1,280 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-pipeline-streaming-patterns
|
|
3
|
+
description: Event time vs processing time, windowing strategies, watermarks, and exactly-once semantics for streaming data pipelines
|
|
4
|
+
topics: [data-pipeline, streaming, event-time, processing-time, windowing, watermarks, exactly-once, flink, kafka-streams]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Streaming pipeline correctness depends on understanding the distinction between when an event occurred and when it was processed, handling late-arriving data with watermarks, and ensuring that each event is processed exactly once even in the face of failures. Getting these fundamentals wrong produces pipelines that appear to work but silently compute incorrect aggregations — bugs that surface only when business users notice wrong numbers weeks later.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Use event time (when the event occurred) rather than processing time (when it was received) for all time-based aggregations. Define watermarks to bound how late an event can arrive before being dropped. Choose window types (tumbling, sliding, session) based on the business question. Implement exactly-once semantics at the processing layer and idempotent sinks to guarantee end-to-end exactly-once delivery.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Event Time vs. Processing Time
|
|
16
|
+
|
|
17
|
+
Every streaming event has two timestamps:
|
|
18
|
+
|
|
19
|
+
- **Event time**: When the event actually occurred in the real world (the user clicked, the transaction was authorized, the sensor fired)
|
|
20
|
+
- **Processing time**: When the event arrived at and was processed by the pipeline
|
|
21
|
+
|
|
22
|
+
These two times are rarely the same. Events are delayed by:
|
|
23
|
+
- Network latency (milliseconds to seconds)
|
|
24
|
+
- Mobile clients syncing offline activity (minutes to hours)
|
|
25
|
+
- Source system batch uploads (hours to days)
|
|
26
|
+
- Retries and error recovery (variable)
|
|
27
|
+
|
|
28
|
+
**Why event time matters**
|
|
29
|
+
|
|
30
|
+
Computing aggregations using processing time produces incorrect results when events arrive late. Suppose a purchase is made at 11:58 PM and the mobile app syncs it at 12:05 AM: the purchase belongs to the 11 PM hour, not the midnight hour, but an aggregation keyed on processing time puts it in the wrong window.
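The late-sync scenario can be made concrete with a few lines of Python. This is an illustrative sketch: the field names `event_time` and `processing_time` and the helper `hour_bucket` are assumptions, and the dates are invented.

```python
from datetime import datetime


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


purchase = {
    "event_time": datetime(2024, 1, 15, 23, 58),      # when the user paid
    "processing_time": datetime(2024, 1, 16, 0, 5),   # when the app synced
}

event_bucket = hour_bucket(purchase["event_time"])
processing_bucket = hour_bucket(purchase["processing_time"])
```

The two buckets fall on different hours and different calendar days, so an hourly or daily revenue report keyed on processing time would misattribute the purchase.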
|
|
31
|
+
|
|
32
|
+
**Always use event time for**:
|
|
33
|
+
- Revenue and transaction aggregations
|
|
34
|
+
- User activity metrics
|
|
35
|
+
- Time-series analytics
|
|
36
|
+
- SLA measurement
|
|
37
|
+
|
|
38
|
+
**Processing time is appropriate for**:
|
|
39
|
+
- Pipeline throughput monitoring (how many events did the system process per second)
|
|
40
|
+
- System health metrics
|
|
41
|
+
- Deduplication windows within the pipeline itself (not based on business events)
|
|
42
|
+
|
|
43
|
+
### Watermarks
|
|
44
|
+
|
|
45
|
+
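A watermark is the pipeline's estimate of how far event time has progressed: a watermark of T asserts that no more events with timestamps earlier than T are expected. It trails the newest event time seen by a configured maximum delay W, so an event that arrives more than W behind the newest event is treated as late and its window may already have closed.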
|
|
46
|
+
|
|
47
|
+
**Watermark definition**
|
|
48
|
+
|
|
49
|
+
```
|
|
50
|
+
watermark(t) = max(event_time seen so far) - max_delay
|
|
51
|
+
```
|
|
52
|
+
|
|
53
|
+
If the pipeline has seen events with timestamps up to 14:55:00, and the configured max delay is 5 minutes, the watermark is 14:50:00. All windows ending before 14:50:00 are considered complete and can be emitted.
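The formula and the 14:55 example can be traced with a small tracker. This is a plain-Python sketch of the bounded-out-of-orderness idea, not Flink's implementation. The class name is hypothetical, and treating an event as late when it is strictly earlier than the watermark is a simplifying assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


class BoundedOutOfOrdernessWatermark:
    """Tracks watermark = max(event_time seen so far) - max_delay."""

    def __init__(self, max_delay: timedelta) -> None:
        self.max_delay = max_delay
        self.max_event_time: Optional[datetime] = None

    def observe(self, event_time: datetime) -> None:
        # Out-of-order events never move the watermark backwards.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    def current(self) -> Optional[datetime]:
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_delay

    def is_late(self, event_time: datetime) -> bool:
        watermark = self.current()
        return watermark is not None and event_time < watermark


tracker = BoundedOutOfOrdernessWatermark(max_delay=timedelta(minutes=5))
for seen in (
    datetime(2024, 1, 15, 14, 52),
    datetime(2024, 1, 15, 14, 55),
    datetime(2024, 1, 15, 14, 53),  # out of order, does not lower the watermark
):
    tracker.observe(seen)

watermark = tracker.current()
```

With the newest event at 14:55:00 and a 5-minute delay, the watermark sits at 14:50:00, matching the worked example.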
|
|
54
|
+
|
|
55
|
+
**Choosing watermark delay**
|
|
56
|
+
|
|
57
|
+
The watermark delay must be large enough to allow late-arriving events to arrive, but small enough to not indefinitely delay results:
|
|
58
|
+
|
|
59
|
+
| Event source | Typical delay | Recommended watermark |
|
|
60
|
+
|-------------|--------------|----------------------|
|
|
61
|
+
| Server-side events | < 1 second | 10–30 seconds |
|
|
62
|
+
| Mobile app events | < 5 minutes | 5–15 minutes |
|
|
63
|
+
| IoT devices | < 1 minute | 2–5 minutes |
|
|
64
|
+
| Batch file uploads | < 1 hour | 2–4 hours |
|
|
65
|
+
|
|
66
|
+
**Late event handling strategies**
|
|
67
|
+
|
|
68
|
+
When an event arrives after the watermark (too late for its window):
|
|
69
|
+
1. **Drop** (default): The event is discarded; the window has already closed
|
|
70
|
+
2. **Side output / late data stream**: Route to a separate stream for separate processing or DLQ
|
|
71
|
+
3. **Allowed lateness**: Keep the window open for an extra duration, recompute and emit updated results
|
|
72
|
+
|
|
73
|
+
```python
|
|
74
|
+
# Apache Flink: watermark with late data handling (pseudocode: mixes Java and PyFlink method names)
|
|
75
|
+
stream
|
|
76
|
+
.assignTimestampsAndWatermarks(
|
|
77
|
+
WatermarkStrategy
|
|
78
|
+
.forBoundedOutOfOrderness(Duration.ofMinutes(5)) # 5-minute watermark
|
|
79
|
+
.withTimestampAssigner(lambda e, _: e['event_timestamp'])
|
|
80
|
+
)
|
|
81
|
+
.key_by(lambda e: e['merchant_id'])
|
|
82
|
+
.window(TumblingEventTimeWindows.of(Time.hours(1)))
|
|
83
|
+
.allowed_lateness(Duration.ofMinutes(30)) # keep window 30 min past watermark
|
|
84
|
+
.side_output_late_data(late_output_tag) # capture late events
|
|
85
|
+
.aggregate(RevenueAggregator())
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
### Window Types
|
|
89
|
+
|
|
90
|
+
Choose the window type based on the business question:
|
|
91
|
+
|
|
92
|
+
**Tumbling windows (fixed, non-overlapping)**
|
|
93
|
+
|
|
94
|
+
Each event belongs to exactly one window. Windows are discrete and consecutive.
|
|
95
|
+
|
|
96
|
+
```
|
|
97
|
+
|--- 10:00 – 11:00 ---|--- 11:00 – 12:00 ---|--- 12:00 – 13:00 ---|
|
|
98
|
+
```
|
|
99
|
+
|
|
100
|
+
Use for: Hourly/daily aggregations, fixed-period KPIs, batch-style streaming.
|
|
101
|
+
|
|
102
|
+
```python
|
|
103
|
+
# Transactions per hour
|
|
104
|
+
stream.window(TumblingEventTimeWindows.of(Time.hours(1)))
|
|
105
|
+
.aggregate(CountAggregator())
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
**Sliding windows (fixed size, overlapping)**
|
|
109
|
+
|
|
110
|
+
Windows slide at a fixed interval. Each event may belong to multiple windows.
|
|
111
|
+
|
|
112
|
+
```
|
|
113
|
+
Window size=60min, slide=15min:
|
|
114
|
+
|--- 10:00 – 11:00 ---|
|
|
115
|
+
|--- 10:15 – 11:15 ---|
|
|
116
|
+
|--- 10:30 – 11:30 ---|
|
|
117
|
+
```
|
|
118
|
+
|
|
119
|
+
Use for: Moving averages, rolling metrics, anomaly detection over recent history.
|
|
120
|
+
|
|
121
|
+
```python
|
|
122
|
+
# 1-hour rolling average, updated every 15 minutes
|
|
123
|
+
stream.window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(15)))
|
|
124
|
+
.aggregate(MovingAverageAggregator())
|
|
125
|
+
```
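The difference between tumbling and sliding assignment can be seen by computing the windows for one timestamp. The sketch below is plain Python with assumed helper names (`tumbling_window`, `sliding_windows`). It aligns windows to the Unix epoch, which is a common convention but still an assumption here.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window(ts: datetime, size: timedelta) -> tuple[datetime, datetime]:
    """Return the single [start, end) tumbling window that contains ts."""
    start = ts - ((ts - EPOCH) % size)
    return start, start + size


def sliding_windows(
    ts: datetime, size: timedelta, slide: timedelta
) -> list[tuple[datetime, datetime]]:
    """Return every [start, end) sliding window that contains ts, earliest first."""
    start = ts - ((ts - EPOCH) % slide)  # latest window start at or before ts
    windows = []
    while start > ts - size:  # the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return list(reversed(windows))


event_time = datetime(2024, 1, 15, 10, 40)
one_window = tumbling_window(event_time, timedelta(hours=1))
many_windows = sliding_windows(event_time, timedelta(hours=1), timedelta(minutes=15))
```

An event at 10:40 lands in one tumbling window but in four sliding windows (size divided by slide), which is why sliding aggregations cost proportionally more state and compute.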
|
|
126
|
+
|
|
127
|
+
**Session windows (dynamic, gap-based)**
|
|
128
|
+
|
|
129
|
+
Windows close after a period of inactivity. Window size varies per user.
|
|
130
|
+
|
|
131
|
+
```
|
|
132
|
+
User A: |--active--| gap |--active--|
|
|
133
|
+
User B: |----active----|
|
|
134
|
+
```
|
|
135
|
+
|
|
136
|
+
Use for: User session analysis, visit duration, activity burst detection.
|
|
137
|
+
|
|
138
|
+
```python
|
|
139
|
+
# Session window with 30-minute inactivity gap
|
|
140
|
+
stream.key_by(lambda e: e['user_id'])
|
|
141
|
+
.window(EventTimeSessionWindows.with_gap(Time.minutes(30)))
|
|
142
|
+
.aggregate(SessionAggregator())
|
|
143
|
+
```
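Gap-based grouping can be sketched without a streaming engine. The function below is illustrative plain Python (`sessionize` is a hypothetical name) for a single key's events. Events exactly one gap apart start a new session here, a boundary choice that real engines may make differently.

```python
from datetime import datetime, timedelta


def sessionize(
    timestamps: list[datetime], gap: timedelta
) -> list[tuple[datetime, datetime]]:
    """Group one key's event times into (first_event, last_event) sessions."""
    sessions: list[tuple[datetime, datetime]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1] = (sessions[-1][0], ts)  # extends the open session
        else:
            sessions.append((ts, ts))  # inactivity gap reached: new session
    return sessions


day = datetime(2024, 1, 15)
user_events = [
    day.replace(hour=10, minute=0),
    day.replace(hour=10, minute=10),
    day.replace(hour=10, minute=25),
    day.replace(hour=11, minute=30),
    day.replace(hour=11, minute=40),
]
sessions = sessionize(user_events, gap=timedelta(minutes=30))
```

The five events collapse into two sessions because the 65-minute pause between 10:25 and 11:30 exceeds the 30-minute gap.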
|
|
144
|
+
|
|
145
|
+
**Global windows**
|
|
146
|
+
|
|
147
|
+
Unbounded single window; requires custom trigger to emit results.
|
|
148
|
+
|
|
149
|
+
Use for: Stream-table joins, custom trigger-based aggregations.
|
|
150
|
+
|
|
151
|
+
### Exactly-Once Semantics
|
|
152
|
+
|
|
153
|
+
Exactly-once means each event is processed and its effects are reflected in the output exactly once, even if the processing system fails and restarts.
|
|
154
|
+
|
|
155
|
+
**Delivery guarantees**
|
|
156
|
+
|
|
157
|
+
- **At-most-once**: Events may be lost, never duplicated. Simplest but loses data on failure.
|
|
158
|
+
- **At-least-once**: Events are never lost but may be duplicated. Requires idempotent consumers.
|
|
159
|
+
- **Exactly-once**: Events are neither lost nor duplicated. Most complex.
|
|
160
|
+
|
|
161
|
+
Most production streaming systems combine at-least-once delivery with idempotent sinks to achieve end-to-end exactly-once behavior.
|
|
162
|
+
|
|
163
|
+
**Flink exactly-once checkpointing**
|
|
164
|
+
|
|
165
|
+
Apache Flink implements exactly-once via distributed checkpointing:
|
|
166
|
+
|
|
167
|
+
```python
|
|
168
|
+
env = StreamExecutionEnvironment.get_execution_environment()
|
|
169
|
+
env.enable_checkpointing(60000) # checkpoint every 60 seconds
|
|
170
|
+
|
|
171
|
+
checkpoint_config = env.get_checkpoint_config()
|
|
172
|
+
checkpoint_config.set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
|
|
173
|
+
checkpoint_config.set_min_pause_between_checkpoints(30000) # 30s between checkpoints
|
|
174
|
+
checkpoint_config.set_checkpoint_timeout(120000) # fail if checkpoint takes >2min
|
|
175
|
+
checkpoint_config.enable_externalized_checkpoints(
|
|
176
|
+
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
|
|
177
|
+
)
|
|
178
|
+
```
|
|
179
|
+
|
|
180
|
+
Flink checkpoints snapshot the full operator state, including Kafka source offsets, to durable storage (S3, HDFS). On failure, the job restarts from the last successful checkpoint and replays events from Kafka starting at the offsets recorded in that checkpoint.
|
|
181
|
+
|
|
182
|
+
**Kafka Streams exactly-once**
|
|
183
|
+
|
|
184
|
+
```python
|
|
185
|
+
props = {
|
|
186
|
+
    "processing.guarantee": "exactly_once_v2",  # EOS v2 (Kafka Streams 3.0+, brokers 2.5+)
|
|
187
|
+
"bootstrap.servers": "localhost:9092",
|
|
188
|
+
"application.id": "payments-processor",
|
|
189
|
+
}
|
|
190
|
+
```
|
|
191
|
+
|
|
192
|
+
Kafka Streams EOS uses transactions: reads from input topics, writes to output topics, and commits consumer offsets atomically in a single Kafka transaction. Either all three happen or none do.
|
|
193
|
+
|
|
194
|
+
**Idempotent sinks for exactly-once delivery**
|
|
195
|
+
|
|
196
|
+
Even with exactly-once processing semantics, the sink (database, warehouse) must be idempotent to achieve end-to-end exactly-once:
|
|
197
|
+
|
|
198
|
+
```python
|
|
199
|
+
# PostgreSQL: use INSERT ON CONFLICT for idempotent writes
|
|
200
|
+
def write_to_postgres(records: list[dict], conn) -> None:
|
|
201
|
+
with conn.cursor() as cur:
|
|
202
|
+
for record in records:
|
|
203
|
+
cur.execute("""
|
|
204
|
+
INSERT INTO transactions (id, amount, currency, processed_at)
|
|
205
|
+
VALUES (%(id)s, %(amount)s, %(currency)s, %(processed_at)s)
|
|
206
|
+
ON CONFLICT (id) DO UPDATE SET
|
|
207
|
+
amount = EXCLUDED.amount,
|
|
208
|
+
currency = EXCLUDED.currency,
|
|
209
|
+
processed_at = EXCLUDED.processed_at
|
|
210
|
+
""", record)
|
|
211
|
+
|
|
212
|
+
# BigQuery: use MERGE for idempotent loads
|
|
213
|
+
MERGE_SQL = """
|
|
214
|
+
MERGE silver_transactions T
|
|
215
|
+
USING (SELECT * FROM UNNEST(@records)) S
|
|
216
|
+
ON T.transaction_id = S.transaction_id
|
|
217
|
+
WHEN MATCHED THEN UPDATE SET
|
|
218
|
+
amount = S.amount, currency = S.currency
|
|
219
|
+
WHEN NOT MATCHED THEN INSERT ROW
|
|
220
|
+
"""
|
|
221
|
+
```
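The effect of an idempotent sink is easy to check locally by replaying a batch. The sketch below uses SQLite from the standard library as a stand-in for the warehouse and assumes SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`. The table layout and the `write_batch` helper are illustrative.

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO transactions (id, amount, currency)
VALUES (:id, :amount, :currency)
ON CONFLICT (id) DO UPDATE SET
    amount = excluded.amount,
    currency = excluded.currency
"""


def write_batch(conn: sqlite3.Connection, records: list[dict]) -> None:
    """Upsert a batch in one transaction: commit on success, roll back on error."""
    with conn:
        conn.executemany(UPSERT_SQL, records)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id TEXT PRIMARY KEY, amount REAL, currency TEXT)")

batch = [
    {"id": "t1", "amount": 10.0, "currency": "USD"},
    {"id": "t2", "amount": 25.5, "currency": "EUR"},
]
write_batch(conn, batch)
write_batch(conn, batch)  # simulated replay after a restart

row_count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]
```

After the replay the table still holds two rows and the same total, which is the property an at-least-once pipeline relies on.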
|
|
222
|
+
|
|
223
|
+
### State Management in Streaming
|
|
224
|
+
|
|
225
|
+
Stateful streaming operations (aggregations, joins, deduplication) require managing state across events:
|
|
226
|
+
|
|
227
|
+
**State backend selection**
|
|
228
|
+
|
|
229
|
+
| Backend | Use case | Tradeoffs |
|
|
230
|
+
|---------|----------|-----------|
|
|
231
|
+
| In-memory (HashMap) | Dev/test, small state | Lost on failure, no persistence |
|
|
232
|
+
| RocksDB (embedded) | Production, large state | Persisted locally, spills to disk |
|
|
233
|
+
| Remote (Redis, DynamoDB) | Cross-instance shared state | Network latency, operational cost |
|
|
234
|
+
|
|
235
|
+
**State TTL (Time-To-Live)**
|
|
236
|
+
|
|
237
|
+
Always set TTL on state to prevent unbounded state growth:
|
|
238
|
+
|
|
239
|
+
```python
|
|
240
|
+
# Flink: expire deduplication state after 24 hours
|
|
241
|
+
state_ttl_config = StateTtlConfig \
|
|
242
|
+
.new_builder(Time.hours(24)) \
|
|
243
|
+
.set_update_type(StateTtlConfig.UpdateType.OnReadAndWrite) \
|
|
244
|
+
.set_state_visibility(StateTtlConfig.StateVisibility.NeverReturnExpired) \
|
|
245
|
+
.build()
|
|
246
|
+
|
|
247
|
+
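# TTL is configured on the descriptor, before the state handle is requested
|
|
248
|
+
dedup_descriptor = ValueStateDescriptor("seen_events", Types.STRING())
|
|
249
|
+
dedup_descriptor.enable_time_to_live(state_ttl_config)
|
|
250
|
+
dedup_state = runtime_context.get_state(dedup_descriptor)  # runtime_context: the argument passed to open()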
|
|
251
|
+
```
|
|
252
|
+
|
|
253
|
+
**Deduplication using state**
|
|
254
|
+
|
|
255
|
+
```python
|
|
256
|
+
class DeduplicationFunction(KeyedProcessFunction):
|
|
257
|
+
    def open(self, runtime_context):
|
|
258
|
+
        state_descriptor = ValueStateDescriptor("seen", Types.BOOLEAN())
|
|
259
|
+
        ttl_config = StateTtlConfig.new_builder(Time.hours(24)).build()
|
|
260
|
+
        state_descriptor.enable_time_to_live(ttl_config)
|
|
261
|
+
        self.seen = runtime_context.get_state(state_descriptor)
|
|
262
|
+
|
|
263
|
+
    def process_element(self, event, ctx):
|
|
264
|
+
        if self.seen.value() is None:
|
|
265
|
+
            self.seen.update(True)
|
|
266
|
+
            yield event  # first occurrence: emit (PyFlink process functions yield results)
|
|
267
|
+
# else: duplicate, silently drop
|
|
268
|
+
```
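The interaction between deduplication and TTL can be exercised without a Flink cluster. The class below is a plain-Python sketch with hypothetical names (`TtlDeduplicator`, `is_first_occurrence`). It expires an ID a fixed time after it was first seen, which is one possible policy and an assumption here.

```python
class TtlDeduplicator:
    """Remembers event IDs for ttl_seconds so that state stays bounded."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._first_seen: dict[str, float] = {}

    def is_first_occurrence(self, event_id: str, now: float) -> bool:
        self._expire(now)
        if event_id in self._first_seen:
            return False  # duplicate inside the TTL window: drop
        self._first_seen[event_id] = now
        return True

    def _expire(self, now: float) -> None:
        expired = [
            key for key, seen in self._first_seen.items()
            if now - seen >= self.ttl_seconds
        ]
        for key in expired:
            del self._first_seen[key]

    def state_size(self) -> int:
        return len(self._first_seen)


dedup = TtlDeduplicator(ttl_seconds=60)
first = dedup.is_first_occurrence("evt-1", now=0)
duplicate = dedup.is_first_occurrence("evt-1", now=30)
after_expiry = dedup.is_first_occurrence("evt-1", now=61)
```

The last call returns `True` again: a duplicate that arrives after the TTL passes through, so the TTL should exceed the longest replay or retry delay the source can produce.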
|
|
269
|
+
|
|
270
|
+
### Streaming Pipeline Operational Metrics
|
|
271
|
+
|
|
272
|
+
Monitor these metrics on every production streaming pipeline:
|
|
273
|
+
|
|
274
|
+
- **Consumer lag** (Kafka): events behind the latest offset per partition
|
|
275
|
+
- **Event time lag**: difference between watermark and wall clock
|
|
276
|
+
- **Checkpoint duration and success rate** (Flink)
|
|
277
|
+
- **Late event rate**: percentage of events arriving after watermark
|
|
278
|
+
- **DLQ write rate**: events being routed to dead-letter queue
|
|
279
|
+
- **Throughput** (events/sec, bytes/sec): actual vs. capacity headroom
|
|
280
|
+
- **State size**: total operator state in bytes; watch for unbounded growth
|