@zigrivers/scaffold 3.8.0 → 3.9.1

Files changed (70)
  1. package/README.md +73 -8
  2. package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
  3. package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
  4. package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
  5. package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
  6. package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
  7. package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
  8. package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
  9. package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
  10. package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
  11. package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
  12. package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
  13. package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
  14. package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
  15. package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
  16. package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
  17. package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
  18. package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
  19. package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
  20. package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
  21. package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
  22. package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
  23. package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
  24. package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
  25. package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
  26. package/content/knowledge/ml/ml-architecture.md +172 -0
  27. package/content/knowledge/ml/ml-conventions.md +209 -0
  28. package/content/knowledge/ml/ml-dev-environment.md +299 -0
  29. package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
  30. package/content/knowledge/ml/ml-model-evaluation.md +256 -0
  31. package/content/knowledge/ml/ml-observability.md +253 -0
  32. package/content/knowledge/ml/ml-project-structure.md +216 -0
  33. package/content/knowledge/ml/ml-requirements.md +138 -0
  34. package/content/knowledge/ml/ml-security.md +188 -0
  35. package/content/knowledge/ml/ml-serving-patterns.md +243 -0
  36. package/content/knowledge/ml/ml-testing.md +301 -0
  37. package/content/knowledge/ml/ml-training-patterns.md +269 -0
  38. package/content/methodology/browser-extension-overlay.yml +82 -0
  39. package/content/methodology/data-pipeline-overlay.yml +70 -0
  40. package/content/methodology/ml-overlay.yml +70 -0
  41. package/dist/cli/commands/init.d.ts +13 -0
  42. package/dist/cli/commands/init.d.ts.map +1 -1
  43. package/dist/cli/commands/init.js +122 -2
  44. package/dist/cli/commands/init.js.map +1 -1
  45. package/dist/cli/commands/init.test.js +120 -0
  46. package/dist/cli/commands/init.test.js.map +1 -1
  47. package/dist/config/schema.d.ts +864 -48
  48. package/dist/config/schema.d.ts.map +1 -1
  49. package/dist/config/schema.js +53 -0
  50. package/dist/config/schema.js.map +1 -1
  51. package/dist/config/schema.test.js +166 -3
  52. package/dist/config/schema.test.js.map +1 -1
  53. package/dist/core/assembly/overlay-loader.test.js +33 -0
  54. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  55. package/dist/e2e/project-type-overlays.test.d.ts +2 -2
  56. package/dist/e2e/project-type-overlays.test.js +499 -33
  57. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  58. package/dist/types/config.d.ts +10 -1
  59. package/dist/types/config.d.ts.map +1 -1
  60. package/dist/wizard/questions.d.ts +17 -1
  61. package/dist/wizard/questions.d.ts.map +1 -1
  62. package/dist/wizard/questions.js +75 -1
  63. package/dist/wizard/questions.js.map +1 -1
  64. package/dist/wizard/questions.test.js +167 -0
  65. package/dist/wizard/questions.test.js.map +1 -1
  66. package/dist/wizard/wizard.d.ts +13 -0
  67. package/dist/wizard/wizard.d.ts.map +1 -1
  68. package/dist/wizard/wizard.js +17 -1
  69. package/dist/wizard/wizard.js.map +1 -1
  70. package/package.json +1 -1
@@ -0,0 +1,326 @@
1
+ ---
2
+ name: data-pipeline-security
3
+ description: PII handling, encryption at rest and in transit, access control, and audit logging for data pipelines
4
+ topics: [data-pipeline, security, pii, encryption, access-control, audit-logging, gdpr, data-masking]
5
+ ---
6
+
7
+ Data pipelines are high-value targets for security incidents because they aggregate sensitive data from multiple systems and often run with broad service account permissions. A pipeline that ingests payment data, user PII, and health records and writes to a central warehouse becomes a single point of compromise. Security must be designed into the pipeline from the start — encryption, least-privilege access, PII handling, and audit logging cannot be retrofitted economically.
8
+
9
+ ## Summary
10
+
11
+ Encrypt all data at rest and in transit using current standards (TLS 1.2+, AES-256). Apply least-privilege access: each pipeline service account reads only the sources it needs and writes only to its designated destinations. Detect and classify all PII fields at ingestion; mask, tokenize, or drop PII before it reaches analytical layers. Log all data access and transformation events to an immutable audit trail. Test security controls in CI.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Encryption at Rest
16
+
17
+ Every data store the pipeline writes to must encrypt data at rest:
18
+
19
+ **Object storage (S3, GCS, Azure Blob)**
20
+
21
+ ```python
22
+ # S3: server-side encryption with AWS KMS
23
+ s3_client = boto3.client("s3")
24
+ s3_client.put_object(
25
+ Bucket="raw-data",
26
+ Key="payments/2024-01-15/transactions.parquet",
27
+ Body=parquet_bytes,
28
+ ServerSideEncryption="aws:kms",
29
+ SSEKMSKeyId="arn:aws:kms:us-east-1:123456789:key/mrk-abc123",
30
+ )
31
+ ```
32
+
33
+ Use customer-managed KMS keys (CMKs), not AWS-managed keys, for data subject to regulatory requirements. CMKs allow key rotation control and key deletion for data erasure compliance (GDPR right to erasure via key deletion).
34
+
35
+ **Database encryption**
36
+
37
+ Configure encryption at rest for all databases the pipeline reads from or writes to:
38
+ - PostgreSQL: `pg_tde` extension or filesystem-level encryption (LUKS, dm-crypt)
39
+ - BigQuery: CMEK (customer-managed encryption keys), for example via the `bq` tool's `--destination_kms_key` flag
40
+ - Snowflake: Tri-Secret Secure (a composite master key that combines a Snowflake-maintained key with a customer-managed key)
41
+ - Kafka: Apache Kafka does not encrypt log segments itself; encrypt broker volumes (or use a managed service's at-rest encryption) with KMS-backed keys
42
+
43
+ **Parquet/Avro field-level encryption**
44
+
45
+ For sensitive fields, apply field-level encryption in addition to file-level encryption. This allows analytic queries on encrypted files while keeping specific columns (PII) unreadable without the column-level key:
46
+
47
+ ```python
48
+ import pyarrow.parquet as pq
49
+ import pyarrow.parquet.encryption as pe
50
+
51
+ # Write with column-level encryption (Parquet modular encryption)
52
+ encryption_config = pe.EncryptionConfiguration(
53
+     footer_key="footer_key_id",
54
+     column_keys={"user_pii_key": ["email", "phone", "ssn_hash"]},  # key ID -> columns it encrypts
55
+ )
56
+ # kms_client_factory: caller-supplied function that returns a pe.KmsClient for your KMS
57
+ crypto_factory = pe.CryptoFactory(kms_client_factory)
58
+ encryption_properties = crypto_factory.file_encryption_properties(pe.KmsConnectionConfig(), encryption_config)
59
+ pq.write_table(table, "output.parquet", encryption_properties=encryption_properties)
60
+ ```
61
+
62
+ ### Encryption in Transit
63
+
64
+ All data in transit must use TLS 1.2 or higher. Disable older protocols and weak cipher suites:
65
+
66
+ **Kafka TLS configuration**
67
+
68
+ ```properties
69
+ # Kafka broker: enforce TLS
70
+ listeners=SSL://:9093
71
+ ssl.keystore.location=/ssl/kafka.server.keystore.jks
72
+ ssl.keystore.password=${KEYSTORE_PASSWORD}
73
+ ssl.truststore.location=/ssl/kafka.server.truststore.jks
74
+ ssl.truststore.password=${TRUSTSTORE_PASSWORD}
75
+ ssl.client.auth=required
76
+ ssl.enabled.protocols=TLSv1.2,TLSv1.3
77
+ ssl.cipher.suites=TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
78
+
79
+ # Producer/consumer: mutual TLS
80
+ security.protocol=SSL
81
+ ssl.keystore.location=/ssl/client.keystore.jks
82
+ ssl.keystore.password=${CLIENT_KEYSTORE_PASSWORD}
83
+ ```
84
+
85
+ **API and database connections**
86
+
87
+ ```python
88
+ # PostgreSQL: require SSL
89
+ conn = psycopg2.connect(
90
+ host="db.example.com",
91
+ dbname="source_db",
92
+ user=os.environ["DB_USER"],
93
+ password=os.environ["DB_PASSWORD"],
94
+ sslmode="verify-full", # reject if cert invalid
95
+ sslrootcert="/certs/ca.pem",
96
+ )
97
+
98
+ # HTTP clients: verify certificates, no plain HTTP
99
+ import httpx
100
+ client = httpx.Client(verify=True) # default; never set verify=False in production
101
+ ```
102
+
103
+ ### Least-Privilege Access Control
104
+
105
+ Each pipeline component should have the minimum permissions necessary to perform its function. Service accounts should never have admin-level permissions.
106
+
107
+ **AWS IAM policy for pipeline service account**
108
+
109
+ ```json
110
+ {
111
+ "Version": "2012-10-17",
112
+ "Statement": [
113
+ {
114
+ "Effect": "Allow",
115
+ "Action": ["s3:GetObject", "s3:ListBucket"],
116
+ "Resource": [
117
+ "arn:aws:s3:::raw-data",
118
+ "arn:aws:s3:::raw-data/payments/*"
119
+ ]
120
+ },
121
+ {
122
+ "Effect": "Allow",
123
+ "Action": ["s3:PutObject"],
124
+ "Resource": "arn:aws:s3:::silver-data/transactions/*"
125
+ },
126
+ {
127
+ "Effect": "Deny",
128
+ "Action": ["s3:DeleteObject", "s3:DeleteBucket"],
129
+ "Resource": "*"
130
+ }
131
+ ]
132
+ }
133
+ ```
134
+
135
+ **BigQuery column-level access control**
136
+
137
+ ```python
138
+ # Grant access to non-PII columns only for analytics users
139
+ from google.cloud import bigquery
140
+
141
+ client = bigquery.Client()
142
+ table_ref = client.get_table("project.silver.transactions")
143
+
144
+ # Column-level security policy: analytics role cannot read PII columns
145
+ table_ref.schema = [
146
+ bigquery.SchemaField("transaction_id", "STRING"),
147
+ bigquery.SchemaField("amount", "FLOAT64"),
148
+ bigquery.SchemaField("currency", "STRING"),
149
+ bigquery.SchemaField("email", "STRING",
150
+ policy_tags=bigquery.PolicyTagList(["projects/project/locations/us/taxonomies/123/policyTags/456"])),
151
+ ]
+ client.update_table(table_ref, ["schema"])  # persist the updated schema, including policy tags
152
+ ```
153
+
154
+ **Secret management**
155
+
156
+ Never store credentials in pipeline code, DAG files, config files, or environment variables on shared machines:
157
+
158
+ ```python
159
+ # Bad: hardcoded in code
160
+ DB_PASSWORD = "s3cr3t"
161
+
162
+ # Bad: plaintext in config file
163
+ # config/prod.yaml: db_password: s3cr3t
164
+
165
+ # Good: fetch from secrets manager at runtime
166
+ import boto3
167
+
168
+ def get_secret(secret_name: str) -> str:
169
+ client = boto3.client("secretsmanager", region_name="us-east-1")
170
+ response = client.get_secret_value(SecretId=secret_name)
171
+ return response["SecretString"]
172
+
173
+ DB_PASSWORD = get_secret("prod/pipeline/db-password")
174
+ ```
175
+
176
+ Use short-lived credentials where possible (IAM roles with STS, Workload Identity for GKE) rather than long-lived API keys or passwords.
177
+
178
+ ### PII Handling
179
+
180
+ **PII detection at ingestion**
181
+
182
+ Scan incoming records for PII patterns before writing to bronze:
183
+
184
+ ```python
185
+ import re
186
+ from dataclasses import dataclass
187
+
188
+ PII_PATTERNS = {
189
+ "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
190
+ "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
191
+ "credit_card": re.compile(r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b"),
192
+ "phone": re.compile(r"\b\+?1?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
193
+ }
194
+
195
+ def detect_pii(record: dict) -> dict[str, list[str]]:
196
+ """Return mapping of PII type to field names containing that PII."""
197
+ findings = {}
198
+ for field_name, value in record.items():
199
+ if not isinstance(value, str):
200
+ continue
201
+ for pii_type, pattern in PII_PATTERNS.items():
202
+ if pattern.search(value):
203
+ findings.setdefault(pii_type, []).append(field_name)
204
+ return findings
205
+ ```
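Regex-only detection tends to flag any 16-digit number as a card number. One common way to cut false positives, sketched here as an optional addition rather than something the original guidance prescribes, is to run a Luhn checksum on each candidate match. The function name `luhn_valid` and the 13-digit minimum are illustrative assumptions.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # assumption: shorter strings are not card numbers
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


is_card = luhn_valid("4111111111111111")       # widely used Visa test number
is_not_card = luhn_valid("4111111111111112")   # same digits, wrong check digit
```

A finding would then be recorded only when both the regex and the checksum agree.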
206
+
207
+ **PII masking strategies**
208
+
209
+ Choose the masking strategy based on the downstream use case:
210
+
211
+ | Strategy | Description | When to use |
212
+ |----------|-------------|-------------|
213
+ | Tokenization | Replace PII with consistent token (HMAC, UUID) | Join keys across datasets; token is stable |
214
+ | Hashing | One-way hash (SHA-256 + pepper) | Deduplication; cannot reverse |
215
+ | Masking | Replace with masked value (`****@example.com`) | Display in dashboards |
216
+ | Pseudonymization | Replace with synthetic but realistic value | ML training data |
217
+ | Suppression | Remove field entirely | When field has no analytical value |
218
+
219
+ ```python
220
+ import hashlib
221
+ import hmac
+ import os
222
+
223
+ PII_PEPPER = os.environ["PII_PEPPER"] # secret pepper stored in secrets manager
224
+
225
+ def tokenize_pii(value: str) -> str:
226
+ """Generate a consistent, irreversible token for a PII value."""
227
+ return hmac.new(
228
+ PII_PEPPER.encode(),
229
+ value.lower().strip().encode(),
230
+ hashlib.sha256,
231
+ ).hexdigest()
232
+
233
+ def mask_email(email: str) -> str:
234
+ """Mask email address: j***@example.com"""
235
+ local, domain = email.split("@", 1)
236
+ return f"{local[0]}***@{domain}"
237
+ ```
238
+
239
+ **PII in logs**
240
+
241
+ Ensure PII never appears in pipeline logs. Implement a log sanitizer:
242
+
243
+ ```python
244
+ import logging
245
+
246
+ SENSITIVE_PATTERNS = [
247
+ (re.compile(r'email=["\']?[\w.@+]+', re.I), 'email=***'),
248
+ (re.compile(r'password=["\']?[\S]+', re.I), 'password=***'),
249
+ (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '***-**-****'),
250
+ ]
251
+
252
+ class PIIRedactingFilter(logging.Filter):
253
+ def filter(self, record: logging.LogRecord) -> bool:
254
+ message = str(record.getMessage())
255
+ for pattern, replacement in SENSITIVE_PATTERNS:
256
+ message = pattern.sub(replacement, message)
257
+ record.msg = message
258
+ record.args = ()
259
+ return True
260
+ ```
261
+
262
+ ### Audit Logging
263
+
264
+ Audit logs record who accessed what data, when, and what they did with it. They are required for GDPR, SOC 2, and HIPAA compliance.
265
+
266
+ **Audit event structure**
267
+
268
+ ```python
269
+ import dataclasses
270
+ from datetime import datetime
271
+
272
+ @dataclasses.dataclass
273
+ class AuditEvent:
274
+ timestamp: datetime
275
+ pipeline_id: str
276
+ pipeline_version: str
277
+ service_account: str
278
+ action: str # "read", "write", "delete", "transform"
279
+ resource: str # table name, topic, file path
280
+ record_count: int
281
+ pii_fields_accessed: list[str]
282
+ execution_id: str # correlation ID for tracing
283
+ environment: str
284
+
285
+ def emit_audit_event(event: AuditEvent) -> None:
286
+ """Write audit event to immutable audit log (append-only, no delete permission)."""
287
+ audit_client.log(
288
+ log_name="data-pipeline-audit",
289
+ payload=dataclasses.asdict(event),
290
+ severity="INFO",
291
+ )
292
+ ```
293
+
294
+ **Immutable audit log requirements**:
295
+ - Stored in a separate, append-only log store from pipeline data
296
+ - Service accounts that write to audit logs cannot delete from them
297
+ - Retain audit logs for minimum 7 years (or regulatory requirement)
298
+ - Export to SIEM system for anomaly detection
299
+
300
+ **GDPR erasure audit trail**
301
+
302
+ When processing a right-to-erasure request, emit a deletion audit event for every store purged:
303
+
304
+ ```python
305
+ def process_erasure_request(user_id: str, request_id: str) -> ErasureResult:
306
+ """Process GDPR right-to-erasure request."""
307
+ stores_purged = []
308
+ for store in REGISTERED_PII_STORES:
309
+ count = store.delete_user_records(user_id)
310
+ if count > 0:
311
+ emit_audit_event(AuditEvent(
312
+ action="delete",  # other required AuditEvent fields (timestamp, pipeline_id, etc.) omitted for brevity
313
+ resource=store.name,
314
+ record_count=count,
315
+ pii_fields_accessed=store.pii_fields,
316
+ execution_id=request_id,
317
+ ))
318
+ stores_purged.append(store.name)
319
+
320
+ return ErasureResult(
321
+ user_id=user_id,
322
+ request_id=request_id,
323
+ stores_purged=stores_purged,
324
+ completed_at=datetime.utcnow(),
325
+ )
326
+ ```
@@ -0,0 +1,280 @@
1
+ ---
2
+ name: data-pipeline-streaming-patterns
3
+ description: Event time vs processing time, windowing strategies, watermarks, and exactly-once semantics for streaming data pipelines
4
+ topics: [data-pipeline, streaming, event-time, processing-time, windowing, watermarks, exactly-once, flink, kafka-streams]
5
+ ---
6
+
7
+ Streaming pipeline correctness depends on understanding the distinction between when an event occurred and when it was processed, handling late-arriving data with watermarks, and ensuring that each event is processed exactly once even in the face of failures. Getting these fundamentals wrong produces pipelines that appear to work but silently compute incorrect aggregations — bugs that surface only when business users notice wrong numbers weeks later.
8
+
9
+ ## Summary
10
+
11
+ Use event time (when the event occurred) rather than processing time (when it was received) for all time-based aggregations. Define watermarks to bound how late an event can arrive before being dropped. Choose window types (tumbling, sliding, session) based on the business question. Implement exactly-once semantics at the processing layer and idempotent sinks to guarantee end-to-end exactly-once delivery.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Event Time vs. Processing Time
16
+
17
+ Every streaming event has two timestamps:
18
+
19
+ - **Event time**: When the event actually occurred in the real world (the user clicked, the transaction was authorized, the sensor fired)
20
+ - **Processing time**: When the event arrived at and was processed by the pipeline
21
+
22
+ These two times are rarely the same. Events are delayed by:
23
+ - Network latency (milliseconds to seconds)
24
+ - Mobile clients syncing offline activity (minutes to hours)
25
+ - Source system batch uploads (hours to days)
26
+ - Retries and error recovery (variable)
27
+
28
+ **Why event time matters**
29
+
30
+ Computing aggregations using processing time produces incorrect results when events arrive late. A purchase made at 11:58 PM but synced by a mobile app at 12:05 AM should count against the 11 PM hour, not the midnight hour. With processing time, it lands in the wrong window (and here, the wrong day).
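The late-sync example can be sketched in a few lines of plain Python. This is a minimal illustration, not part of any streaming framework; the dates and the `hour_bucket` helper are assumptions chosen to match the 11:58 PM / 12:05 AM scenario.

```python
from datetime import datetime


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour (a 1-hour tumbling window)."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Hypothetical purchase: made at 23:58, synced by the client at 00:05 the next day.
event_time = datetime(2024, 1, 15, 23, 58)
processing_time = datetime(2024, 1, 16, 0, 5)

event_window = hour_bucket(event_time)            # 2024-01-15 23:00
processing_window = hour_bucket(processing_time)  # 2024-01-16 00:00
```

Bucketing by `processing_time` moves the purchase into the next hour and the next calendar day, which is the kind of silent error event-time aggregation avoids.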
31
+
32
+ **Always use event time for**:
33
+ - Revenue and transaction aggregations
34
+ - User activity metrics
35
+ - Time-series analytics
36
+ - SLA measurement
37
+
38
+ **Processing time is appropriate for**:
39
+ - Pipeline throughput monitoring (how many events did the system process per second)
40
+ - System health metrics
41
+ - Deduplication windows within the pipeline itself (not based on business events)
42
+
43
+ ### Watermarks
44
+
45
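+ A watermark is the pipeline's estimate of how far event time has progressed: once the watermark reaches time T, the pipeline assumes no more events with timestamps earlier than T will arrive. The configured maximum delay determines how far the watermark trails the newest event seen; events that arrive behind the watermark are treated as late, and their windows may already have closed.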
46
+
47
+ **Watermark definition**
48
+
49
+ ```
50
+ watermark(t) = max(event_time seen so far) - max_delay
51
+ ```
52
+
53
+ If the pipeline has seen events with timestamps up to 14:55:00, and the configured max delay is 5 minutes, the watermark is 14:50:00. All windows ending before 14:50:00 are considered complete and can be emitted.
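The arithmetic in this example can be written out directly. The sketch below is plain Python, not Flink code; it assumes the simplified convention that a window is complete once its end is at or before the watermark, and the helper names are illustrative.

```python
from datetime import datetime, timedelta


def watermark(max_event_time: datetime, max_delay: timedelta) -> datetime:
    """Watermark = newest event time seen so far minus the allowed delay."""
    return max_event_time - max_delay


def window_is_complete(window_end: datetime, current_watermark: datetime) -> bool:
    """Simplified rule: a window can be emitted once the watermark reaches its end."""
    return window_end <= current_watermark


newest_event = datetime(2024, 1, 15, 14, 55)
wm = watermark(newest_event, timedelta(minutes=5))  # 14:50:00

closed = window_is_complete(datetime(2024, 1, 15, 14, 45), wm)      # True
still_open = window_is_complete(datetime(2024, 1, 15, 15, 0), wm)   # False
```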
54
+
55
+ **Choosing watermark delay**
56
+
57
+ The watermark delay must be large enough to allow late-arriving events to arrive, but small enough to not indefinitely delay results:
58
+
59
+ | Event source | Typical delay | Recommended watermark |
60
+ |-------------|--------------|----------------------|
61
+ | Server-side events | < 1 second | 10–30 seconds |
62
+ | Mobile app events | < 5 minutes | 5–15 minutes |
63
+ | IoT devices | < 1 minute | 2–5 minutes |
64
+ | Batch file uploads | < 1 hour | 2–4 hours |
65
+
66
+ **Late event handling strategies**
67
+
68
+ When an event arrives after the watermark (too late for its window):
69
+ 1. **Drop** (default): The event is discarded; the window has already closed
70
+ 2. **Side output / late data stream**: Route to a separate stream for separate processing or DLQ
71
+ 3. **Allowed lateness**: Keep the window open for an extra duration, recompute and emit updated results
72
+
73
+ ```python
74
+ # Apache Flink (PyFlink): watermark with late data handling
75
+ (stream
76
+     .assign_timestamps_and_watermarks(
77
+         WatermarkStrategy
78
+         .for_bounded_out_of_orderness(Duration.of_minutes(5))  # 5-minute watermark
79
+         .with_timestamp_assigner(EventTimestampAssigner())  # hypothetical TimestampAssigner returning e['event_timestamp']
80
+     )
81
+     .key_by(lambda e: e['merchant_id'])
82
+     .window(TumblingEventTimeWindows.of(Time.hours(1)))
83
+     .allowed_lateness(30 * 60 * 1000)  # keep window 30 min past watermark (milliseconds)
84
+     .side_output_late_data(late_output_tag)  # capture late events
85
+     .aggregate(RevenueAggregator()))
86
+ ```
87
+
88
+ ### Window Types
89
+
90
+ Choose the window type based on the business question:
91
+
92
+ **Tumbling windows (fixed, non-overlapping)**
93
+
94
+ Each event belongs to exactly one window. Windows are discrete and consecutive.
95
+
96
+ ```
97
+ |--- 10:00 – 11:00 ---|--- 11:00 – 12:00 ---|--- 12:00 – 13:00 ---|
98
+ ```
99
+
100
+ Use for: Hourly/daily aggregations, fixed-period KPIs, batch-style streaming.
101
+
102
+ ```python
103
+ # Transactions per hour
104
+ stream.window(TumblingEventTimeWindows.of(Time.hours(1))) \
105
+ .aggregate(CountAggregator())
106
+ ```
107
+
108
+ **Sliding windows (fixed size, overlapping)**
109
+
110
+ Windows slide at a fixed interval. Each event may belong to multiple windows.
111
+
112
+ ```
113
+ Window size=60min, slide=15min:
114
+ |--- 10:00 – 11:00 ---|
115
+ |--- 10:15 – 11:15 ---|
116
+ |--- 10:30 – 11:30 ---|
117
+ ```
118
+
119
+ Use for: Moving averages, rolling metrics, anomaly detection over recent history.
120
+
121
+ ```python
122
+ # 1-hour rolling average, updated every 15 minutes
123
+ stream.window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(15))) \
124
+ .aggregate(MovingAverageAggregator())
125
+ ```
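To make the overlap concrete, the following sketch computes which sliding windows a single event falls into. It is plain Python with windows aligned to the Unix epoch, which is an assumption for illustration; the function name `sliding_windows` is hypothetical and not a Flink API.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def sliding_windows(event_time: datetime, size: timedelta, slide: timedelta):
    """Return every [start, end) sliding window that contains event_time."""
    # Start of the most recent window that begins at or before the event.
    last_start = event_time - ((event_time - EPOCH) % slide)
    windows = []
    start = last_start
    # Walk backwards while the window still covers the event (end is exclusive).
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


windows = sliding_windows(
    datetime(2024, 1, 15, 10, 40),
    size=timedelta(minutes=60),
    slide=timedelta(minutes=15),
)
```

With a 60-minute size and 15-minute slide, each event lands in size / slide = 4 windows, so aggregation cost and state grow with that ratio.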
126
+
127
+ **Session windows (dynamic, gap-based)**
128
+
129
+ Windows close after a period of inactivity. Window size varies per user.
130
+
131
+ ```
132
+ User A: |--active--| gap |--active--|
133
+ User B: |----active----|
134
+ ```
135
+
136
+ Use for: User session analysis, visit duration, activity burst detection.
137
+
138
+ ```python
139
+ # Session window with 30-minute inactivity gap
140
+ stream.key_by(lambda e: e['user_id']) \
141
+ .window(EventTimeSessionWindows.with_gap(Time.minutes(30))) \
142
+ .aggregate(SessionAggregator())
143
+ ```
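The gap-based grouping can be sketched without a streaming engine. This minimal version handles one key's events in a batch and treats two events as belonging to the same session when they are strictly less than the gap apart; that boundary rule and the name `sessionize` are assumptions for illustration, and an engine may treat the exact-gap case differently.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, gap: timedelta):
    """Group one key's event timestamps into sessions split by inactivity gaps."""
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if this event is close enough to the last one.
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


day = datetime(2024, 1, 15)
events = [
    day.replace(hour=10, minute=0),
    day.replace(hour=10, minute=10),
    day.replace(hour=10, minute=20),
    day.replace(hour=11, minute=30),  # more than 30 minutes after the previous event
    day.replace(hour=11, minute=40),
]
sessions = sessionize(events, gap=timedelta(minutes=30))
```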
144
+
145
+ **Global windows**
146
+
147
+ Unbounded single window; requires custom trigger to emit results.
148
+
149
+ Use for: Stream-table joins, custom trigger-based aggregations.
150
+
151
+ ### Exactly-Once Semantics
152
+
153
+ Exactly-once means each event is processed and its effects are reflected in the output exactly once, even if the processing system fails and restarts.
154
+
155
+ **Delivery guarantees**
156
+
157
+ - **At-most-once**: Events may be lost, never duplicated. Simplest but loses data on failure.
158
+ - **At-least-once**: Events are never lost but may be duplicated. Requires idempotent consumers.
159
+ - **Exactly-once**: Events are neither lost nor duplicated. Most complex.
160
+
161
+ Most production streaming systems combine at-least-once delivery with idempotent sinks to achieve end-to-end exactly-once behavior.
162
+
163
+ **Flink exactly-once checkpointing**
164
+
165
+ Apache Flink implements exactly-once via distributed checkpointing:
166
+
167
+ ```python
168
+ env = StreamExecutionEnvironment.get_execution_environment()
169
+ env.enable_checkpointing(60000) # checkpoint every 60 seconds
170
+
171
+ checkpoint_config = env.get_checkpoint_config()
172
+ checkpoint_config.set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
173
+ checkpoint_config.set_min_pause_between_checkpoints(30000) # 30s between checkpoints
174
+ checkpoint_config.set_checkpoint_timeout(120000) # fail if checkpoint takes >2min
175
+ checkpoint_config.enable_externalized_checkpoints(
176
+ ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
177
+ )
178
+ ```
179
+
180
+ Flink checkpoints snapshot the full operator state to durable storage (S3, HDFS). On failure, the job restarts from the last successful checkpoint and replays events from Kafka starting at the offsets recorded in that checkpoint.
181
+
182
+ **Kafka Streams exactly-once**
183
+
184
+ ```python
185
+ props = {
186
+ "processing.guarantee": "exactly_once_v2", # EOS v2 (Kafka Streams 3.0+, brokers 2.5+)
187
+ "bootstrap.servers": "localhost:9092",
188
+ "application.id": "payments-processor",
189
+ }
190
+ ```
191
+
192
+ Kafka Streams EOS uses transactions: writes to output topics, state store changelog updates, and input offset commits are committed atomically in a single Kafka transaction. Either all of them happen or none do.
193
+
194
+ **Idempotent sinks for exactly-once delivery**
195
+
196
+ Even with exactly-once processing semantics, the sink (database, warehouse) must be idempotent to achieve end-to-end exactly-once:
197
+
198
+ ```python
199
+ # PostgreSQL: use INSERT ON CONFLICT for idempotent writes
200
+ def write_to_postgres(records: list[dict], conn) -> None:
201
+ with conn.cursor() as cur:
202
+ for record in records:
203
+ cur.execute("""
204
+ INSERT INTO transactions (id, amount, currency, processed_at)
205
+ VALUES (%(id)s, %(amount)s, %(currency)s, %(processed_at)s)
206
+ ON CONFLICT (id) DO UPDATE SET
207
+ amount = EXCLUDED.amount,
208
+ currency = EXCLUDED.currency,
209
+ processed_at = EXCLUDED.processed_at
210
+ """, record)
211
+
212
+ # BigQuery: use MERGE for idempotent loads
213
+ MERGE_SQL = """
214
+ MERGE silver_transactions T
215
+ USING (SELECT * FROM UNNEST(@records)) S
216
+ ON T.transaction_id = S.transaction_id
217
+ WHEN MATCHED THEN UPDATE SET
218
+ amount = S.amount, currency = S.currency
219
+ WHEN NOT MATCHED THEN INSERT ROW
220
+ """
221
+ ```
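The effect of an idempotent sink can be checked locally. The sketch below uses SQLite's upsert syntax (available in SQLite 3.24 and later) as a stand-in for the PostgreSQL statement; the table layout and the `write_batch` helper are assumptions for illustration.

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO transactions (id, amount, currency)
VALUES (:id, :amount, :currency)
ON CONFLICT (id) DO UPDATE SET
    amount = excluded.amount,
    currency = excluded.currency
"""


def write_batch(conn: sqlite3.Connection, records: list[dict]) -> None:
    """Upsert a batch; the connection context commits it or rolls it back as a unit."""
    with conn:
        conn.executemany(UPSERT_SQL, records)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id TEXT PRIMARY KEY, amount REAL, currency TEXT)"
)
batch = [
    {"id": "t1", "amount": 10.0, "currency": "USD"},
    {"id": "t2", "amount": 25.5, "currency": "EUR"},
]
write_batch(conn, batch)
write_batch(conn, batch)  # replay the same batch, as after a restart from a checkpoint

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM transactions"
).fetchone()
```

Replaying the batch leaves the row count and totals unchanged, which is what lets at-least-once delivery upstream behave like exactly-once at the sink.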
222
+
223
+ ### State Management in Streaming
224
+
225
+ Stateful streaming operations (aggregations, joins, deduplication) require managing state across events:
226
+
227
+ **State backend selection**
228
+
229
+ | Backend | Use case | Tradeoffs |
230
+ |---------|----------|-----------|
231
+ | In-memory (HashMap) | Dev/test, small state | Lost on failure, no persistence |
232
+ | RocksDB (embedded) | Production, large state | Persisted locally, spills to disk |
233
+ | Remote (Redis, DynamoDB) | Cross-instance shared state | Network latency, operational cost |
234
+
235
+ **State TTL (Time-To-Live)**
236
+
237
+ Always set TTL on state to prevent unbounded state growth:
238
+
239
+ ```python
240
+ # Flink: expire deduplication state after 24 hours
241
+ state_ttl_config = StateTtlConfig \
242
+ .new_builder(Time.hours(24)) \
243
+ .set_update_type(StateTtlConfig.UpdateType.OnReadAndWrite) \
244
+ .set_state_visibility(StateTtlConfig.StateVisibility.NeverReturnExpired) \
245
+ .build()
246
+
247
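+ state_descriptor = ValueStateDescriptor("seen_events", Types.STRING())
248
+ state_descriptor.enable_time_to_live(state_ttl_config)  # TTL is configured on the descriptor
249
+ # Inside a KeyedProcessFunction.open(), obtain the state from the runtime context:
250
+ dedup_state = runtime_context.get_state(state_descriptor)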
251
+ ```
252
+
253
+ **Deduplication using state**
254
+
255
+ ```python
256
+ class DeduplicationFunction(KeyedProcessFunction):
257
+     def open(self, runtime_context):
258
+         state_descriptor = ValueStateDescriptor("seen", Types.BOOLEAN())
259
+         ttl_config = StateTtlConfig.new_builder(Time.hours(24)).build()
260
+         state_descriptor.enable_time_to_live(ttl_config)
261
+         self.seen = runtime_context.get_state(state_descriptor)
262
+
263
+     def process_element(self, event, ctx):
264
+         if self.seen.value() is None:
265
+             self.seen.update(True)
266
+             yield event  # first occurrence: emit (PyFlink functions yield their output)
267
+         # else: duplicate, silently drop
268
+ ```
269
+
270
+ ### Streaming Pipeline Operational Metrics
271
+
272
+ Monitor these metrics on every production streaming pipeline:
273
+
274
+ - **Consumer lag** (Kafka): events behind the latest offset per partition
275
+ - **Event time lag**: difference between watermark and wall clock
276
+ - **Checkpoint duration and success rate** (Flink)
277
+ - **Late event rate**: percentage of events arriving after watermark
278
+ - **DLQ write rate**: events being routed to dead-letter queue
279
+ - **Throughput** (events/sec, bytes/sec): actual vs. capacity headroom
280
+ - **State size**: total operator state in bytes; watch for unbounded growth