@zigrivers/scaffold 3.8.0 → 3.9.0
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +72 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +135 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,270 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: browser-extension-testing
|
|
3
|
+
description: Extension testing with Puppeteer and Playwright, unit testing shared logic, and manual cross-browser smoke test procedures
|
|
4
|
+
topics: [browser-extension, testing, puppeteer, playwright, unit-testing, e2e, smoke-tests, cross-browser]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Browser extension testing is harder than web app testing because the extension runs in a privileged browser context that most test frameworks cannot easily access. The strategy is to maximize the code that lives in plain TypeScript (easily unit-tested), minimize the code that requires a real browser to test (expensive), and write targeted end-to-end tests that exercise the extension in a real browser for the scenarios that matter most.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Test browser extensions at three levels: unit tests for all shared and context-agnostic logic (Vitest, no browser required), integration tests for message handlers using `jest-chrome` or `sinon-chrome` to mock the `chrome.*` APIs, and end-to-end tests using Playwright or Puppeteer with the extension loaded into a real browser instance. Run a manual cross-browser smoke test checklist before every release. Never consider an extension release done without smoke testing in both Chrome and Firefox.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Testing Strategy Overview
|
|
16
|
+
|
|
17
|
+
The extension architecture determines the testing strategy:
|
|
18
|
+
|
|
19
|
+
- **Shared logic** (`src/shared/`) — Pure TypeScript functions with no browser API dependencies. Unit test with Vitest, Jest, or any standard test runner. No special setup required.
|
|
20
|
+
- **Background handlers** (`src/background/handlers/`) — Functions that call `chrome.*` APIs. Test with chrome API mocks (`jest-chrome` or manual mocks). No real browser required.
|
|
21
|
+
- **Content script logic** (`src/content/`) — DOM manipulation functions. Test with jsdom (Vitest/Jest built-in) for unit tests. Test integration with Playwright for page injection scenarios.
|
|
22
|
+
- **Popup/options UI** (`src/popup/`, `src/options/`) — React/framework components. Test with component testing tools (Vitest + Testing Library). E2E test the full popup flow with Playwright.
|
|
23
|
+
- **Full extension integration** — Test the complete extension loaded in a real browser with Playwright or Puppeteer.
|
|
24
|
+
|
|
25
|
+
### Unit Tests for Shared Logic
|
|
26
|
+
|
|
27
|
+
Unit tests are the fastest feedback loop and highest ROI in extension testing:
|
|
28
|
+
|
|
29
|
+
```typescript
|
|
30
|
+
// tests/unit/shared/url-helpers.test.ts
|
|
31
|
+
import { describe, it, expect } from 'vitest';
|
|
32
|
+
import { matchesPattern, normalizeUrl } from '../../../src/shared/url-helpers';
|
|
33
|
+
|
|
34
|
+
describe('matchesPattern', () => {
|
|
35
|
+
it('matches exact URL', () => {
|
|
36
|
+
expect(matchesPattern('https://example.com/page', 'https://example.com/page')).toBe(true);
|
|
37
|
+
});
|
|
38
|
+
|
|
39
|
+
it('matches wildcard pattern', () => {
|
|
40
|
+
expect(matchesPattern('https://example.com/anything', 'https://example.com/*')).toBe(true);
|
|
41
|
+
});
|
|
42
|
+
|
|
43
|
+
it('rejects non-matching URL', () => {
|
|
44
|
+
expect(matchesPattern('https://other.com/', 'https://example.com/*')).toBe(false);
|
|
45
|
+
});
|
|
46
|
+
});
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
**Architecture rule:** Any logic that can be written without importing `chrome.*` or `browser.*` should be. Utilities for URL parsing, config validation, data transformation, and business logic belong in `src/shared/` and are easily unit-tested.
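
As an illustration of this rule, here is a minimal sketch of what the two helpers exercised in the unit test above might look like. The implementations are assumptions, since the guide does not show `src/shared/url-helpers.ts`. The only wildcard handled is `*`, which is simpler than Chrome's full match-pattern syntax.

```typescript
// Hypothetical src/shared/url-helpers.ts: no chrome.* or browser.* imports.

/** Escapes characters that are special in a regular expression. */
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

/** Returns true when `url` matches `pattern`; `*` matches any run of characters. */
function matchesPattern(url: string, pattern: string): boolean {
  const source = pattern.split('*').map(escapeRegExp).join('.*');
  return new RegExp(`^${source}$`).test(url);
}

/** Lower-cases the host, drops the fragment, and strips a trailing slash from non-root paths. */
function normalizeUrl(raw: string): string {
  const url = new URL(raw); // the URL parser lower-cases the host
  url.hash = '';
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}
```

Because neither function touches an extension API, both run under any test runner without mocks or a browser.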
|
|
50
|
+
|
|
51
|
+
### Mocking chrome.* APIs
|
|
52
|
+
|
|
53
|
+
For testing background message handlers and storage operations without a real browser, use chrome API mocks:
|
|
54
|
+
|
|
55
|
+
**jest-chrome** (for Jest, or for Vitest with a compatibility layer):
|
|
56
|
+
|
|
57
|
+
```bash
|
|
58
|
+
npm install -D jest-chrome
|
|
59
|
+
```
|
|
60
|
+
|
|
61
|
+
```typescript
|
|
62
|
+
// tests/setup.ts (Vitest global setup)
|
|
63
|
+
import { chrome } from 'jest-chrome'; // jest-chrome exposes `chrome` as a named export
|
|
64
|
+
Object.assign(global, { chrome });
|
|
65
|
+
|
|
66
|
+
// Setup default mock implementations
|
|
67
|
+
chrome.storage.sync.get.mockImplementation((keys, callback) => {
|
|
68
|
+
callback?.({ enabled: true, sitesConfig: [] });
|
|
69
|
+
return Promise.resolve({ enabled: true, sitesConfig: [] });
|
|
70
|
+
});
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
```typescript
|
|
74
|
+
// tests/unit/background/message-router.test.ts
|
|
75
|
+
import { describe, it, expect, vi, beforeEach } from 'vitest';
|
|
76
|
+
import { chrome } from 'jest-chrome';
|
|
77
|
+
import { handleMessage } from '../../../src/background/message-router';
|
|
78
|
+
import { Messages } from '../../../src/shared/messages';
|
|
79
|
+
|
|
80
|
+
describe('handleMessage', () => {
|
|
81
|
+
beforeEach(() => {
|
|
82
|
+
chrome.storage.sync.get.mockClear();
|
|
83
|
+
});
|
|
84
|
+
|
|
85
|
+
it('responds to POPUP_GET_STATUS with current state', async () => {
|
|
86
|
+
chrome.storage.sync.get.mockResolvedValue({ enabled: true });
|
|
87
|
+
|
|
88
|
+
const sendResponse = vi.fn();
|
|
89
|
+
handleMessage(
|
|
90
|
+
{ type: Messages.POPUP_GET_STATUS },
|
|
91
|
+
{ id: chrome.runtime.id },
|
|
92
|
+
sendResponse,
|
|
93
|
+
);
|
|
94
|
+
|
|
95
|
+
await vi.waitFor(() => expect(sendResponse).toHaveBeenCalledWith({ enabled: true }));
|
|
96
|
+
});
|
|
97
|
+
});
|
|
98
|
+
```
|
|
99
|
+
|
|
100
|
+
**Manual chrome mock** (for projects that prefer explicit control):
|
|
101
|
+
|
|
102
|
+
```typescript
|
|
103
|
+
// tests/mocks/chrome.ts
import { vi } from 'vitest';
|
|
104
|
+
export const chromeMock = {
|
|
105
|
+
storage: {
|
|
106
|
+
sync: {
|
|
107
|
+
get: vi.fn(),
|
|
108
|
+
set: vi.fn().mockResolvedValue(undefined),
|
|
109
|
+
},
|
|
110
|
+
onChanged: {
|
|
111
|
+
addListener: vi.fn(),
|
|
112
|
+
removeListener: vi.fn(),
|
|
113
|
+
},
|
|
114
|
+
},
|
|
115
|
+
runtime: {
|
|
116
|
+
id: 'test-extension-id',
|
|
117
|
+
sendMessage: vi.fn(),
|
|
118
|
+
onMessage: {
|
|
119
|
+
addListener: vi.fn(),
|
|
120
|
+
},
|
|
121
|
+
},
|
|
122
|
+
tabs: {
|
|
123
|
+
query: vi.fn(),
|
|
124
|
+
sendMessage: vi.fn(),
|
|
125
|
+
},
|
|
126
|
+
};
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
### End-to-End Tests with Playwright
|
|
130
|
+
|
|
131
|
+
Playwright supports loading Chrome extensions in a persistent browser context:
|
|
132
|
+
|
|
133
|
+
```bash
|
|
134
|
+
npm install -D @playwright/test
|
|
135
|
+
npx playwright install chromium
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
```typescript
|
|
139
|
+
// tests/e2e/extension.spec.ts
|
|
140
|
+
import { test, expect, chromium, BrowserContext } from '@playwright/test';
|
|
141
|
+
import path from 'path';
|
|
142
|
+
|
|
143
|
+
const extensionPath = path.resolve(__dirname, '../../dist/chrome');
|
|
144
|
+
|
|
145
|
+
let context: BrowserContext;
|
|
146
|
+
|
|
147
|
+
test.beforeAll(async () => {
|
|
148
|
+
context = await chromium.launchPersistentContext('', {
|
|
149
|
+
headless: false, // Extensions require headed mode in Playwright
|
|
150
|
+
args: [
|
|
151
|
+
`--disable-extensions-except=${extensionPath}`,
|
|
152
|
+
`--load-extension=${extensionPath}`,
|
|
153
|
+
],
|
|
154
|
+
});
|
|
155
|
+
});
|
|
156
|
+
|
|
157
|
+
test.afterAll(async () => {
|
|
158
|
+
await context.close();
|
|
159
|
+
});
|
|
160
|
+
|
|
161
|
+
test('extension popup opens and displays status', async () => {
|
|
162
|
+
// Get the extension ID from the background service worker URL
|
|
163
|
+
let extensionId: string;
|
|
164
|
+
const backgroundPages = context.backgroundPages();
|
|
165
|
+
|
|
166
|
+
if (backgroundPages.length > 0) {
|
|
167
|
+
const backgroundPage = backgroundPages[0];
|
|
168
|
+
extensionId = new URL(backgroundPage.url()).hostname;
|
|
169
|
+
} else {
|
|
170
|
+
// Wait for the service worker
|
|
171
|
+
const serviceWorker = context.serviceWorkers()[0] ?? (await context.waitForEvent('serviceworker')); // reuse a worker that has already started
|
|
172
|
+
extensionId = new URL(serviceWorker.url()).hostname;
|
|
173
|
+
}
|
|
174
|
+
|
|
175
|
+
// Open the popup as a regular page (full Playwright API access)
|
|
176
|
+
const popupPage = await context.newPage();
|
|
177
|
+
await popupPage.goto(`chrome-extension://${extensionId}/popup/index.html`);
|
|
178
|
+
|
|
179
|
+
await expect(popupPage.locator('#status-indicator')).toBeVisible();
|
|
180
|
+
await expect(popupPage.locator('#toggle-btn')).toBeEnabled();
|
|
181
|
+
});
|
|
182
|
+
|
|
183
|
+
test('content script injects on target page', async () => {
|
|
184
|
+
const page = await context.newPage();
|
|
185
|
+
await page.goto('https://example.com');
|
|
186
|
+
|
|
187
|
+
// Wait for content script to inject
|
|
188
|
+
await page.waitForSelector('#my-ext-overlay', { timeout: 5000 });
|
|
189
|
+
|
|
190
|
+
const overlay = page.locator('#my-ext-overlay');
|
|
191
|
+
await expect(overlay).toBeVisible();
|
|
192
|
+
});
|
|
193
|
+
```
|
|
194
|
+
|
|
195
|
+
**Playwright limitations with extensions:**
|
|
196
|
+
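- The example uses `headless: false` because Chrome's legacy headless mode does not load extensions. Chrome's new headless mode (Chromium 112+) can load them; with Playwright this is usually enabled by adding `--headless=new` to `args`, since `headless: 'new'` is a Puppeteer option rather than a Playwright one. Verify the behavior against the Playwright version you have installed.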
|
|
197
|
+
- The extension must be built before tests run. Wire `build:chrome` to run before the e2e test suite in your CI pipeline.
|
|
198
|
+
|
|
199
|
+
### End-to-End Tests with Puppeteer
|
|
200
|
+
|
|
201
|
+
Puppeteer supports Chrome extensions and can run in CI:
|
|
202
|
+
|
|
203
|
+
```typescript
|
|
204
|
+
// tests/e2e/puppeteer-extension.test.ts
|
|
205
|
+
import puppeteer, { Browser } from 'puppeteer';
|
|
206
|
+
import path from 'path';
|
|
207
|
+
|
|
208
|
+
const extensionPath = path.resolve(__dirname, '../../dist/chrome');
|
|
209
|
+
|
|
210
|
+
let browser: Browser;
|
|
211
|
+
let extensionId: string;
|
|
212
|
+
|
|
213
|
+
beforeAll(async () => {
|
|
214
|
+
browser = await puppeteer.launch({
|
|
215
|
+
headless: false, // Required for extensions
|
|
216
|
+
args: [
|
|
217
|
+
`--disable-extensions-except=${extensionPath}`,
|
|
218
|
+
`--load-extension=${extensionPath}`,
|
|
219
|
+
],
|
|
220
|
+
});
|
|
221
|
+
|
|
222
|
+
// Find the extension ID from the background service worker URL
|
|
223
|
+
// waitForTarget avoids a race: the worker may not have started when launch() resolves
|
|
224
|
+
const extensionTarget = await browser.waitForTarget(
|
|
225
|
+
t => t.type() === 'service_worker' && t.url().includes('background')
|
|
226
|
+
);
|
|
227
|
+
extensionId = new URL(extensionTarget.url()).hostname;
|
|
228
|
+
});
|
|
229
|
+
|
|
230
|
+
afterAll(async () => {
|
|
231
|
+
await browser.close();
|
|
232
|
+
});
|
|
233
|
+
|
|
234
|
+
test('extension popup renders correctly', async () => {
|
|
235
|
+
const page = await browser.newPage();
|
|
236
|
+
await page.goto(`chrome-extension://${extensionId}/popup/index.html`);
|
|
237
|
+
const title = await page.$eval('h1', el => el.textContent);
|
|
238
|
+
expect(title).toBe('My Extension');
|
|
239
|
+
});
|
|
240
|
+
```
|
|
241
|
+
|
|
242
|
+
### Manual Cross-Browser Smoke Test Checklist
|
|
243
|
+
|
|
244
|
+
Automated tests cannot catch everything. Run this checklist before every release:
|
|
245
|
+
|
|
246
|
+
**Chrome smoke test:**
|
|
247
|
+
- [ ] Build `dist/chrome` with `npm run build:chrome`.
|
|
248
|
+
- [ ] Load unpacked from `chrome://extensions`.
|
|
249
|
+
- [ ] Extension icon appears in toolbar with correct icon at 16×16.
|
|
250
|
+
- [ ] Click toolbar icon — popup opens without errors (check DevTools console for the popup page).
|
|
251
|
+
- [ ] All popup controls are interactive and functional.
|
|
252
|
+
- [ ] Toggle enabled/disabled — page content changes as expected.
|
|
253
|
+
- [ ] Navigate to a target URL — content script injects without errors (check page DevTools console).
|
|
254
|
+
- [ ] Open options page (`chrome://extensions` → Details → Extension options) — loads and saves correctly.
|
|
255
|
+
- [ ] Disable the extension — injected content is removed from the page.
|
|
256
|
+
- [ ] Check `chrome://extensions` — no errors shown under the extension card.
|
|
257
|
+
- [ ] Check service worker DevTools (`Inspect views: Service Worker`) — no uncaught errors.
|
|
258
|
+
|
|
259
|
+
**Firefox smoke test:**
|
|
260
|
+
- [ ] Build `dist/firefox` with `npm run build:firefox`.
|
|
261
|
+
- [ ] Load via `about:debugging` → This Firefox → Load Temporary Add-on (select `manifest.json`).
|
|
262
|
+
- [ ] Repeat all functional checks from Chrome smoke test.
|
|
263
|
+
- [ ] Check Browser Console (`Ctrl+Shift+J`) — no extension errors.
|
|
264
|
+
- [ ] Verify `browser.storage.sync` reads/writes work (Firefox syncs separately from Chrome).
|
|
265
|
+
|
|
266
|
+
**Regression test after every change to:**
|
|
267
|
+
- `manifest.json` — Reload the extension and verify all declared features still work.
|
|
268
|
+
- Content script matches — Verify the script injects on all intended URLs and not on excluded URLs.
|
|
269
|
+
- `chrome.storage` schema — Verify existing storage data is read correctly after the schema change.
|
|
270
|
+
- Message types — Verify all message senders and receivers still agree on the message format.
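
For the `chrome.storage` schema item above, one way to make the check repeatable is to keep the migration in a pure function and unit test it with old-format fixtures. The sketch below is an assumption throughout: `schemaVersion` and the legacy `sites` field are hypothetical, and only `enabled` and `sitesConfig` appear elsewhere in this guide.

```typescript
// Hypothetical settings shapes; `schemaVersion` and `sites` are illustrative names.
interface SiteConfig { pattern: string; enabled: boolean }
interface SettingsV2 { schemaVersion: 2; enabled: boolean; sitesConfig: SiteConfig[] }

/** Upgrades whatever was read from storage to the current shape, filling defaults. */
function migrateSettings(stored: Record<string, unknown>): SettingsV2 {
  const enabled = typeof stored.enabled === 'boolean' ? stored.enabled : true;
  if (stored.schemaVersion === 2 && Array.isArray(stored.sitesConfig)) {
    // Already current: pass the data through unchanged.
    return { schemaVersion: 2, enabled, sitesConfig: stored.sitesConfig as SiteConfig[] };
  }
  // Assumed legacy shape: `sites` was a plain list of URL patterns.
  const legacySites: unknown[] = Array.isArray(stored.sites) ? stored.sites : [];
  const sitesConfig = legacySites
    .filter((s): s is string => typeof s === 'string')
    .map((pattern) => ({ pattern, enabled: true }));
  return { schemaVersion: 2, enabled, sitesConfig };
}
```

A handler would call `migrateSettings` on the result of `chrome.storage.sync.get` before using it, so the function itself stays testable without mocks.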
|
|
@@ -0,0 +1,175 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-pipeline-architecture
|
|
3
|
+
description: Lambda vs Kappa architecture tradeoffs, medallion architecture (bronze/silver/gold), and CDC patterns for data pipelines
|
|
4
|
+
topics: [data-pipeline, architecture, lambda, kappa, medallion, bronze-silver-gold, cdc, change-data-capture]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Data pipeline architecture is the set of structural decisions that determine how data flows from sources to consumers, how it is stored at each stage, and how historical data is reprocessed when logic changes. The wrong architecture creates systems that are either operationally complex (Lambda), too rigid for historical reprocessing (pure streaming), or without clear data quality boundaries (no medallion layers). These decisions are expensive to reverse and must be made explicitly.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Lambda architecture separates batch and streaming paths into two systems with a serving layer that merges them — powerful but operationally expensive. Kappa architecture uses a single streaming system with replayable logs for both real-time and historical processing. The medallion architecture (bronze/silver/gold) provides clear data quality tiers regardless of which processing model is used. CDC (Change Data Capture) is the standard mechanism for streaming relational database changes into pipelines.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Lambda Architecture
|
|
16
|
+
|
|
17
|
+
Lambda architecture processes data through two parallel paths that converge at query time:
|
|
18
|
+
|
|
19
|
+
**Batch layer (high latency, high accuracy)**
|
|
20
|
+
- Reprocesses the complete historical dataset on a schedule (daily, hourly)
|
|
21
|
+
- Produces accurate, complete views by operating on all data
|
|
22
|
+
- Tolerates high latency — results available hours after data arrives
|
|
23
|
+
- Implemented with Spark, Hadoop MapReduce, BigQuery batch jobs
|
|
24
|
+
|
|
25
|
+
**Speed layer (low latency, approximate)**
|
|
26
|
+
- Processes only recent data in real-time as events arrive
|
|
27
|
+
- Produces approximate views with low latency (seconds to minutes)
|
|
28
|
+
- Covers only the gap since the last batch run
|
|
29
|
+
- Implemented with Kafka Streams, Flink, Spark Streaming
|
|
30
|
+
|
|
31
|
+
**Serving layer**
|
|
32
|
+
- Merges batch and speed layer outputs at query time
|
|
33
|
+
- Returns the batch view for historical ranges; supplements with speed layer for recency
|
|
34
|
+
- Implemented with systems like Cassandra, HBase, or a query engine that unions both stores
|
|
35
|
+
|
|
36
|
+
**Lambda tradeoffs**
|
|
37
|
+
- Benefits: Accurate historical data (batch), low-latency recent data (speed), well-understood pattern
|
|
38
|
+
- Costs: Two codebases implementing the same business logic (divergence risk), complex serving layer merging logic, double infrastructure cost, testing complexity
|
|
39
|
+
- Use when: Latency requirements genuinely differ for historical vs. recent data, e.g., a dashboard that shows real-time last-hour data but accurate historical monthly reports
|
|
40
|
+
|
|
41
|
+
The most common Lambda failure mode is the batch and speed layers producing different results for the same time window due to logic divergence. This requires continuous reconciliation effort.
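
To make the merge and the reconciliation concrete, here is a minimal sketch under simplifying assumptions: each layer produces one count per time window, views are held in memory, and `batchHighWatermark` (a hypothetical name) marks the first window the batch layer has not yet covered.

```typescript
// Hypothetical view shape: one count per window, keyed by window start (epoch ms).
type WindowCounts = Map<number, number>;

/** Serving-layer merge: batch results win; the speed layer only fills windows batch has not covered. */
function mergeViews(batch: WindowCounts, speed: WindowCounts, batchHighWatermark: number): WindowCounts {
  const merged = new Map(batch);
  speed.forEach((count, windowStart) => {
    if (windowStart >= batchHighWatermark) merged.set(windowStart, count);
  });
  return merged;
}

/** Reconciliation: windows present in both layers whose values differ by more than `tolerance`. */
function findDivergentWindows(batch: WindowCounts, speed: WindowCounts, tolerance = 0): number[] {
  const divergent: number[] = [];
  batch.forEach((batchCount, windowStart) => {
    const speedCount = speed.get(windowStart);
    if (speedCount !== undefined && Math.abs(speedCount - batchCount) > tolerance) {
      divergent.push(windowStart);
    }
  });
  return divergent;
}
```

Running a check like `findDivergentWindows` on overlapping windows after each batch run is one way to surface logic divergence before consumers notice it.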
|
|
42
|
+
|
|
43
|
+
### Kappa Architecture
|
|
44
|
+
|
|
45
|
+
Kappa architecture replaces the Lambda dual-path with a single streaming system:
|
|
46
|
+
|
|
47
|
+
**Core premise**: Everything is a stream. Historical reprocessing is just replaying old events through the same streaming pipeline.
|
|
48
|
+
|
|
49
|
+
**Requirements for Kappa**:
|
|
50
|
+
- Event log is durable and replayable (Kafka with long retention, or event store)
|
|
51
|
+
- Events are immutable and ordered within a partition
|
|
52
|
+
- Processing logic is stateless, or its state can be rebuilt by replaying the log (e.g., RocksDB-backed state stores, Redis)
|
|
53
|
+
|
|
54
|
+
**Kappa pipeline flow**:
|
|
55
|
+
1. All events land in the replayable log (Kafka, Kinesis, Pulsar)
|
|
56
|
+
2. Streaming jobs process events in real-time
|
|
57
|
+
3. When logic changes, deploy new job version and replay from the beginning of the log
|
|
58
|
+
4. New job catches up to current time; old job is decommissioned
|
|
59
|
+
5. Serving layer reads from the streaming job's output store
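
Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not a production design: an in-memory array stands in for the replayable log, and the event shape, job type, and both job versions are hypothetical.

```typescript
// Hypothetical event and job shapes; an array stands in for a replayable log such as a Kafka topic.
interface OrderEvent { offset: number; userId: string; amountCents: number }
type StreamJob = (state: Map<string, number>, event: OrderEvent) => void;

/** Replays the log from `fromOffset` through a job and returns the output store it builds. */
function runJob(log: readonly OrderEvent[], job: StreamJob, fromOffset = 0): Map<string, number> {
  const output = new Map<string, number>();
  log.filter((e) => e.offset >= fromOffset).forEach((e) => job(output, e));
  return output;
}

// v1 logic: total spend per user.
const totalSpendV1: StreamJob = (state, e) => {
  state.set(e.userId, (state.get(e.userId) ?? 0) + e.amountCents);
};

// v2 logic after a rule change: refunds (negative amounts) are ignored.
const totalSpendV2: StreamJob = (state, e) => {
  if (e.amountCents > 0) state.set(e.userId, (state.get(e.userId) ?? 0) + e.amountCents);
};
```

Deploying v2 amounts to calling `runJob(log, totalSpendV2)` from offset 0 into a fresh output store, then pointing the serving layer at that store once it has caught up.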
|
|
60
|
+
|
|
61
|
+
**Kappa tradeoffs**
|
|
62
|
+
- Benefits: Single codebase for all processing, no logic divergence, simpler architecture, streaming-native
|
|
63
|
+
- Costs: Reprocessing large historical datasets takes time and cluster resources, log retention costs scale with history depth, complex state management for aggregations over long windows
|
|
64
|
+
- Use when: Business logic is unified (same calculation for historical and real-time), event log is durable, acceptable reprocessing latency for backfills
|
|
65
|
+
|
|
66
|
+
**Choosing Lambda vs. Kappa**
|
|
67
|
+
|
|
68
|
+
| Factor | Lambda | Kappa |
|
|
69
|
+
|--------|---------------|--------------|
|
|
70
|
+
| Logic divergence risk | High (two systems) | Low (one system) |
|
|
71
|
+
| Infrastructure cost | High | Lower |
|
|
72
|
+
| Real-time latency | Sub-second achievable | Sub-second achievable |
|
|
73
|
+
| Historical reprocessing | Fast (dedicated batch) | Slower (streaming catchup) |
|
|
74
|
+
| Team capability | Strong batch + streaming | Streaming-focused team |
|
|
75
|
+
| Use today | Rarely — mostly legacy | Default choice for new pipelines |
|
|
76
|
+
|
|
77
|
+
### Medallion Architecture
|
|
78
|
+
|
|
79
|
+
The medallion architecture (also called multi-hop) organizes data into three quality tiers regardless of whether Lambda or Kappa is used:
|
|
80
|
+
|
|
81
|
+
**Bronze layer — Raw ingestion**
|
|
82
|
+
- Stores data exactly as received from source systems, with no transformations
|
|
83
|
+
- Append-only; records are never modified or deleted (except for compliance purges)
|
|
84
|
+
- Schema-on-read: no schema enforcement at write time
|
|
85
|
+
- Retains original field names, formats, and any encoding quirks from the source
|
|
86
|
+
- Adds pipeline metadata: `_ingested_at`, `_source_system`, `_pipeline_version`, `_raw_id`
|
|
87
|
+
- Retention: indefinite (or compliance minimum) — this is the recovery point for all downstream layers
|
|
88
|
+
|
|
89
|
+
Bronze is the safety net. When a transformation bug is discovered, bronze data allows rebuilding silver and gold from scratch without re-ingesting from source.
|
|
90
|
+
|
|
91
|
+
**Silver layer — Cleaned and conformed**
|
|
92
|
+
- Applies schema enforcement, type casting, and null handling from bronze
|
|
93
|
+
- Deduplicates records using business keys
|
|
94
|
+
- Normalizes field names to the canonical data model (snake_case, consistent naming)
|
|
95
|
+
- Resolves encoding issues, trims whitespace, standardizes date formats
|
|
96
|
+
- Joins with slowly-changing dimension tables (currency conversion rates, country codes)
|
|
97
|
+
- Enforces data quality rules; routes failing records to DLQ
|
|
98
|
+
- Schema-on-write: strict Avro/Parquet schema applied at write time
|
|
99
|
+
- Retention: 1–3 years (or per regulatory requirement)
|
|
100
|
+
|
|
101
|
+
Silver is the trusted, clean, integrated dataset. Most analytical consumers should read from silver, not bronze.
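
The bronze-to-silver step can be sketched as a pure function. The record shapes below are hypothetical (only `_ingested_at` comes from the bronze conventions above), and the rules shown are a small subset of the list: field renaming, type casting, whitespace trimming, deduplication on a business key, and routing failures to a dead-letter list.

```typescript
// Hypothetical record shapes for illustration.
interface BronzeOrder { OrderID: string; Amount: string; Country: string; _ingested_at: string }
interface SilverOrder { order_id: string; amount: number; country_code: string }
interface RejectedOrder { record: BronzeOrder; reason: string }

/** Cleans bronze rows into silver rows; invalid rows go to a dead-letter list instead of being dropped. */
function toSilver(rows: BronzeOrder[]): { silver: SilverOrder[]; dlq: RejectedOrder[] } {
  const latest = new Map<string, { row: SilverOrder; ingestedAt: string }>();
  const dlq: RejectedOrder[] = [];
  for (const record of rows) {
    const orderId = record.OrderID.trim();
    const amountText = record.Amount.trim();
    const amount = Number(amountText);
    if (orderId === '') {
      dlq.push({ record, reason: 'missing order id' });
    } else if (amountText === '' || !Number.isFinite(amount)) {
      dlq.push({ record, reason: 'amount is not numeric' });
    } else {
      const previous = latest.get(orderId);
      // Deduplicate on the business key, keeping the most recently ingested version.
      // ISO-8601 UTC timestamps compare correctly as strings.
      if (!previous || record._ingested_at > previous.ingestedAt) {
        latest.set(orderId, {
          row: { order_id: orderId, amount, country_code: record.Country.trim().toUpperCase() },
          ingestedAt: record._ingested_at,
        });
      }
    }
  }
  const silver: SilverOrder[] = [];
  latest.forEach((entry) => silver.push(entry.row));
  return { silver, dlq };
}
```

Because the function never mutates its input, bronze stays untouched and the same rows can be reprocessed after a rule change.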
|
|
102
|
+
|
|
103
|
+
**Gold layer — Aggregated and business-ready**
|
|
104
|
+
- Pre-aggregated views optimized for specific business questions
|
|
105
|
+
- Applies business logic: revenue rollups, user segmentation, cohort calculations
|
|
106
|
+
- Denormalized for query performance (no joins required by consumers)
|
|
107
|
+
- Named for the business concept, not the data source: `daily_revenue`, `user_ltv`, `product_performance`
|
|
108
|
+
- Retention: dependent on business reporting requirements (often indefinitely for monthly/yearly aggregates)
|
|
109
|
+
|
|
110
|
+
Gold is what business users, dashboards, and ML features consume. Schema changes in gold require migration planning and consumer coordination.
|
|
111
|
+
|
|
112
|
+
**Medallion implementation rules**
|
|
113
|
+
- Bronze → Silver is an automated pipeline; never manually edit bronze data
|
|
114
|
+
- Silver → Gold is also automated; never manually edit silver data
|
|
115
|
+
- Data flows only forward (bronze → silver → gold), never backward
|
|
116
|
+
- Consumer systems should read from the highest-quality layer that satisfies their latency requirements
|
|
117
|
+
- Schema changes in bronze require no consumer coordination; changes in silver/gold do
|
|
118
|
+
|
|
119
|
+
### CDC (Change Data Capture) Patterns
|
|
120
|
+
|
|
121
|
+
CDC streams database changes (INSERT, UPDATE, DELETE) as events into the pipeline without polling or application-layer hooks.
|
|
122
|
+
|
|
123
|
+
**Why CDC**
|
|
124
|
+
- Minimal impact on source database performance (log-based CDC reads the WAL instead of querying production tables)
|
|
125
|
+
- Low latency from database commit to pipeline event (sub-second is achievable with log-based CDC)
|
|
126
|
+
- Captures all changes including bulk updates and direct SQL modifications
|
|
127
|
+
- Works for initial data load and ongoing changes through the same mechanism
|
|
128
|
+
|
|
129
|
+
**CDC implementation options**
|
|
130
|
+
|
|
131
|
+
*Log-based CDC (recommended)*: Reads the database Write-Ahead Log (WAL) directly
|
|
132
|
+
- PostgreSQL: logical replication using `pgoutput` plugin (Debezium connector)
|
|
133
|
+
- MySQL: reads binary log (binlog) using Debezium MySQL connector
|
|
134
|
+
- SQL Server: uses Change Data Capture feature built into SQL Server
|
|
135
|
+
|
|
136
|
+
*Query-based CDC (polling)*: Queries for rows modified after a watermark timestamp
|
|
137
|
+
- Simpler to implement but misses DELETEs, requires `updated_at` column, higher DB load
|
|
138
|
+
- Acceptable for low-volume tables where log-based CDC is not available
|
|
139
|
+
|
|
140
|
+
*Triggers*: Database triggers write changes to a staging table
|
|
141
|
+
- High database overhead, blocking risk, not recommended for high-volume tables
|
|
142
|
+
|
|
143
|
+
**Debezium CDC event structure**
|
|
144
|
+
|
|
145
|
+
Debezium transforms each database change into a structured event:
|
|
146
|
+
|
|
147
|
+
```json
|
|
148
|
+
{
|
|
149
|
+
"before": { "id": 123, "status": "pending", "amount": 100.00 },
|
|
150
|
+
"after": { "id": 123, "status": "completed", "amount": 100.00 },
|
|
151
|
+
"op": "u",
|
|
152
|
+
"ts_ms": 1705329825000,
|
|
153
|
+
"source": {
|
|
154
|
+
"db": "payments",
|
|
155
|
+
"table": "transactions",
|
|
156
|
+
"lsn": 1847392
|
|
157
|
+
}
|
|
158
|
+
}
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
`op` values: `c` (create/insert), `u` (update), `d` (delete), `r` (read, initial snapshot)
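
A consumer typically switches on `op` to keep a downstream copy in sync. The sketch below types only the fields shown in the example event and applies changes to an in-memory map; the type names and the assumption of a numeric `id` primary key are illustrative, not part of Debezium.

```typescript
// Minimal envelope type covering only the fields shown in the example event.
type TableRow = { id: number } & Record<string, unknown>;
interface DebeziumChange {
  before: TableRow | null;
  after: TableRow | null;
  op: 'c' | 'u' | 'd' | 'r';
  ts_ms: number;
  source: { db: string; table: string; lsn: number };
}

/** Applies one change event to an in-memory replica keyed by primary key. */
function applyChange(replica: Map<number, TableRow>, event: DebeziumChange): void {
  switch (event.op) {
    case 'c':
    case 'r':
    case 'u':
      // Inserts, snapshot reads, and updates all carry the new row state in `after`.
      if (event.after) replica.set(event.after.id, event.after);
      break;
    case 'd':
      // Deletes carry the removed row in `before`; `after` is null.
      if (event.before) replica.delete(event.before.id);
      break;
  }
}
```

Treating `c`, `r`, and `u` as the same upsert keeps the consumer correct when a snapshot overlaps with live changes.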
|
|
162
|
+
|
|
163
|
+
**CDC pipeline topology**
|
|
164
|
+
|
|
165
|
+
```
|
|
166
|
+
Source DB → Debezium → Kafka (per-table topics) → Stream Processor → Bronze → Silver
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
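Each source table gets its own Kafka topic. By default Debezium names it `{topic.prefix}.{schema}.{table}` for PostgreSQL and `{topic.prefix}.{database}.{table}` for MySQL, where `topic.prefix` is set in the connector configuration. This allows consumers to subscribe to specific tables independently.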
|
|
170
|
+
|
|
171
|
+
**CDC operational considerations**
|
|
172
|
+
- Schema changes in the source database must be handled gracefully (Avro schema evolution, schema registry)
|
|
173
|
+
- Initial snapshot: Debezium can snapshot existing table data before beginning to tail the log — manage this carefully for large tables (can take hours)
|
|
174
|
+
- Replication slot management: PostgreSQL logical replication slots accumulate WAL if the consumer falls behind — monitor replication slot lag as a critical metric
|
|
175
|
+
- Delivery guarantees: Kafka + Debezium provides at-least-once delivery, so duplicates can appear after restarts or rebalances; implement idempotent consumers to get effectively exactly-once results
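
One common way to make a consumer idempotent is to remember the highest log position already applied and skip anything at or below it. The class below is a sketch under two assumptions: events for a given table arrive in increasing `lsn` order (as they do within a single Kafka partition), and the watermark lives in memory. A real consumer would persist the watermark atomically with the write it guards.

```typescript
// Hypothetical helper: tracks the highest LSN already applied, per source table.
class IdempotentApplier<T> {
  private readonly lastAppliedLsn = new Map<string, number>();
  private readonly apply: (payload: T) => void;

  constructor(apply: (payload: T) => void) {
    this.apply = apply;
  }

  /** Applies the payload unless this position was already processed. Returns true when applied. */
  handle(table: string, lsn: number, payload: T): boolean {
    const last = this.lastAppliedLsn.get(table);
    if (last !== undefined && lsn <= last) return false; // redelivered event: skip
    this.apply(payload);
    this.lastAppliedLsn.set(table, lsn);
    return true;
  }
}
```

With this in place, a redelivered event after a consumer restart is detected and dropped rather than applied twice.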
|