@zigrivers/scaffold 3.7.0 → 3.9.0

Files changed (97)
  1. package/README.md +113 -8
  2. package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
  3. package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
  4. package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
  5. package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
  6. package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
  7. package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
  8. package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
  9. package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
  10. package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
  11. package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
  12. package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
  13. package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
  14. package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
  15. package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
  16. package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
  17. package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
  18. package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
  19. package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
  20. package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
  21. package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
  22. package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
  23. package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
  24. package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
  25. package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
  26. package/content/knowledge/library/library-api-design.md +306 -0
  27. package/content/knowledge/library/library-architecture.md +247 -0
  28. package/content/knowledge/library/library-bundling.md +244 -0
  29. package/content/knowledge/library/library-conventions.md +229 -0
  30. package/content/knowledge/library/library-dev-environment.md +220 -0
  31. package/content/knowledge/library/library-documentation.md +300 -0
  32. package/content/knowledge/library/library-project-structure.md +237 -0
  33. package/content/knowledge/library/library-requirements.md +173 -0
  34. package/content/knowledge/library/library-security.md +257 -0
  35. package/content/knowledge/library/library-testing.md +319 -0
  36. package/content/knowledge/library/library-type-definitions.md +284 -0
  37. package/content/knowledge/library/library-versioning.md +300 -0
  38. package/content/knowledge/ml/ml-architecture.md +172 -0
  39. package/content/knowledge/ml/ml-conventions.md +209 -0
  40. package/content/knowledge/ml/ml-dev-environment.md +299 -0
  41. package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
  42. package/content/knowledge/ml/ml-model-evaluation.md +256 -0
  43. package/content/knowledge/ml/ml-observability.md +253 -0
  44. package/content/knowledge/ml/ml-project-structure.md +216 -0
  45. package/content/knowledge/ml/ml-requirements.md +138 -0
  46. package/content/knowledge/ml/ml-security.md +188 -0
  47. package/content/knowledge/ml/ml-serving-patterns.md +243 -0
  48. package/content/knowledge/ml/ml-testing.md +301 -0
  49. package/content/knowledge/ml/ml-training-patterns.md +269 -0
  50. package/content/knowledge/mobile-app/mobile-app-architecture.md +283 -0
  51. package/content/knowledge/mobile-app/mobile-app-conventions.md +180 -0
  52. package/content/knowledge/mobile-app/mobile-app-deployment.md +298 -0
  53. package/content/knowledge/mobile-app/mobile-app-dev-environment.md +257 -0
  54. package/content/knowledge/mobile-app/mobile-app-distribution.md +264 -0
  55. package/content/knowledge/mobile-app/mobile-app-observability.md +317 -0
  56. package/content/knowledge/mobile-app/mobile-app-offline-patterns.md +311 -0
  57. package/content/knowledge/mobile-app/mobile-app-project-structure.md +245 -0
  58. package/content/knowledge/mobile-app/mobile-app-push-notifications.md +321 -0
  59. package/content/knowledge/mobile-app/mobile-app-requirements.md +147 -0
  60. package/content/knowledge/mobile-app/mobile-app-security.md +338 -0
  61. package/content/knowledge/mobile-app/mobile-app-testing.md +400 -0
  62. package/content/methodology/browser-extension-overlay.yml +82 -0
  63. package/content/methodology/data-pipeline-overlay.yml +70 -0
  64. package/content/methodology/library-overlay.yml +67 -0
  65. package/content/methodology/ml-overlay.yml +70 -0
  66. package/content/methodology/mobile-app-overlay.yml +71 -0
  67. package/dist/cli/commands/init.d.ts +22 -0
  68. package/dist/cli/commands/init.d.ts.map +1 -1
  69. package/dist/cli/commands/init.js +202 -3
  70. package/dist/cli/commands/init.js.map +1 -1
  71. package/dist/cli/commands/init.test.js +190 -0
  72. package/dist/cli/commands/init.test.js.map +1 -1
  73. package/dist/config/schema.d.ts +1456 -80
  74. package/dist/config/schema.d.ts.map +1 -1
  75. package/dist/config/schema.js +87 -0
  76. package/dist/config/schema.js.map +1 -1
  77. package/dist/config/schema.test.js +312 -3
  78. package/dist/config/schema.test.js.map +1 -1
  79. package/dist/core/assembly/overlay-loader.test.js +55 -0
  80. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  81. package/dist/e2e/project-type-overlays.test.d.ts +2 -1
  82. package/dist/e2e/project-type-overlays.test.d.ts.map +1 -1
  83. package/dist/e2e/project-type-overlays.test.js +780 -14
  84. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  85. package/dist/types/config.d.ts +16 -1
  86. package/dist/types/config.d.ts.map +1 -1
  87. package/dist/wizard/questions.d.ts +28 -1
  88. package/dist/wizard/questions.d.ts.map +1 -1
  89. package/dist/wizard/questions.js +127 -1
  90. package/dist/wizard/questions.js.map +1 -1
  91. package/dist/wizard/questions.test.js +224 -4
  92. package/dist/wizard/questions.test.js.map +1 -1
  93. package/dist/wizard/wizard.d.ts +22 -0
  94. package/dist/wizard/wizard.d.ts.map +1 -1
  95. package/dist/wizard/wizard.js +28 -1
  96. package/dist/wizard/wizard.js.map +1 -1
  97. package/package.json +1 -1
@@ -0,0 +1,300 @@
1
+ ---
2
+ name: library-versioning
3
+ description: Semver discipline, breaking change detection, release automation, and changelog management for published libraries
4
+ topics: [library, versioning, semver, breaking-changes, release-automation, changelog, changesets]
5
+ ---
6
+
7
+ Library versioning is a communication protocol with consumers. Semver (Semantic Versioning) is not merely a numbering scheme — it is a contract about backward compatibility. Breaking that contract without a major version bump is one of the most damaging things a library can do. Consumers set version ranges expecting that minor updates are safe to take automatically. Violating that expectation causes production incidents for real applications. Versioning discipline must be enforced by tooling, not willpower.
8
+
9
+ ## Summary
10
+
11
+ Enforce semver through tooling: use changesets or semantic-release to automate versioning based on change metadata. Use automated breaking change detection (API Extractor or type-coverage checks) to catch accidental breaking changes before publish. Every release requires a CHANGELOG entry with migration guidance for breaking changes. Pre-releases (`alpha`, `beta`, `rc`) allow consumers to opt into early testing without affecting stable installs. Tag releases in git to enable diff-based changelog generation.
12
+
13
+ Versioning workflow:
14
+ 1. Author creates changeset file describing the change type (patch/minor/major)
15
+ 2. CI aggregates changesets and proposes a version bump PR
16
+ 3. Version bump PR merges, triggering publish to npm
17
+ 4. Git tag pushed matching the published version
18
+ 5. GitHub Release created from changelog content
19
+
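As a rough illustration of step 2 in the workflow above, the sketch below picks the most significant bump among the pending changesets and computes the next version. It is a simplification under stated assumptions, not how Changesets itself is implemented: the names `highest_bump` and `next_version` are hypothetical, and pre-release identifiers and 0.x conventions are ignored.

```python
from typing import Iterable

# Significance order of bump types (hypothetical helper table for this sketch).
BUMP_ORDER = {"patch": 0, "minor": 1, "major": 2}


def highest_bump(bumps: Iterable[str]) -> str:
    """Return the most significant bump type among the pending changesets."""
    bumps = list(bumps)
    if not bumps:
        raise ValueError("no pending changesets")
    return max(bumps, key=BUMP_ORDER.__getitem__)


def next_version(current: str, bump: str) -> str:
    """Apply a patch/minor/major bump to a plain MAJOR.MINOR.PATCH version."""
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {bump}")


pending = ["patch", "minor", "patch"]
print(next_version("2.0.3", highest_bump(pending)))  # 2.1.0
```

A single `minor` changeset outranks any number of `patch` changesets, which is why the release above becomes 2.1.0 rather than 2.0.4.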
20
+ ## Deep Guidance
21
+
22
+ ### Changesets Workflow
23
+
24
+ Changesets is the recommended tool for managing versioning in library projects. It decouples the decision of "what version bump does this change require" from "when do we publish."
25
+
26
+ **Setup:**
27
+ ```bash
28
+ npm install --save-dev @changesets/cli
29
+ npx changeset init
30
+ ```
31
+
32
+ This creates a `.changeset/` directory at the project root.
33
+
34
+ **Creating a changeset (run for every PR that changes behavior):**
35
+ ```bash
36
+ npx changeset add
37
+ # Interactive prompt:
38
+ # ? Which packages would you like to include? my-library
39
+ # ? What type of change is this for my-library?
40
+ # major (Breaking change)
41
+ # minor (New feature)
42
+ # > patch (Bug fix)
43
+ # ? Please enter a summary for this change:
44
+ # Fix parseConfig() incorrectly ignoring the encoding option
45
+ ```
46
+
47
+ This creates a markdown file in `.changeset/`:
48
+ ```markdown
49
+ <!-- .changeset/silver-wolves-grin.md -->
50
+ ---
51
+ "my-library": patch
52
+ ---
53
+
54
+ Fix parseConfig() incorrectly ignoring the encoding option when parsing file input.
55
+ ```
56
+
57
+ **Version bump and publish:**
58
+ ```bash
59
+ # Update package.json version and CHANGELOG.md
60
+ npx changeset version
61
+
62
+ # Publish to npm
63
+ npx changeset publish
64
+ ```
65
+
66
+ In CI, automate this with the Changesets GitHub Action:
67
+ ```yaml
68
+ # .github/workflows/release.yml
69
+ # name: Release — CI publishes on push to main
70
+ on:
71
+   push:
72
+     branches: [main]
73
+
74
+ jobs:
75
+   release:
76
+     runs-on: ubuntu-latest
77
+     steps:
78
+       - uses: actions/checkout@v4
79
+       - uses: actions/setup-node@v4
80
+         with:
81
+           node-version: 20
82
+           registry-url: 'https://registry.npmjs.org'
83
+       - run: npm ci
84
+       - uses: changesets/action@v1
85
+         with:
86
+           publish: npm run release
87
+         env:
88
+           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
89
+           NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
90
+ ```
91
+
92
+ The action opens a "Version Packages" PR when changesets are present and publishes when that PR merges.
93
+
94
+ ### Breaking Change Detection with API Extractor
95
+
96
+ Microsoft's API Extractor catches breaking changes by comparing the current API surface against a committed baseline:
97
+
98
+ **Setup:**
99
+ ```bash
100
+ npm install --save-dev @microsoft/api-extractor
101
+ npx api-extractor init
102
+ ```
103
+
104
+ ```json
105
+ // api-extractor.json (key settings)
106
+ {
107
+   "mainEntryPointFilePath": "<projectFolder>/dist/types/index.d.ts",
108
+   "apiReport": {
109
+     "enabled": true,
110
+     "reportFolder": "<projectFolder>/etc/"
111
+   },
112
+   "docModel": {
113
+     "enabled": true
114
+   }
115
+ }
116
+ ```
117
+
118
+ ```bash
119
+ # Generate initial API report (commit this file)
120
+ npx api-extractor run --local
121
+
122
+ # In CI: compare against committed report
123
+ npx api-extractor run
124
+ # Fails if the API surface changed in ways not reflected in the committed report
125
+ ```
126
+
127
+ The generated `etc/my-library.api.md` file shows the complete public API surface in a reviewable format. When a PR changes it, reviewers can see exactly what changed. If the change is intentional, update the committed report; if not, fix the breaking change.
128
+
129
+ **API report excerpt:**
130
+ ```markdown
131
+ // @public
132
+ export function parseConfig(input: string, options?: ParseOptions): Config;
133
+
134
+ // @public
135
+ export interface ParseOptions {
136
+   encoding?: BufferEncoding;
137
+   strict?: boolean;
138
+ }
139
+
140
+ // @public
141
+ export class ParseError extends Error {
142
+   constructor(message: string, line: number, column: number);
143
+   readonly column: number;
144
+   readonly line: number;
145
+ }
146
+ ```
147
+
148
+ This format makes breaking changes immediately visible in code review.
149
+
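To make the review step concrete, here is a minimal sketch that compares two API surfaces and suggests the semver bump a reviewer should expect. It is not part of API Extractor, which only reports that the API report changed. The function names and the dictionary representation (export name mapped to signature text) are assumptions for illustration, and any changed signature is conservatively treated as breaking.

```python
def diff_api_surface(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {export name: signature} maps and list what differs."""
    removed = sorted(name for name in baseline if name not in current)
    changed = sorted(
        name for name in baseline
        if name in current and baseline[name] != current[name]
    )
    added = sorted(name for name in current if name not in baseline)
    return {"removed": removed, "changed": changed, "added": added}


def required_bump(diff: dict[str, list[str]]) -> str:
    """Conservative rule: removals or signature changes need a major bump."""
    if diff["removed"] or diff["changed"]:
        return "major"
    if diff["added"]:
        return "minor"
    return "patch"


baseline = {
    "parseConfig": "(input: string, options?: ParseOptions): Config",
    "parse": "(input: string): Config",
}
current = {
    "parseConfig": "(input: string, options?: ParseOptions): Config",
    "parseConfigFile": "(path: string): Config",
}

diff = diff_api_surface(baseline, current)
print(diff)                 # {'removed': ['parse'], 'changed': [], 'added': ['parseConfigFile']}
print(required_bump(diff))  # major
```

Some signature changes, such as adding an optional parameter, are backward compatible. The sketch flags them anyway so that a human makes the call.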
150
+ ### Pre-Release Channels
151
+
152
+ Pre-releases allow consumers to test upcoming changes without affecting stable installs:
153
+
154
+ **With changesets:**
155
+ ```bash
156
+ # Enter pre-release mode
157
+ npx changeset pre enter alpha
158
+ # Or: beta, rc
159
+
160
+ # Create changesets and version as normal
161
+ npx changeset add
162
+ npx changeset version
163
+ # Produces a pre-release version, e.g. 2.0.0-alpha.0 (the pre-release counter starts at 0)
164
+
165
+ # Exit pre-release mode
166
+ npx changeset pre exit
167
+ ```
168
+
169
+ **Manual pre-release versioning:**
170
+ ```json
171
+ // package.json
172
+ "version": "2.0.0-alpha.1"
173
+ ```
174
+
175
+ ```bash
176
+ npm publish --tag alpha
177
+ # Consumers opt-in: npm install my-library@alpha
178
+ # Stable consumers (npm install my-library) are unaffected
179
+ ```
180
+
181
+ **Pre-release channel strategy:**
182
+ - `alpha` — internal testing only, may change drastically, no API stability
183
+ - `beta` — public testing, API reasonably stable, looking for feedback
184
+ - `rc` (release candidate) — API frozen, looking for final integration issues
185
+ - Stable — semver protected, change policy enforced
186
+
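The channel ordering above follows from semver precedence rules: a pre-release sorts below its stable version, and identifiers are compared left to right. The sketch below implements that comparison as described in the Semantic Versioning 2.0.0 specification. The helper names are hypothetical, and build metadata (`+...`) is not handled.

```python
from functools import cmp_to_key


def _parse(version: str):
    """Split 'X.Y.Z[-pre.release]' into a numeric core and pre-release identifiers."""
    core, _, pre = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch), (pre.split(".") if pre else [])


def _cmp_identifier(a: str, b: str) -> int:
    a_num, b_num = a.isdigit(), b.isdigit()
    if a_num and b_num:
        return (int(a) > int(b)) - (int(a) < int(b))  # numeric comparison
    if a_num != b_num:
        return -1 if a_num else 1  # numeric identifiers sort below alphanumeric ones
    return (a > b) - (a < b)  # ASCII lexical comparison


def compare(v1: str, v2: str) -> int:
    """Return -1, 0, or 1 according to semver precedence."""
    core1, pre1 = _parse(v1)
    core2, pre2 = _parse(v2)
    if core1 != core2:
        return (core1 > core2) - (core1 < core2)
    if not pre1 or not pre2:
        # A version without a pre-release tag outranks one with a tag.
        return bool(pre2) - bool(pre1)
    for a, b in zip(pre1, pre2):
        result = _cmp_identifier(a, b)
        if result:
            return result
    # All shared identifiers are equal: the longer list wins.
    return (len(pre1) > len(pre2)) - (len(pre1) < len(pre2))


versions = ["2.0.0", "2.0.0-rc.1", "2.0.0-alpha.1", "2.0.0-beta.2", "2.0.0-alpha.10", "1.9.0"]
print(sorted(versions, key=cmp_to_key(compare)))
# ['1.9.0', '2.0.0-alpha.1', '2.0.0-alpha.10', '2.0.0-beta.2', '2.0.0-rc.1', '2.0.0']
```

Note that `alpha.10` sorts after `alpha.1` because numeric identifiers are compared as numbers, not as strings.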
187
+ ### Release Automation with GitHub Actions
188
+
189
+ Full release workflow with provenance and attestation:
190
+
191
+ ```yaml
192
+ # .github/workflows/release.yml
193
+ # name: Release — CI publishes on tag push
194
+ on:
195
+   push:
196
+     tags:
197
+       - 'v*'
198
+
199
+ permissions:
200
+   contents: write
201
+   id-token: write  # For npm provenance
202
+
203
+ jobs:
204
+   release:
205
+     runs-on: ubuntu-latest
206
+     steps:
207
+       - uses: actions/checkout@v4
208
+
209
+       - uses: actions/setup-node@v4
210
+         with:
211
+           node-version: '20'
212
+           registry-url: 'https://registry.npmjs.org'
213
+
214
+       - name: Install dependencies
215
+         run: npm ci
216
+
217
+       - name: Build
218
+         run: npm run build
219
+
220
+       - name: Test
221
+         run: npm test
222
+
223
+       - name: Publish to npm
224
+         run: npm publish --provenance --access public
225
+         env:
226
+           NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
227
+
228
+       - name: Create GitHub Release
229
+         uses: softprops/action-gh-release@v2
230
+         with:
231
+           generate_release_notes: true
232
+           draft: false
233
+ ```
234
+
235
+ The `--provenance` flag publishes npm provenance attestation — a cryptographic link between the published package and the GitHub Actions run that built it. This allows consumers to verify the package was built from the expected source.
236
+
237
+ ### CHANGELOG Generation
238
+
239
+ Use the Keep a Changelog format and generate entries automatically:
240
+
241
+ ```bash
242
+ # With conventional commits, generate changelog automatically:
243
+ npx conventional-changelog-cli -p angular -i CHANGELOG.md -s
244
+
245
+ # Or with changesets:
246
+ npx changeset version # Updates CHANGELOG.md automatically
247
+ ```
248
+
249
+ **Manual CHANGELOG structure:**
250
+ ```markdown
251
+ # Changelog
252
+
253
+ All notable changes to this project will be documented in this file.
254
+ Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
255
+ Versioning: [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
256
+
257
+ ## [Unreleased]
258
+
259
+ ## [2.1.0] - 2024-03-15
260
+
261
+ ### Added
262
+ - `parseConfigFile(path)` for file-based parsing
263
+ - `ParseOptions.maxSize` to limit input size
264
+
265
+ ### Fixed
266
+ - `parseConfig()` no longer ignores `encoding` option for buffer inputs
267
+
268
+ ## [2.0.0] - 2024-01-10
269
+
270
+ ### Breaking Changes
271
+ - Removed `parse()` (deprecated in 1.5.0). Replacement: `parseConfig()`.
272
+ - `Config.timeout` is now milliseconds (was seconds). Multiply existing values × 1000.
273
+ - Dropped Node 16 support. Minimum: Node 18.
274
+
275
+ ### Migration Guide
276
+ See: https://my-library.dev/guides/migration-v2
277
+
278
+ [Unreleased]: https://github.com/org/my-library/compare/v2.1.0...HEAD
279
+ [2.1.0]: https://github.com/org/my-library/compare/v2.0.0...v2.1.0
280
+ [2.0.0]: https://github.com/org/my-library/releases/tag/v2.0.0
281
+ ```
282
+
283
+ ### Git Tag Strategy
284
+
285
+ Tag every release:
286
+ ```bash
287
+ # After publishing to npm
288
+ git tag v2.1.0
289
+ git push origin v2.1.0
290
+
291
+ # Or use npm version (updates package.json, commits, and tags)
292
+ npm version minor -m "chore(release): v%s"
293
+ git push && git push --tags
294
+ ```
295
+
296
+ Tags are the source of truth for "what was published when." They enable:
297
+ - Reproducible builds from any historical version
298
+ - `git diff v2.0.0 v2.1.0` to review what changed between releases
299
+ - Automated changelog generation tools
300
+ - GitHub Release creation linked to the exact commit
@@ -0,0 +1,172 @@
1
+ ---
2
+ name: ml-architecture
3
+ description: Training/serving architecture split, feature stores, model registry, online vs offline inference patterns, and ML system design decisions
4
+ topics: [ml, architecture, feature-store, model-registry, inference, serving, training]
5
+ ---
6
+
7
+ ML systems have a fundamental architectural split that traditional software does not: the training system and the serving system are different codebases running on different infrastructure, yet they must apply exactly the same data transformations. When the two diverge, the result is training-serving skew, one of the most common sources of silent production bugs in ML. Designing the architecture to prevent skew (through shared feature stores, shared preprocessing libraries, and strict interface contracts) is the most important ML architecture decision.
8
+
9
+ ## Summary
10
+
11
+ ML architecture separates training (batch, data-intensive, experimental) from serving (latency-sensitive, stateless, reliable). Feature stores eliminate training-serving skew by providing consistent feature computation for both. Model registries provide the governance layer: versioning, lineage, deployment gates, and rollback. Choose online vs. offline inference based on latency requirements and data freshness needs. Document every architectural decision as an ADR.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Training/Serving Architecture Split
16
+
17
+ The training system and serving system have fundamentally different requirements:
18
+
19
+ | Dimension | Training | Serving |
20
+ |-----------|----------|---------|
21
+ | Throughput | High (process TB of data) | High (handle thousands of RPS) |
22
+ | Latency | Not time-critical | P99 < 200ms |
23
+ | Consistency | Best-effort | Exact reproducibility |
24
+ | Infrastructure | Spot instances, GPU clusters | Reserved instances, autoscaling |
25
+ | State | Stateful (checkpoint, resume) | Stateless (each request independent) |
26
+ | Failures | Retry / resume from checkpoint | Circuit breaker, fallback |
27
+
28
+ The primary risk of this split is **training-serving skew**: the model was trained with feature X computed one way, but production computes feature X a slightly different way, leading to silent accuracy degradation.
29
+
30
+ **Prevention strategies**:
31
+ 1. **Shared feature library**: Both training and serving import the same `src/features/` module for all feature computation. Never duplicate feature logic.
32
+ 2. **Feature store**: Centralised store that computes features once and serves them to both training (historical) and serving (real-time) paths.
33
+ 3. **Schema validation**: Validate model inputs at serving time against the schema seen during training.
34
+
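Strategy 3 above can be sketched in a few lines. The example below checks a serving request against a schema recorded at training time. The schema contents, the feature names, and the `validate_features` helper are hypothetical, and a real system would likely also check value ranges and nullability.

```python
# Hypothetical schema captured when the model was trained.
TRAINING_SCHEMA = {"amount": float, "country": str, "account_age_days": int}


def validate_features(features: dict, schema: dict) -> list[str]:
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for name, expected_type in schema.items():
        if name not in features:
            errors.append(f"missing feature: {name}")
        elif not isinstance(features[name], expected_type):
            actual = type(features[name]).__name__
            errors.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    for name in features:
        if name not in schema:
            errors.append(f"unexpected feature: {name}")
    return errors


good = {"amount": 12.5, "country": "DE", "account_age_days": 30}
bad = {"amount": "12.5", "country": "DE"}

print(validate_features(good, TRAINING_SCHEMA))  # []
print(validate_features(bad, TRAINING_SCHEMA))
# ['amount: expected float, got str', 'missing feature: account_age_days']
```

Rejecting or logging such requests at the serving boundary turns silent skew into a visible error.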
35
+ ### Feature Store Architecture
36
+
37
+ A feature store is a data infrastructure component that stores and serves pre-computed features:
38
+
39
+ ```
40
+                     ┌──────────────────┐
41
+ Data Sources ──────►│ Feature Pipeline │
42
+                     └────────┬─────────┘
43
+                              │ compute features
44
+                     ┌────────▼─────────┐
45
+                     │  Feature Store   │
46
+                     │  ┌───────────┐   │
47
+                     │  │ Offline   │   │──► Training Data
48
+                     │  │ (S3/GCS)  │   │
49
+                     │  ├───────────┤   │
50
+                     │  │ Online    │   │──► Serving (low-latency)
51
+                     │  │ (Redis)   │   │
52
+                     └──────────────────┘
53
+ ```
54
+
55
+ **Offline store**: Historical features for training. Backed by object storage (S3, GCS) or a columnar database (BigQuery, Snowflake). Supports point-in-time correct feature retrieval (no data leakage).
56
+
57
+ **Online store**: Low-latency feature serving for real-time inference. Backed by Redis, DynamoDB, or Cassandra. Stores only the latest feature values.
58
+
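Point-in-time correctness, mentioned for the offline store, means a training example may only see feature values that were known at the time of the labelled event. A minimal sketch, assuming each feature's history is a list of `(timestamp, value)` pairs sorted by timestamp (the `feature_as_of` helper is hypothetical and not the API of any feature store):

```python
from bisect import bisect_right


def feature_as_of(history, event_time):
    """Return the latest value recorded at or before event_time, or None.

    history must be a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx else None


# Number of purchases for one user, recorded at three points in time.
purchase_count_history = [(100, 3), (200, 5), (300, 9)]

print(feature_as_of(purchase_count_history, 250))  # 5 (the value 9 is in the future)
print(feature_as_of(purchase_count_history, 50))   # None (nothing known yet)
```

Using the latest value (9) for an event at time 250 would leak future information into training, which is exactly the data leakage the offline store is meant to prevent.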
59
+ **Implementations**: Feast (open source), Tecton, Vertex AI Feature Store, SageMaker Feature Store.
60
+
61
+ A feature store is justified when:
62
+ - Multiple models use the same features (DRY for features)
63
+ - Training-serving skew is causing production issues
64
+ - Feature computation is expensive and should be shared
65
+
66
+ ### Model Registry
67
+
68
+ The model registry is the governance layer between training and production. Every model that has ever been trained should have a registry record:
69
+
70
+ ```
71
+ Training Run ──► Model Artifacts ──► Registry ──► Deployment
72
+                  (weights, config)   (metadata,   (staging,
73
+                                       lineage)     production)
74
+ ```
75
+
76
+ **Registry metadata per model version**:
77
+ - Model name and semantic version (`fraud-detector-v2.3.1`)
78
+ - Training run ID and commit SHA (full lineage)
79
+ - Training metrics (AUC, F1, etc.)
80
+ - Evaluation metrics on holdout sets
81
+ - Training dataset version
82
+ - Model schema (input/output feature names and types)
83
+ - Serving requirements (runtime, memory, GPU)
84
+ - Promotion history (who promoted, when, why)
85
+
86
+ **Lifecycle stages**:
87
+ ```
88
+ None → Staging → Production → Archived
89
+ ```
90
+
91
+ - Validation gates control promotion: automated tests (accuracy thresholds, latency budgets) plus optional human approval
92
+ - Production always has at least one previous version for rollback
93
+ - Archive on a schedule (keep 6 months of production versions)
94
+
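The promotion rules above can be expressed as a small gate function. This is a sketch under assumptions: the `Candidate` fields, the threshold defaults, and the `can_promote` helper are hypothetical, and registries such as MLflow expose their own APIs for stage transitions.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = ["None", "Staging", "Production", "Archived"]


@dataclass
class Candidate:
    name: str
    stage: str
    auc: float
    p99_latency_ms: float


def can_promote(candidate: Candidate, target: str, *, min_auc: float = 0.90,
                max_p99_ms: float = 200.0, require_approval: bool = False,
                approved_by: Optional[str] = None):
    """Return (allowed, reasons). Thresholds are illustrative defaults."""
    reasons = []
    if STAGES.index(target) != STAGES.index(candidate.stage) + 1:
        reasons.append(f"cannot move from {candidate.stage} to {target}")
    if candidate.auc < min_auc:
        reasons.append(f"AUC {candidate.auc:.3f} below threshold {min_auc:.3f}")
    if candidate.p99_latency_ms > max_p99_ms:
        reasons.append(f"P99 {candidate.p99_latency_ms:.0f}ms exceeds budget {max_p99_ms:.0f}ms")
    if require_approval and approved_by is None:
        reasons.append("human approval required")
    return (not reasons, reasons)


model = Candidate("fraud-detector", "Staging", auc=0.93, p99_latency_ms=120.0)
print(can_promote(model, "Production"))  # (True, [])
```

Returning the list of reasons, rather than a bare boolean, gives the promotion history something concrete to record.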
95
+ **Implementations**: MLflow Model Registry, Weights & Biases Model Registry, SageMaker Model Registry.
96
+
97
+ ### Online vs. Offline Inference
98
+
99
+ **Online inference** (real-time, synchronous):
100
+ - Model returns predictions in response to a request, typically within 100–500ms
101
+ - Examples: fraud scoring at checkout, recommendation on page load, search ranking
102
+ - Infrastructure: Model server (TorchServe, Triton, BentoML), autoscaled behind a load balancer
103
+ - Key considerations: latency budget, model size (must fit in serving memory), cold start time
104
+
105
+ **Offline inference** (batch, asynchronous):
106
+ - Model scores a large dataset on a schedule, stores predictions for later retrieval
107
+ - Examples: churn prediction for all users (run nightly), content pre-scoring for a recommendation cache
108
+ - Infrastructure: Spark job, Airflow DAG, Ray cluster, or simple Python script on a large VM
109
+ - Key considerations: throughput (records/second), data pipeline integration, prediction freshness
110
+
111
+ **Near-real-time / stream inference**:
112
+ - Model scores events from a stream (Kafka, Kinesis) with seconds-to-minutes latency
113
+ - Examples: anomaly detection on clickstream, session-level personalisation
114
+ - Infrastructure: Kafka consumer + model inference worker, Flink ML, or Spark Structured Streaming
115
+ - Key considerations: exactly-once semantics, ordering guarantees, backpressure handling
116
+
117
+ **Decision matrix**:
118
+
119
+ | Use Case | Latency Requirement | Data Freshness | Approach |
120
+ |----------|---------------------|----------------|----------|
121
+ | Checkout fraud | < 500ms | Real-time | Online inference |
122
+ | Churn prediction | N/A | Daily | Offline batch |
123
+ | Email personalisation | N/A (send time) | Daily | Offline batch |
124
+ | Feed ranking | < 200ms | Real-time | Online inference |
125
+ | Anomaly detection | < 30 seconds | Streaming | Stream inference |
126
+
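The matrix above reduces to a simple rule of thumb keyed on the latency budget. The sketch below encodes it. The one-second cutoff between online and stream inference is an assumption chosen for illustration, the `choose_inference_mode` helper is hypothetical, and data freshness is left out for brevity.

```python
from typing import Optional

# Assumed cutoff: budgets at or below one second need synchronous serving.
ONLINE_MAX_SECONDS = 1.0


def choose_inference_mode(latency_budget_s: Optional[float]) -> str:
    """Map a latency budget in seconds (None = no request-time budget) to an approach."""
    if latency_budget_s is None:
        return "offline batch"
    if latency_budget_s <= ONLINE_MAX_SECONDS:
        return "online inference"
    return "stream inference"


use_cases = [
    ("checkout fraud", 0.5),
    ("churn prediction", None),
    ("feed ranking", 0.2),
    ("anomaly detection", 30.0),
]
for name, budget in use_cases:
    print(f"{name}: {choose_inference_mode(budget)}")
```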
127
+ ### ML System Design Patterns
128
+
129
+ **Lambda Architecture for ML**:
130
+ - Batch layer: nightly model retraining or batch scoring
131
+ - Speed layer: real-time model updates or online scoring
132
+ - Serving layer: unified API serving precomputed (batch) + real-time predictions
133
+ - Complexity cost is high — evaluate whether the freshness gain justifies it
134
+
135
+ **Two-Stage Retrieval and Ranking** (recommendation systems):
136
+ - Candidate generation stage: fast approximate nearest-neighbour retrieval (ANN index), often over embeddings produced by a two-tower model (separate user and item encoders)
137
+ - Ranking stage: expensive full model scoring of the top-K candidates
138
+ - Separates recall (retrieve thousands of candidates quickly) from precision (rank them accurately)
139
+
140
+ **Shadow Mode Deployment**:
141
+ - New model runs in parallel with production model, receiving real traffic
142
+ - New model's predictions are not served but are logged and evaluated
143
+ - Safe way to validate new models with real data before full deployment
144
+
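A minimal sketch of shadow mode, assuming models are plain callables and the log is an in-memory list (both are simplifications, and `handle_request` is a hypothetical name). In a real service the shadow call would usually run asynchronously so it cannot add latency to the response.

```python
def handle_request(features, production_model, shadow_model, shadow_log):
    """Serve the production prediction; record the shadow prediction for later evaluation."""
    served = production_model(features)
    try:
        shadow = shadow_model(features)
        shadow_log.append({"features": features, "served": served, "shadow": shadow})
    except Exception as exc:
        # A failing shadow model must never affect the user-facing response.
        shadow_log.append({"features": features, "served": served, "shadow_error": repr(exc)})
    return served


def production_model(features):
    return 0.10  # stand-in for the current model's score


def candidate_model(features):
    return 0.35  # stand-in for the new model's score


log = []
print(handle_request({"amount": 42.0}, production_model, candidate_model, log))  # 0.1
print(log[0]["shadow"])  # 0.35
```

Comparing `served` and `shadow` values offline shows how often the two models disagree before any traffic is switched.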
145
+ ### Architecture Decision Record Template for ML
146
+
147
+ ```markdown
148
+ # ADR-ML-001: Online vs. Offline Inference for Recommendation
149
+
150
+ ## Status
151
+ Accepted — 2024-03-15
152
+
153
+ ## Context
154
+ Product recommends items to users on homepage. 5M daily active users.
155
+ Data team produces updated user embeddings daily. Item catalog updates hourly.
156
+
157
+ ## Decision
158
+ Use offline batch inference for recommendation.
159
+ - Nightly batch job scores all user-item pairs for top 100 candidates
160
+ - Recommendations stored in Redis with TTL of 26 hours
161
+ - API reads from Redis — zero model inference at request time
162
+
163
+ ## Consequences
164
+ - P99 homepage latency drops from 450ms to 20ms (Redis lookup vs. model inference)
165
+ - Recommendations are up to 26 hours stale — acceptable given content update frequency
166
+ - Requires Redis cluster (operational cost)
167
+ - Model update cycle is daily — cannot react to same-session behaviour
168
+
169
+ ## Alternatives Rejected
170
+ - Online inference: latency budget exceeded (model P99 = 380ms)
171
+ - Streaming inference: engineering complexity not justified given daily embedding update cadence
172
+ ```
@@ -0,0 +1,209 @@
1
+ ---
2
+ name: ml-conventions
3
+ description: Experiment naming, model versioning, reproducibility via random seeds, config-as-code patterns, and team conventions for ML projects
4
+ topics: [ml, conventions, reproducibility, versioning, config, experiments]
5
+ ---
6
+
7
+ ML projects without conventions degenerate into chaos within weeks: unnamed experiments with lost hyperparameters, models named `model_v2_final_FINAL.pkl`, and results that cannot be reproduced. Unlike software engineering, where the compiler enforces structure, ML workflows are loose scripts and notebooks that require disciplined conventions to remain comprehensible. Establish these conventions at project start and encode them in tooling so they are followed by default rather than by willpower.
8
+
9
+ ## Summary
10
+
11
+ ML conventions cover experiment naming (structured, searchable identifiers), model versioning (semantic or content-addressed), reproducibility (seeding all random sources, recording environment), and config-as-code (no magic numbers in code, all hyperparameters in config files). These conventions are not optional hygiene — they are the infrastructure that makes ML engineering a repeatable discipline rather than a research lottery.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Experiment Naming
16
+
17
+ Every training run that produces artifacts or results must have a unique, human-readable identifier. Ad-hoc names like `test`, `v2`, or `new_model` are unusable at scale:
18
+
19
+ **Recommended format**: `{model_type}-{dataset}-{date}-{purpose}[-{variant}]`
20
+
21
+ Examples:
22
+ - `resnet50-imagenet-20240315-baseline`
23
+ - `bert-sst2-20240315-lr-sweep`
24
+ - `xgboost-churn-20240320-feature-v3`
25
+ - `gpt2-reviews-20240322-dropout-ablation`
26
+
27
+ **Rules**:
28
+ - All lowercase, hyphen-separated (no spaces, no underscores)
29
+ - Date in `YYYYMMDD` format (sorts chronologically)
30
+ - Purpose is human-readable and specific — not `experiment1` or `test`
31
+ - Variant suffix for ablations and sweeps (`-v2`, `-no-dropout`, `-lr-1e-3`)
32
+
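The naming rules above are easy to enforce in tooling. Here is a minimal validator sketch. It assumes that the model type and dataset segments contain no hyphens, which the format does not state explicitly, and it cannot tell a multi-word purpose apart from a variant suffix, so both are matched as one trailing group. The pattern and helper names are hypothetical.

```python
import re
from datetime import datetime

# {model_type}-{dataset}-{YYYYMMDD}-{purpose}[-{variant}], lowercase and hyphen-separated.
NAME_PATTERN = re.compile(
    r"^(?P<model>[a-z0-9]+)-(?P<dataset>[a-z0-9]+)-(?P<date>\d{8})"
    r"-(?P<purpose>[a-z0-9]+(?:-[a-z0-9]+)*)$"
)


def is_valid_experiment_name(name: str) -> bool:
    """Check the structure of the name and that the date segment is a real date."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return False
    try:
        datetime.strptime(match["date"], "%Y%m%d")
    except ValueError:
        return False
    return True


print(is_valid_experiment_name("resnet50-imagenet-20240315-baseline"))  # True
print(is_valid_experiment_name("bert-sst2-20240315-lr-sweep"))          # True
print(is_valid_experiment_name("new_model"))                            # False
print(is_valid_experiment_name("bert-sst2-20241345-lr-sweep"))          # False (month 13)
```

A check like this can run in a pre-launch hook so that badly named runs never reach the experiment tracker.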
33
+ Many teams use auto-generated identifiers (MLflow, for example, assigns each run a UUID-based run ID automatically) and rely on tagging/metadata for search. This is fine as a secondary system, but always add a human-readable display name.
34
+
35
+ ### Model Versioning
36
+
37
+ Model versioning is distinct from experiment tracking. A version is a production artifact; an experiment is a training run:
38
+
39
+ **Semantic versioning for models**:
40
+ - `v{major}.{minor}.{patch}` — consistent with software versioning
41
+ - Major: Breaking change in model interface (input/output schema, preprocessing contract)
42
+ - Minor: Meaningful accuracy improvement or new feature support
43
+ - Patch: Bug fix, minor data update, no interface change
44
+
45
+ **Content-addressed versioning** (used by DVC; MLflow Model Registry instead assigns incrementing integer version numbers):
46
+ - Models are identified by a hash of their weights + config
47
+ - Prevents accidental overwriting
48
+ - Enables exact reproducibility — "which weights produced this prediction?"
49
+
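Content addressing can be sketched with a standard hash function. The example below derives a short identifier from the serialized weights and a canonical JSON form of the config. The `model_version_id` helper and the 12-character truncation are assumptions for illustration and do not reflect how any particular tool computes its hashes.

```python
import hashlib
import json


def model_version_id(weights: bytes, config: dict) -> str:
    """Derive a short, deterministic identifier from weights and config."""
    digest = hashlib.sha256()
    digest.update(weights)
    digest.update(b"\x00")  # separator between the two inputs
    # sort_keys makes the hash independent of dict key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()[:12]


weights = b"\x01\x02\x03"  # stand-in for a serialized checkpoint
id_a = model_version_id(weights, {"lr": 0.001, "layers": 4})
id_b = model_version_id(weights, {"layers": 4, "lr": 0.001})
id_c = model_version_id(weights + b"\x04", {"lr": 0.001, "layers": 4})

print(id_a == id_b)  # True: same content, different key order
print(id_a == id_c)  # False: the weights differ
```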
50
+ **Registry-based model lifecycle**:
51
+ ```
52
+ Staging → Validation → Production → Archived
53
+ ```
54
+ - Never promote directly to Production — always pass through Staging validation
55
+ - Keep at least one previous Production version for instant rollback
56
+ - Document promotion reason: "Promoted: +2.3% AUC on Q1 eval set, latency within budget"
57
+
58
+ ### Reproducibility
59
+
60
+ ML reproducibility means: given the same code, data, and config, the same model is produced. Achieve it through four controls:
61
+
62
+ **1. Random seed management**
63
+
64
+ Set all random sources before any computation:
65
+ ```python
66
+ import random
67
+ import numpy as np
68
+ import torch
69
+
70
+ def set_seed(seed: int) -> None:
71
+     random.seed(seed)
72
+     np.random.seed(seed)
73
+     torch.manual_seed(seed)
74
+     torch.cuda.manual_seed_all(seed)
75
+     # For full determinism (may impact performance)
76
+     torch.backends.cudnn.deterministic = True
77
+     torch.backends.cudnn.benchmark = False
78
+ ```
79
+
80
+ Record the seed in experiment config. Default seed: `42` (or any fixed value — consistency matters more than the value). When running hyperparameter sweeps with multiple seeds, record all seeds and report mean ± std.
81
+
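Reporting across several seeds might look like the sketch below. It uses only the standard library, and `run_trial` is a hypothetical stand-in that returns a synthetic metric; a real version would call `set_seed(seed)` and train the model.

```python
import random
from statistics import mean, stdev


def run_trial(seed: int) -> float:
    """Stand-in for one training run; returns a synthetic validation metric."""
    rng = random.Random(seed)  # isolated generator, so trials do not interfere
    return 0.90 + rng.uniform(-0.02, 0.02)


def summarize(seeds):
    """Run one trial per seed and report the mean and sample standard deviation."""
    scores = [run_trial(seed) for seed in seeds]
    return {"seeds": list(seeds), "mean": mean(scores), "std": stdev(scores)}


report = summarize([0, 1, 2, 3, 4])
print(f"accuracy: {report['mean']:.4f} ± {report['std']:.4f} over seeds {report['seeds']}")
```

Because each trial is fully determined by its seed, rerunning `summarize` with the same seeds reproduces the same report.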
82
+ **2. Dependency pinning**
83
+
84
+ Pin all dependencies to exact versions:
85
+ ```toml
86
+ # pyproject.toml (Poetry)
87
+ [tool.poetry.dependencies]
88
+ python = "3.11.4"
89
+ torch = "2.1.0"
90
+ transformers = "4.35.2"
91
+ numpy = "1.26.0"
92
+ ```
93
+
94
+ Use `poetry.lock` or `requirements.txt` generated by `pip freeze`. Never use unpinned dependencies (`torch>=2.0`) in a training environment.
95
+
96
+ **3. Data versioning**
97
+
98
+ Record the exact dataset version used for each training run:
99
+ - DVC: content-addressed data with `dvc add` and `.dvc` pointers
100
+ - Dataset registry: log dataset name + version + hash in experiment metadata
101
+ - SQL-based datasets: log the query hash and execution timestamp
102
+
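For the dataset registry and SQL cases above, the metadata to log can be computed with the standard library. A minimal sketch follows. The helper names and the whitespace normalization of the query are assumptions, and the returned dictionary would be attached to the run in whatever experiment tracker is in use.

```python
import hashlib
from datetime import datetime, timezone


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a dataset file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def query_fingerprint(sql: str) -> dict:
    """Hash a whitespace-normalized query and record when it was executed."""
    normalized = " ".join(sql.split())
    return {
        "query_sha256": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }


a = query_fingerprint("SELECT user_id, churned\nFROM   users")
b = query_fingerprint("SELECT user_id, churned FROM users")
print(a["query_sha256"] == b["query_sha256"])  # True: formatting does not change the hash
```

The query hash alone does not pin the data, because the underlying tables can change; that is why the execution timestamp is logged alongside it.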
103
+ **4. Environment reproducibility**
104
+
105
+ Capture the full environment:
106
+ ```bash
107
+ # Save environment
108
+ conda env export > environment.yml
109
+ pip freeze > requirements-frozen.txt
110
+
111
+ # Record GPU driver and CUDA version
112
+ nvidia-smi --query-gpu=driver_version,name --format=csv
113
+ nvcc --version
114
+ ```
115
+
116
+ For full environment isolation, use Docker. The Dockerfile is the environment specification.
117
+
118
+ ### Config-as-Code
119
+
120
+ No magic numbers in code. Every hyperparameter, data path, and training setting belongs in a config file:
121
+
122
+ **Bad** (magic numbers scattered in code):
123
+ ```python
124
+ optimizer = Adam(model.parameters(), lr=0.001)
125
+ scheduler = CosineAnnealingLR(optimizer, T_max=100)
126
+ train_loader = DataLoader(dataset, batch_size=32, num_workers=4)
127
+ ```
128
+
129
+ **Good** (config-driven):
130
+ ```yaml
131
+ # configs/train.yaml
132
+ training:
133
+   seed: 42
134
+   epochs: 100
135
+   batch_size: 32
136
+   num_workers: 4
137
+
138
+ optimizer:
139
+   type: adam
140
+   lr: 1.0e-3
141
+   weight_decay: 1.0e-4
142
+
143
+ scheduler:
144
+   type: cosine_annealing
145
+   t_max: 100
146
+ ```
147
+
148
+ ```python
149
+ # src/training/train.py
150
+ def train(cfg: DictConfig) -> None:
151
+     set_seed(cfg.training.seed)
152
+     optimizer = build_optimizer(model, cfg.optimizer)
153
+     scheduler = build_scheduler(optimizer, cfg.scheduler)
154
+ ```
155
+
156
+ Use **Hydra** (Meta) or **OmegaConf** for hierarchical config management with CLI override support:
157
+ ```bash
158
+ # Override from CLI without changing config files
159
+ python train.py optimizer.lr=1e-4 training.batch_size=64
160
+ ```
161
+
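To show what such dotted overrides do conceptually, here is a standard-library sketch that applies `key.path=value` strings to a nested dictionary. It is not Hydra's or OmegaConf's implementation, whose override grammar is considerably richer, and `apply_overrides` is a hypothetical helper.

```python
import ast
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with 'a.b=value' overrides applied to existing keys."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node[key]  # raises KeyError for an unknown section
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        try:
            node[leaf] = ast.literal_eval(raw_value)  # numbers, booleans, lists
        except (ValueError, SyntaxError):
            node[leaf] = raw_value  # fall back to a plain string
    return result


base = {
    "training": {"seed": 42, "batch_size": 32},
    "optimizer": {"type": "adam", "lr": 1.0e-3},
}
cfg = apply_overrides(base, ["optimizer.lr=1e-4", "training.batch_size=64"])

print(cfg["optimizer"]["lr"])         # 0.0001
print(cfg["training"]["batch_size"])  # 64
print(base["optimizer"]["lr"])        # 0.001 (the base config is not mutated)
```

Rejecting unknown keys catches typos such as `optimizer.learning_rate=...` that would otherwise be silently ignored.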
162
+ **Config file organization**:
163
+ ```
164
+ configs/
165
+   base.yaml            # Default config for all experiments
166
+   model/
167
+     resnet50.yaml
168
+     vit-b16.yaml
169
+   data/
170
+     imagenet.yaml
171
+     cifar10.yaml
172
+   training/
173
+     fast.yaml          # Low-epoch for debugging
174
+     full.yaml          # Production training
175
+ ```
176
+
177
+ ### Code and Notebook Conventions
178
+
179
+ **Notebooks are for exploration, not production**:
180
+ - Notebooks belong in `notebooks/` — never in `src/`
181
+ - Notebook outputs must be cleared before committing (no large outputs committed to git)
182
+ - Meaningful results from notebooks are refactored into `src/` modules with tests
183
+
184
+ **Module structure conventions**:
185
+ - `src/data/` — dataset classes, data loaders, preprocessing transforms
186
+ - `src/models/` — model architectures (no training logic)
187
+ - `src/training/` — training loop, loss functions, callbacks
188
+ - `src/evaluation/` — metrics, evaluation runners
189
+ - `src/serving/` — inference code, prediction pipelines
190
+
191
+ **Naming conventions**:
192
+ - Files: `snake_case.py`
193
+ - Classes: `PascalCase` (e.g., `ResNet50Classifier`, `ChurnDataset`)
194
+ - Functions: `snake_case` (e.g., `compute_f1_score`, `load_checkpoint`)
195
+ - Constants: `UPPER_SNAKE_CASE` (e.g., `MAX_SEQ_LENGTH = 512`)
196
+ - Config keys: `snake_case` in YAML
197
+
198
+ ### Checklist Before Starting a Training Run
199
+
200
+ ```
201
+ [ ] Experiment name follows naming convention
202
+ [ ] Random seed set and recorded in config
203
+ [ ] Config file committed (not just command-line overrides)
204
+ [ ] Dataset version recorded
205
+ [ ] Experiment tracker (MLflow/W&B) initialized with run metadata
206
+ [ ] Code committed to git (note the commit SHA in the experiment)
207
+ [ ] Output directory created and named consistently
208
+ [ ] Hardware/environment recorded
209
+ ```