@zigrivers/scaffold 3.7.0 → 3.9.0
- package/README.md +113 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/library/library-api-design.md +306 -0
- package/content/knowledge/library/library-architecture.md +247 -0
- package/content/knowledge/library/library-bundling.md +244 -0
- package/content/knowledge/library/library-conventions.md +229 -0
- package/content/knowledge/library/library-dev-environment.md +220 -0
- package/content/knowledge/library/library-documentation.md +300 -0
- package/content/knowledge/library/library-project-structure.md +237 -0
- package/content/knowledge/library/library-requirements.md +173 -0
- package/content/knowledge/library/library-security.md +257 -0
- package/content/knowledge/library/library-testing.md +319 -0
- package/content/knowledge/library/library-type-definitions.md +284 -0
- package/content/knowledge/library/library-versioning.md +300 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/knowledge/mobile-app/mobile-app-architecture.md +283 -0
- package/content/knowledge/mobile-app/mobile-app-conventions.md +180 -0
- package/content/knowledge/mobile-app/mobile-app-deployment.md +298 -0
- package/content/knowledge/mobile-app/mobile-app-dev-environment.md +257 -0
- package/content/knowledge/mobile-app/mobile-app-distribution.md +264 -0
- package/content/knowledge/mobile-app/mobile-app-observability.md +317 -0
- package/content/knowledge/mobile-app/mobile-app-offline-patterns.md +311 -0
- package/content/knowledge/mobile-app/mobile-app-project-structure.md +245 -0
- package/content/knowledge/mobile-app/mobile-app-push-notifications.md +321 -0
- package/content/knowledge/mobile-app/mobile-app-requirements.md +147 -0
- package/content/knowledge/mobile-app/mobile-app-security.md +338 -0
- package/content/knowledge/mobile-app/mobile-app-testing.md +400 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/library-overlay.yml +67 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/content/methodology/mobile-app-overlay.yml +71 -0
- package/dist/cli/commands/init.d.ts +22 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +202 -3
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +190 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +1456 -80
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +87 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +312 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +55 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -1
- package/dist/e2e/project-type-overlays.test.d.ts.map +1 -1
- package/dist/e2e/project-type-overlays.test.js +780 -14
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +16 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +28 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +127 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +224 -4
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +22 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +28 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,300 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: library-versioning
|
|
3
|
+
description: Semver discipline, breaking change detection, release automation, and changelog management for published libraries
|
|
4
|
+
topics: [library, versioning, semver, breaking-changes, release-automation, changelog, changesets]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Library versioning is a communication protocol with consumers. Semver (Semantic Versioning) is not merely a numbering scheme — it is a contract about backward compatibility. Breaking that contract without a major version bump is one of the most damaging things a library can do. Consumers set version ranges expecting that minor updates are safe to take automatically. Violating that expectation causes production incidents for real applications. Versioning discipline must be enforced by tooling, not willpower.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Enforce semver through tooling: use changesets or semantic-release to automate versioning based on change metadata. Use automated breaking change detection (API Extractor or type-coverage checks) to catch accidental breaking changes before publish. Every release requires a CHANGELOG entry with migration guidance for breaking changes. Pre-releases (`alpha`, `beta`, `rc`) allow consumers to opt into early testing without affecting stable installs. Tag releases in git to enable diff-based changelog generation.
|
|
12
|
+
|
|
13
|
+
Versioning workflow:
|
|
14
|
+
1. Author creates changeset file describing the change type (patch/minor/major)
|
|
15
|
+
2. CI aggregates changesets and proposes a version bump PR
|
|
16
|
+
3. Version bump PR merges, triggering publish to npm
|
|
17
|
+
4. Git tag pushed matching the published version
|
|
18
|
+
5. GitHub Release created from changelog content
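Step 2 of the workflow above amounts to taking the highest bump level among the pending changesets. The following is a minimal Python sketch of that idea, not the Changesets implementation: `next_version` and `BUMP_ORDER` are hypothetical names, and pre-release tags and 0.x conventions are deliberately ignored.

```python
# Illustrative only: this is not how @changesets/cli is implemented.
BUMP_ORDER = {"patch": 0, "minor": 1, "major": 2}


def next_version(current: str, bumps: list[str]) -> str:
    """Return the version after applying the highest bump found in `bumps`."""
    if not bumps:
        return current  # no pending changesets, so nothing to release
    major, minor, patch = (int(part) for part in current.split("."))
    highest = max(bumps, key=BUMP_ORDER.__getitem__)
    if highest == "major":
        return f"{major + 1}.0.0"
    if highest == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

One `minor` changeset among several `patch` changesets is enough to make the release a minor release, which is why each change author only needs to classify their own change.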
|
|
19
|
+
|
|
20
|
+
## Deep Guidance
|
|
21
|
+
|
|
22
|
+
### Changesets Workflow
|
|
23
|
+
|
|
24
|
+
Changesets is the recommended tool for managing versioning in library projects. It decouples the decision of "what version bump does this change require" from "when do we publish."
|
|
25
|
+
|
|
26
|
+
**Setup:**
|
|
27
|
+
```bash
|
|
28
|
+
npm install --save-dev @changesets/cli
|
|
29
|
+
npx changeset init
|
|
30
|
+
```
|
|
31
|
+
|
|
32
|
+
This creates a `.changeset/` directory at the project root.
|
|
33
|
+
|
|
34
|
+
**Creating a changeset (run for every PR that changes behavior):**
|
|
35
|
+
```bash
|
|
36
|
+
npx changeset add
|
|
37
|
+
# Interactive prompt:
|
|
38
|
+
# ? Which packages would you like to include? my-library
|
|
39
|
+
# ? What type of change is this for my-library?
|
|
40
|
+
# major (Breaking change)
|
|
41
|
+
# minor (New feature)
|
|
42
|
+
# > patch (Bug fix)
|
|
43
|
+
# ? Please enter a summary for this change:
|
|
44
|
+
# Fix parseConfig() incorrectly ignoring the encoding option
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
This creates a markdown file in `.changeset/`:
|
|
48
|
+
```markdown
|
|
49
|
+
<!-- .changeset/silver-wolves-grin.md -->
|
|
50
|
+
---
|
|
51
|
+
"my-library": patch
|
|
52
|
+
---
|
|
53
|
+
|
|
54
|
+
Fix parseConfig() incorrectly ignoring the encoding option when parsing file input.
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
**Version bump and publish:**
|
|
58
|
+
```bash
|
|
59
|
+
# Update package.json version and CHANGELOG.md
|
|
60
|
+
npx changeset version
|
|
61
|
+
|
|
62
|
+
# Publish to npm
|
|
63
|
+
npx changeset publish
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
In CI, automate this with the Changesets GitHub Action:
|
|
67
|
+
```yaml
|
|
68
|
+
# .github/workflows/release.yml
|
|
69
|
+
# name: Release — CI publishes on push to main
|
|
70
|
+
on:
  push:
    branches: [main]

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          registry-url: 'https://registry.npmjs.org'
      - run: npm ci
      - uses: changesets/action@v1
        with:
          publish: npm run release
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
The action opens a "Version Packages" PR when changesets are present and publishes when that PR merges.
|
|
93
|
+
|
|
94
|
+
### Breaking Change Detection with API Extractor
|
|
95
|
+
|
|
96
|
+
Microsoft's API Extractor catches breaking changes by comparing the current API surface against a committed baseline:
|
|
97
|
+
|
|
98
|
+
**Setup:**
|
|
99
|
+
```bash
|
|
100
|
+
npm install --save-dev @microsoft/api-extractor
|
|
101
|
+
npx api-extractor init
|
|
102
|
+
```
|
|
103
|
+
|
|
104
|
+
```json
|
|
105
|
+
// api-extractor.json (key settings)
|
|
106
|
+
{
|
|
107
|
+
"mainEntryPointFilePath": "<projectFolder>/dist/types/index.d.ts",
|
|
108
|
+
"apiReport": {
|
|
109
|
+
"enabled": true,
|
|
110
|
+
"reportFolder": "<projectFolder>/etc/"
|
|
111
|
+
},
|
|
112
|
+
"docModel": {
|
|
113
|
+
"enabled": true
|
|
114
|
+
}
|
|
115
|
+
}
|
|
116
|
+
```
|
|
117
|
+
|
|
118
|
+
```bash
|
|
119
|
+
# Generate initial API report (commit this file)
|
|
120
|
+
npx api-extractor run --local
|
|
121
|
+
|
|
122
|
+
# In CI: compare against committed report
|
|
123
|
+
npx api-extractor run
|
|
124
|
+
# Fails if the API surface changed in ways not reflected in the committed report
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
The generated `etc/my-library.api.md` file shows the complete public API surface in a reviewable format. When a PR changes it, reviewers can see exactly what changed. If the change is intentional, update the committed report; if not, fix the breaking change.
|
|
128
|
+
|
|
129
|
+
**API report excerpt:**
|
|
130
|
+
```markdown
|
|
131
|
+
// @public
|
|
132
|
+
export function parseConfig(input: string, options?: ParseOptions): Config;
|
|
133
|
+
|
|
134
|
+
// @public
|
|
135
|
+
export interface ParseOptions {
|
|
136
|
+
encoding?: BufferEncoding;
|
|
137
|
+
strict?: boolean;
|
|
138
|
+
}
|
|
139
|
+
|
|
140
|
+
// @public
|
|
141
|
+
export class ParseError extends Error {
|
|
142
|
+
constructor(message: string, line: number, column: number);
|
|
143
|
+
readonly column: number;
|
|
144
|
+
readonly line: number;
|
|
145
|
+
}
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
This format makes breaking changes immediately visible in code review.
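To illustrate why a committed report makes removals easy to spot, here is a rough Python sketch that lists declarations present in a baseline report but absent from the current one. It is a plain text diff under simplifying assumptions, not a substitute for API Extractor, and `removed_api_lines` is a hypothetical helper name.

```python
import difflib


def removed_api_lines(baseline: str, current: str) -> list[str]:
    """Return non-empty lines that appear in `baseline` but not in `current`."""
    delta = difflib.ndiff(baseline.splitlines(), current.splitlines())
    # ndiff prefixes lines unique to the first input with "- ".
    return [
        line[2:].strip()
        for line in delta
        if line.startswith("- ") and line[2:].strip()
    ]
```

Any line this returns is either a removed or a modified declaration, so it is a candidate breaking change that a reviewer should classify before the report is updated.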
|
|
149
|
+
|
|
150
|
+
### Pre-Release Channels
|
|
151
|
+
|
|
152
|
+
Pre-releases allow consumers to test upcoming changes without affecting stable installs:
|
|
153
|
+
|
|
154
|
+
**With changesets:**
|
|
155
|
+
```bash
|
|
156
|
+
# Enter pre-release mode
npx changeset pre enter alpha
# Or: beta, rc

# Create changesets and version as normal
npx changeset add
npx changeset version
# Produces e.g. 2.0.0-alpha.0 (the pre-release counter starts at 0)

# Exit pre-release mode
npx changeset pre exit
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
**Manual pre-release versioning:**
|
|
170
|
+
```json
|
|
171
|
+
// package.json
|
|
172
|
+
"version": "2.0.0-alpha.1"
|
|
173
|
+
```
|
|
174
|
+
|
|
175
|
+
```bash
|
|
176
|
+
npm publish --tag alpha
|
|
177
|
+
# Consumers opt-in: npm install my-library@alpha
|
|
178
|
+
# Stable consumers (npm install my-library) are unaffected
|
|
179
|
+
```
|
|
180
|
+
|
|
181
|
+
**Pre-release channel strategy:**
|
|
182
|
+
- `alpha` — internal testing only, may change drastically, no API stability
|
|
183
|
+
- `beta` — public testing, API reasonably stable, looking for feedback
|
|
184
|
+
- `rc` (release candidate) — API frozen, looking for final integration issues
|
|
185
|
+
- Stable — semver protected, change policy enforced
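The channel order above lines up with SemVer precedence: `alpha` sorts before `beta`, `beta` before `rc`, and any pre-release sorts before the stable version it leads to. A small Python sketch of the SemVer 2.0.0 precedence rules follows; `precedence_key` is a hypothetical helper, and it assumes well-formed version strings.

```python
def precedence_key(version: str):
    """Sort key following SemVer 2.0.0 precedence (build metadata is ignored)."""
    version = version.split("+", 1)[0]
    core, _, pre = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    if not pre:
        # A version without a pre-release tag outranks every pre-release of it.
        return (major, minor, patch, 1, ())
    # Numeric identifiers compare numerically and rank below alphanumeric ones.
    identifiers = tuple(
        (0, int(ident), "") if ident.isdigit() else (1, 0, ident)
        for ident in pre.split(".")
    )
    return (major, minor, patch, 0, identifiers)
```

Sorting a list of versions with `sorted(versions, key=precedence_key)` places `2.0.0-alpha.10` after `2.0.0-alpha.2`, which a plain string sort would get wrong.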
|
|
186
|
+
|
|
187
|
+
### Release Automation with GitHub Actions
|
|
188
|
+
|
|
189
|
+
Full release workflow with provenance and attestation:
|
|
190
|
+
|
|
191
|
+
```yaml
|
|
192
|
+
# .github/workflows/release.yml
|
|
193
|
+
# name: Release — CI publishes on tag push
|
|
194
|
+
on:
  push:
    tags:
      - 'v*'

permissions:
  contents: write
  id-token: write # For npm provenance

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          registry-url: 'https://registry.npmjs.org'

      - name: Install dependencies
        run: npm ci

      - name: Build
        run: npm run build

      - name: Test
        run: npm test

      - name: Publish to npm
        run: npm publish --provenance --access public
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}

      - name: Create GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          generate_release_notes: true
          draft: false
|
|
233
|
+
```
|
|
234
|
+
|
|
235
|
+
The `--provenance` flag publishes npm provenance attestation — a cryptographic link between the published package and the GitHub Actions run that built it. This allows consumers to verify the package was built from the expected source.
|
|
236
|
+
|
|
237
|
+
### CHANGELOG Generation
|
|
238
|
+
|
|
239
|
+
Follow the Keep a Changelog format and generate entries automatically:
|
|
240
|
+
|
|
241
|
+
```bash
|
|
242
|
+
# With conventional commits, generate changelog automatically:
|
|
243
|
+
npx conventional-changelog-cli -p angular -i CHANGELOG.md -s
|
|
244
|
+
|
|
245
|
+
# Or with changesets:
|
|
246
|
+
npx changeset version # Updates CHANGELOG.md automatically
|
|
247
|
+
```
|
|
248
|
+
|
|
249
|
+
**Manual CHANGELOG structure:**
|
|
250
|
+
```markdown
|
|
251
|
+
# Changelog
|
|
252
|
+
|
|
253
|
+
All notable changes to this project will be documented in this file.
|
|
254
|
+
Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
|
|
255
|
+
Versioning: [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
|
|
256
|
+
|
|
257
|
+
## [Unreleased]
|
|
258
|
+
|
|
259
|
+
## [2.1.0] - 2024-03-15
|
|
260
|
+
|
|
261
|
+
### Added
|
|
262
|
+
- `parseConfigFile(path)` for file-based parsing
|
|
263
|
+
- `ParseOptions.maxSize` to limit input size
|
|
264
|
+
|
|
265
|
+
### Fixed
|
|
266
|
+
- `parseConfig()` no longer ignores `encoding` option for buffer inputs
|
|
267
|
+
|
|
268
|
+
## [2.0.0] - 2024-01-10
|
|
269
|
+
|
|
270
|
+
### Breaking Changes
|
|
271
|
+
- Removed `parse()` (deprecated in 1.5.0). Replacement: `parseConfig()`.
|
|
272
|
+
- `Config.timeout` is now milliseconds (was seconds). Multiply existing values × 1000.
|
|
273
|
+
- Dropped Node 16 support. Minimum: Node 18.
|
|
274
|
+
|
|
275
|
+
### Migration Guide
|
|
276
|
+
See: https://my-library.dev/guides/migration-v2
|
|
277
|
+
|
|
278
|
+
[Unreleased]: https://github.com/org/my-library/compare/v2.1.0...HEAD
|
|
279
|
+
[2.1.0]: https://github.com/org/my-library/compare/v2.0.0...v2.1.0
|
|
280
|
+
[2.0.0]: https://github.com/org/my-library/releases/tag/v2.0.0
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
### Git Tag Strategy
|
|
284
|
+
|
|
285
|
+
Tag every release:
|
|
286
|
+
```bash
|
|
287
|
+
# After publishing to npm
|
|
288
|
+
git tag v2.1.0
|
|
289
|
+
git push origin v2.1.0
|
|
290
|
+
|
|
291
|
+
# Or use npm version (updates package.json, commits, and tags)
|
|
292
|
+
npm version minor -m "chore(release): v%s"
|
|
293
|
+
git push && git push --tags
|
|
294
|
+
```
|
|
295
|
+
|
|
296
|
+
Tags are the source of truth for "what was published when." They enable:
|
|
297
|
+
- Reproducible builds from any historical version
|
|
298
|
+
- `git diff v2.0.0 v2.1.0` to review what changed between releases
|
|
299
|
+
- Automated changelog generation tools
|
|
300
|
+
- GitHub Release creation linked to the exact commit
|
|
@@ -0,0 +1,172 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ml-architecture
|
|
3
|
+
description: Training/serving architecture split, feature stores, model registry, online vs offline inference patterns, and ML system design decisions
|
|
4
|
+
topics: [ml, architecture, feature-store, model-registry, inference, serving, training]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
ML systems have a fundamental architectural split that traditional software does not: the training system and the serving system are different codebases running on different infrastructure, yet they must agree on the exact same data transformations. This training-serving skew is the most common source of silent production bugs in ML. Designing the architecture to prevent skew — through shared feature stores, shared preprocessing libraries, and strict interface contracts — is the most important ML architecture decision.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
ML architecture separates training (batch, data-intensive, experimental) from serving (latency-sensitive, stateless, reliable). Feature stores eliminate training-serving skew by providing consistent feature computation for both. Model registries provide the governance layer: versioning, lineage, deployment gates, and rollback. Choose online vs. offline inference based on latency requirements and data freshness needs. Document every architectural decision as an ADR.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Training/Serving Architecture Split
|
|
16
|
+
|
|
17
|
+
The training system and serving system have fundamentally different requirements:
|
|
18
|
+
|
|
19
|
+
| Dimension | Training | Serving |
|-----------|----------|---------|
| Throughput | High (process TB of data) | High (handle thousands of RPS) |
| Latency | Not time-critical | P99 < 200ms |
| Consistency | Best-effort | Exact reproducibility |
| Infrastructure | Spot instances, GPU clusters | Reserved instances, autoscaling |
| State | Stateful (checkpoint, resume) | Stateless (each request independent) |
| Failures | Retry / resume from checkpoint | Circuit breaker, fallback |
|
|
27
|
+
|
|
28
|
+
The primary risk of this split is **training-serving skew**: the model was trained with feature X computed one way, but production computes feature X a slightly different way, leading to silent accuracy degradation.
|
|
29
|
+
|
|
30
|
+
**Prevention strategies**:
|
|
31
|
+
1. **Shared feature library**: Both training and serving import the same `src/features/` module for all feature computation. Never duplicate feature logic.
|
|
32
|
+
2. **Feature store**: Centralised store that computes features once and serves them to both training (historical) and serving (real-time) paths.
|
|
33
|
+
3. **Schema validation**: Validate model inputs at serving time against the schema seen during training.
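Strategy 3 can be sketched in a few lines of Python. The schema contents and the names `TRAINING_SCHEMA` and `validate_request` are assumptions made for illustration; a real system would load the schema from the model's registry record rather than hard-code it.

```python
# Hypothetical schema recorded at training time: feature name -> expected type.
TRAINING_SCHEMA = {"amount": float, "country": str, "num_items": int}


def validate_request(features: dict, schema: dict = TRAINING_SCHEMA) -> list[str]:
    """Return a list of schema violations; an empty list means the input is valid."""
    errors = []
    for name, expected in schema.items():
        if name not in features:
            errors.append(f"missing feature: {name}")
        elif not isinstance(features[name], expected):
            actual = type(features[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Features the model never saw during training are also flagged.
    for name in sorted(features.keys() - schema.keys()):
        errors.append(f"unexpected feature: {name}")
    return errors
```

Rejecting or logging requests that fail this check turns silent skew into a visible error at the serving boundary.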
|
|
34
|
+
|
|
35
|
+
### Feature Store Architecture
|
|
36
|
+
|
|
37
|
+
A feature store is a data infrastructure component that stores and serves pre-computed features:
|
|
38
|
+
|
|
39
|
+
```
|
|
40
|
+
                    ┌─────────────────┐
Data Sources ──────►│ Feature Pipeline │
                    └────────┬────────┘
                             │ compute features
                    ┌────────▼────────┐
                    │  Feature Store  │
                    │  ┌───────────┐  │
                    │  │  Offline  │  │──► Training Data
                    │  │ (S3/GCS)  │  │
                    │  ├───────────┤  │
                    │  │  Online   │  │──► Serving (low-latency)
                    │  │  (Redis)  │  │
                    └─────────────────┘
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
**Offline store**: Historical features for training. Backed by object storage (S3, GCS) or a columnar database (BigQuery, Snowflake). Supports point-in-time correct feature retrieval (no data leakage).
|
|
56
|
+
|
|
57
|
+
**Online store**: Low-latency feature serving for real-time inference. Backed by Redis, DynamoDB, or Cassandra. Stores only the latest feature values.
|
|
58
|
+
|
|
59
|
+
**Implementations**: Feast (open source), Tecton, Vertex AI Feature Store, SageMaker Feature Store.
|
|
60
|
+
|
|
61
|
+
A feature store is justified when:
|
|
62
|
+
- Multiple models use the same features (DRY for features)
|
|
63
|
+
- Training-serving skew is causing production issues
|
|
64
|
+
- Feature computation is expensive and should be shared
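The point-in-time correctness mentioned for the offline store means that a training example may only see feature values that already existed at its label timestamp. A minimal Python sketch of that lookup follows; `point_in_time_value` and the `(timestamp, value)` history layout are assumptions for illustration, not the API of any particular feature store.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    index = bisect_right(timestamps, as_of)
    return history[index - 1][1] if index else None
```

Joining labels to features with "latest value overall" instead of "latest value as of the label time" leaks future information into training, which is the data leakage the offline store is designed to prevent.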
|
|
65
|
+
|
|
66
|
+
### Model Registry
|
|
67
|
+
|
|
68
|
+
The model registry is the governance layer between training and production. Every model that has ever been trained should have a registry record:
|
|
69
|
+
|
|
70
|
+
```
|
|
71
|
+
Training Run ──► Model Artifacts ──► Registry ──► Deployment
                 (weights, config)   (metadata,   (staging,
                                      lineage)     production)
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
**Registry metadata per model version**:
|
|
77
|
+
- Model name and semantic version (`fraud-detector-v2.3.1`)
|
|
78
|
+
- Training run ID and commit SHA (full lineage)
|
|
79
|
+
- Training metrics (AUC, F1, etc.)
|
|
80
|
+
- Evaluation metrics on holdout sets
|
|
81
|
+
- Training dataset version
|
|
82
|
+
- Model schema (input/output feature names and types)
|
|
83
|
+
- Serving requirements (runtime, memory, GPU)
|
|
84
|
+
- Promotion history (who promoted, when, why)
|
|
85
|
+
|
|
86
|
+
**Lifecycle stages**:
|
|
87
|
+
```
|
|
88
|
+
None → Staging → Production → Archived
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
- Validation gates control promotion: automated tests (accuracy thresholds, latency budgets) plus optional human approval
|
|
92
|
+
- Production always has at least one previous version for rollback
|
|
93
|
+
- Archive on a schedule (keep 6 months of production versions)
|
|
94
|
+
|
|
95
|
+
**Implementations**: MLflow Model Registry, Weights & Biases Model Registry, SageMaker Model Registry.
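The validation gates described above can be expressed as a small check that runs before a Staging model is promoted. This Python sketch is illustrative only: `Candidate`, `can_promote`, the AUC comparison, and the 200 ms default budget are assumptions, not the API of any of the registries listed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    auc: float                         # evaluation metric on the holdout set
    p99_latency_ms: float              # measured serving latency
    approved_by: Optional[str] = None  # optional human sign-off


def can_promote(candidate, production_auc, latency_budget_ms=200.0, require_approval=False):
    """Return (ok, reasons) for a Staging -> Production promotion."""
    reasons = []
    if candidate.auc < production_auc:
        reasons.append("AUC below current production model")
    if candidate.p99_latency_ms > latency_budget_ms:
        reasons.append("P99 latency exceeds budget")
    if require_approval and not candidate.approved_by:
        reasons.append("missing human approval")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict gives the promotion history something concrete to record when a candidate is rejected.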
|
|
96
|
+
|
|
97
|
+
### Online vs. Offline Inference
|
|
98
|
+
|
|
99
|
+
**Online inference** (real-time, synchronous):
|
|
100
|
+
- Model returns predictions in response to a request, typically within 100–500ms
|
|
101
|
+
- Examples: fraud scoring at checkout, recommendation on page load, search ranking
|
|
102
|
+
- Infrastructure: Model server (TorchServe, Triton, BentoML), autoscaled behind a load balancer
|
|
103
|
+
- Key considerations: latency budget, model size (must fit in serving memory), cold start time
|
|
104
|
+
|
|
105
|
+
**Offline inference** (batch, asynchronous):
|
|
106
|
+
- Model scores a large dataset on a schedule, stores predictions for later retrieval
|
|
107
|
+
- Examples: churn prediction for all users (run nightly), content pre-scoring for a recommendation cache
|
|
108
|
+
- Infrastructure: Spark job, Airflow DAG, Ray cluster, or simple Python script on a large VM
|
|
109
|
+
- Key considerations: throughput (records/second), data pipeline integration, prediction freshness
|
|
110
|
+
|
|
111
|
+
**Near-real-time / stream inference**:
|
|
112
|
+
- Model scores events from a stream (Kafka, Kinesis) with seconds-to-minutes latency
|
|
113
|
+
- Examples: anomaly detection on clickstream, session-level personalisation
|
|
114
|
+
- Infrastructure: Kafka consumer + model inference worker, Flink ML, or Spark Structured Streaming
|
|
115
|
+
- Key considerations: exactly-once semantics, ordering guarantees, backpressure handling
|
|
116
|
+
|
|
117
|
+
**Decision matrix**:
|
|
118
|
+
|
|
119
|
+
| Use Case | Latency Requirement | Data Freshness | Approach |
|----------|---------------------|----------------|----------|
| Checkout fraud | < 500ms | Real-time | Online inference |
| Churn prediction | N/A | Daily | Offline batch |
| Email personalisation | N/A (send time) | Daily | Offline batch |
| Feed ranking | < 200ms | Real-time | Online inference |
| Anomaly detection | < 30 seconds | Streaming | Stream inference |
|
|
126
|
+
|
|
127
|
+
### ML System Design Patterns
|
|
128
|
+
|
|
129
|
+
**Lambda Architecture for ML**:
|
|
130
|
+
- Batch layer: nightly model retraining or batch scoring
|
|
131
|
+
- Speed layer: real-time model updates or online scoring
|
|
132
|
+
- Serving layer: unified API serving precomputed (batch) + real-time predictions
|
|
133
|
+
- Complexity cost is high — evaluate whether the freshness gain justifies it
|
|
134
|
+
|
|
135
|
+
**Two-Stage Retrieval and Ranking** (recommendation systems; the retrieval stage is often a two-tower embedding model):
- Candidate generation stage: fast approximate nearest-neighbour retrieval (ANN index)
- Ranking stage: expensive full model scoring of top-K candidates
- Separates recall (retrieve thousands of candidates quickly) from precision (rank them accurately)
|
|
139
|
+
|
|
140
|
+
**Shadow Mode Deployment**:
|
|
141
|
+
- New model runs in parallel with production model, receiving real traffic
|
|
142
|
+
- New model's predictions are not served but are logged and evaluated
|
|
143
|
+
- Safe way to validate new models with real data before full deployment
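Shadow mode reduces to one rule: the shadow model sees the same input, but only the production result is returned. Below is a minimal synchronous Python sketch; `predict_with_shadow` is a hypothetical name and both models are assumed to be plain callables. A production setup would usually run the shadow call asynchronously so it cannot add latency.

```python
import logging

logger = logging.getLogger("shadow")


def predict_with_shadow(features, production_model, shadow_model):
    """Serve the production prediction and log the shadow prediction for comparison."""
    served = production_model(features)
    try:
        shadow = shadow_model(features)
        logger.info("shadow_compare served=%r shadow=%r", served, shadow)
    except Exception:
        # A failing shadow model must never affect the response served to users.
        logger.exception("shadow model failed")
    return served
```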
|
|
144
|
+
|
|
145
|
+
### Architecture Decision Record Template for ML
|
|
146
|
+
|
|
147
|
+
```markdown
|
|
148
|
+
# ADR-ML-001: Online vs. Offline Inference for Recommendation
|
|
149
|
+
|
|
150
|
+
## Status
|
|
151
|
+
Accepted — 2024-03-15
|
|
152
|
+
|
|
153
|
+
## Context
|
|
154
|
+
Product recommends items to users on homepage. 5M daily active users.
|
|
155
|
+
Data team produces updated user embeddings daily. Item catalog updates hourly.
|
|
156
|
+
|
|
157
|
+
## Decision
|
|
158
|
+
Use offline batch inference for recommendation.
|
|
159
|
+
- Nightly batch job scores candidate items for every user and keeps the top 100 per user
|
|
160
|
+
- Recommendations stored in Redis with TTL of 26 hours
|
|
161
|
+
- API reads from Redis — zero model inference at request time
|
|
162
|
+
|
|
163
|
+
## Consequences
|
|
164
|
+
- P99 homepage latency drops from 450ms to 20ms (Redis lookup vs. model inference)
|
|
165
|
+
- Recommendations are up to 26 hours stale — acceptable given content update frequency
|
|
166
|
+
- Requires Redis cluster (operational cost)
|
|
167
|
+
- Model update cycle is daily — cannot react to same-session behaviour
|
|
168
|
+
|
|
169
|
+
## Alternatives Rejected
|
|
170
|
+
- Online inference: latency budget exceeded (model P99 = 380ms)
|
|
171
|
+
- Streaming inference: engineering complexity not justified given daily embedding update cadence
|
|
172
|
+
```
|
|
@@ -0,0 +1,209 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ml-conventions
|
|
3
|
+
description: Experiment naming, model versioning, reproducibility via random seeds, config-as-code patterns, and team conventions for ML projects
|
|
4
|
+
topics: [ml, conventions, reproducibility, versioning, config, experiments]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
ML projects without conventions degenerate into chaos within weeks: unnamed experiments with lost hyperparameters, models named `model_v2_final_FINAL.pkl`, and results that cannot be reproduced. Unlike software engineering, where the compiler enforces structure, ML workflows are loose scripts and notebooks that require disciplined conventions to remain comprehensible. Establish these conventions at project start and encode them in tooling so they are followed by default, not by willpower.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
ML conventions cover experiment naming (structured, searchable identifiers), model versioning (semantic or content-addressed), reproducibility (seeding all random sources, recording environment), and config-as-code (no magic numbers in code, all hyperparameters in config files). These conventions are not optional hygiene — they are the infrastructure that makes ML engineering a repeatable discipline rather than a research lottery.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Experiment Naming
|
|
16
|
+
|
|
17
|
+
Every training run that produces artifacts or results must have a unique, human-readable identifier. Ad-hoc names like `test`, `v2`, or `new_model` are unusable at scale:
|
|
18
|
+
|
|
19
|
+
**Recommended format**: `{model_type}-{dataset}-{date}-{purpose}[-{variant}]`
|
|
20
|
+
|
|
21
|
+
Examples:
|
|
22
|
+
- `resnet50-imagenet-20240315-baseline`
|
|
23
|
+
- `bert-sst2-20240315-lr-sweep`
|
|
24
|
+
- `xgboost-churn-20240320-feature-v3`
|
|
25
|
+
- `gpt2-reviews-20240322-dropout-ablation`
|
|
26
|
+
|
|
27
|
+
**Rules**:
|
|
28
|
+
- All lowercase, hyphen-separated (no spaces, no underscores)
|
|
29
|
+
- Date in `YYYYMMDD` format (sorts chronologically)
|
|
30
|
+
- Purpose is human-readable and specific — not `experiment1` or `test`
|
|
31
|
+
- Variant suffix for ablations and sweeps (`-v2`, `-no-dropout`, `-lr-1e-3`)
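Encoding the rules above in a helper keeps names consistent without relying on memory. The following Python sketch builds and validates a name in the recommended format; `experiment_name` and `NAME_PATTERN` are hypothetical names, and the pattern assumes the model type and dataset segments contain no hyphens.

```python
import re
from datetime import date

# Lowercase alphanumeric segments joined by hyphens, with an 8-digit date as the third segment.
NAME_PATTERN = re.compile(r"^[a-z0-9]+-[a-z0-9]+-\d{8}-[a-z0-9]+(?:-[a-z0-9]+)*$")


def experiment_name(model_type, dataset, purpose, run_date, variant=None):
    """Build `{model_type}-{dataset}-{date}-{purpose}[-{variant}]` and validate it."""
    parts = [model_type, dataset, run_date.strftime("%Y%m%d"), purpose]
    if variant:
        parts.append(variant)
    name = "-".join(parts)
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid experiment name: {name!r}")
    return name
```

Calling this from the training entry point means an uppercase or underscore-separated name fails fast instead of ending up in the tracking server.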
|
|
32
|
+
|
|
33
|
+
Many teams rely on auto-generated identifiers (MLflow, for example, assigns each run a UUID-based run ID) and use tags and metadata for search. This is fine as a secondary system, but always add a human-readable display name.
|
|
34
|
+
|
|
35
|
+
### Model Versioning
|
|
36
|
+
|
|
37
|
+
Model versioning is distinct from experiment tracking. A version is a production artifact; an experiment is a training run:
|
|
38
|
+
|
|
39
|
+
**Semantic versioning for models**:
|
|
40
|
+
- `v{major}.{minor}.{patch}` — consistent with software versioning
|
|
41
|
+
- Major: Breaking change in model interface (input/output schema, preprocessing contract)
|
|
42
|
+
- Minor: Meaningful accuracy improvement or new feature support
|
|
43
|
+
- Patch: Bug fix, minor data update, no interface change
|
|
44
|
+
|
|
45
|
+
**Content-addressed versioning** (used by MLflow Model Registry, DVC):
|
|
46
|
+
- Models are identified by a hash of their weights + config
|
|
47
|
+
- Prevents accidental overwriting
|
|
48
|
+
- Enables exact reproducibility — "which weights produced this prediction?"
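
To illustrate what "a hash of their weights + config" can look like, here is a minimal sketch using only the standard library. The function name `model_content_id` and the 12-character truncation are assumptions for illustration; tools such as DVC compute their own hashes and do not use this function.

```python
import hashlib
import json


def model_content_id(weights: bytes, config: dict, length: int = 12) -> str:
    """Derive a short identifier from the SHA-256 of weights plus config."""
    digest = hashlib.sha256()
    # Prefix the weights with their length so the weights/config boundary is unambiguous.
    digest.update(len(weights).to_bytes(8, "big"))
    digest.update(weights)
    # Serialize config with sorted keys so key order does not change the hash.
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:length]


model_id = model_content_id(b"\x00\x01\x02", {"lr": 0.001, "layers": 50})
```

Identical weights and config always map to the same identifier, so a changed identifier signals that either the weights or the config changed.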

**Registry-based model lifecycle**:
```
Staging → Validation → Production → Archived
```
- Never promote directly to Production; always pass through Staging validation
- Keep at least one previous Production version for instant rollback
- Document promotion reason: "Promoted: +2.3% AUC on Q1 eval set, latency within budget"
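
The lifecycle rules above amount to a small state machine. The sketch below is one way to model it, assuming a strictly linear lifecycle; `ModelVersion` and `NEXT_STAGE` are hypothetical names and not the API of MLflow or any other registry.

```python
from dataclasses import dataclass, field

# Only the next stage in the lifecycle is reachable; skipping stages is impossible.
NEXT_STAGE = {
    "Staging": "Validation",
    "Validation": "Production",
    "Production": "Archived",
}


@dataclass
class ModelVersion:
    version: str
    stage: str = "Staging"
    history: list = field(default_factory=list)

    def promote(self, reason: str) -> str:
        """Advance one stage and record why; a reason is mandatory."""
        if not reason.strip():
            raise ValueError("a promotion reason is required")
        if self.stage not in NEXT_STAGE:
            raise ValueError(f"{self.stage} is a terminal stage")
        new_stage = NEXT_STAGE[self.stage]
        self.history.append((self.stage, new_stage, reason))
        self.stage = new_stage
        return new_stage


candidate = ModelVersion("v1.2.0")
candidate.promote("Passed smoke tests on staging traffic")
candidate.promote("Promoted: +2.3% AUC on Q1 eval set, latency within budget")
```

Storing the reason with every transition keeps the promotion history auditable when a rollback is being considered.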

### Reproducibility

ML reproducibility means: given the same code, data, and config, the same model is produced. Achieve it through four controls:

**1. Random seed management**

Set all random sources before any computation:
```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch (CPU and all CUDA devices)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # For full determinism (may impact performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Record the seed in experiment config. Default seed: `42` (or any fixed value; consistency matters more than the value). When running hyperparameter sweeps with multiple seeds, record all seeds and report mean ± std.
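
Reporting mean ± std across seeds needs nothing beyond the standard library. A minimal sketch, assuming a hypothetical `summarize_seeds` helper and a metric where one score is recorded per seed:

```python
import statistics


def summarize_seeds(scores_by_seed: dict[int, float]) -> str:
    """Report mean ± sample std and list every seed that contributed."""
    scores = list(scores_by_seed.values())
    mean = statistics.mean(scores)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    seeds = ", ".join(str(s) for s in sorted(scores_by_seed))
    return f"{mean:.4f} ± {std:.4f} (n={len(scores)}, seeds: {seeds})"


summary = summarize_seeds({42: 0.90, 43: 0.92, 44: 0.94})
```

Including the seed list in the summary keeps the reported number traceable to the exact runs behind it.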

**2. Dependency pinning**

Pin all dependencies to exact versions:
```toml
# pyproject.toml (Poetry)
[tool.poetry.dependencies]
python = "3.11.4"
torch = "2.1.0"
transformers = "4.35.2"
numpy = "1.26.0"
```

Use `poetry.lock` or `requirements.txt` generated by `pip freeze`. Never use unpinned dependencies (`torch>=2.0`) in a training environment.
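
A pinning rule is easier to keep when it is checked automatically. The sketch below flags requirement lines that are not pinned with `==`; `find_unpinned` is a hypothetical helper, and the pattern is a simplification that does not cover every requirement syntax (for example, `name @ url` lines are reported as unpinned).

```python
import re

# A pinned line looks like `name==1.2.3`, optionally with extras such as `name[extra]==1.2.3`.
_PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;]+")


def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with `==`."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned


problems = find_unpinned("torch==2.1.0\nnumpy>=1.26\n# tooling\ntransformers\n")
```

Running a check like this in CI turns "never use unpinned dependencies" from a guideline into a failing build.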

**3. Data versioning**

Record the exact dataset version used for each training run:
- DVC: content-addressed data with `dvc add` and `.dvc` pointers
- Dataset registry: log dataset name + version + hash in experiment metadata
- SQL-based datasets: log the query hash and execution timestamp
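
For teams not using DVC, the hashes mentioned above can be computed directly. This is a minimal sketch; the helper names `file_sha256` and `query_record` and the returned field names are assumptions for illustration, not part of any tool.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a dataset file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def query_record(sql: str) -> dict:
    """Record a query hash and UTC execution timestamp for SQL-based datasets."""
    normalized = " ".join(sql.split())  # collapse whitespace before hashing
    return {
        "query_sha256": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }
```

Note that a query hash only identifies the query text. If the underlying tables change between runs, the same query can return different rows, which is why the timestamp is logged alongside it.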

**4. Environment reproducibility**

Capture the full environment:
```bash
# Save environment
conda env export > environment.yml
pip freeze > requirements-frozen.txt

# Record GPU driver and CUDA version
nvidia-smi --query-gpu=driver_version,name --format=csv
nvcc --version
```

For full environment isolation, use Docker. The Dockerfile is the environment specification.
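
The same information can also be captured from inside the training process, so it is stored next to the run's other metadata. A minimal sketch using only the standard library; `capture_environment` and its field names are hypothetical, and GPU details are left to the shell commands above.

```python
import platform
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Collect interpreter, OS, and installed package versions as a dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


environment = capture_environment()
```

The resulting dict is JSON-serializable, so it can be logged as an artifact or attached to the experiment record.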

### Config-as-Code

No magic numbers in code. Every hyperparameter, data path, and training setting belongs in a config file.

**Bad** (magic numbers scattered in code):
```python
optimizer = Adam(model.parameters(), lr=0.001)
scheduler = CosineAnnealingLR(optimizer, T_max=100)
train_loader = DataLoader(dataset, batch_size=32, num_workers=4)
```

**Good** (config-driven):
```yaml
# configs/train.yaml
training:
  seed: 42
  epochs: 100
  batch_size: 32
  num_workers: 4

optimizer:
  type: adam
  lr: 1.0e-3
  weight_decay: 1.0e-4

scheduler:
  type: cosine_annealing
  t_max: 100
```

```python
# src/training/train.py
from omegaconf import DictConfig

def train(cfg: DictConfig) -> None:
    # set_seed, build_optimizer, and build_scheduler are project helpers (not shown);
    # `model` is assumed to be constructed elsewhere.
    set_seed(cfg.training.seed)
    optimizer = build_optimizer(model, cfg.optimizer)
    scheduler = build_scheduler(optimizer, cfg.scheduler)
```

Use **Hydra** (Meta) or **OmegaConf** for hierarchical config management with CLI override support:
```bash
# Override from CLI without changing config files
python train.py optimizer.lr=1e-4 training.batch_size=64
```
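
To show what a dotted override such as `optimizer.lr=1e-4` does to a nested config, here is a standard-library sketch. It is not how Hydra or OmegaConf are implemented, and `apply_overrides` is a hypothetical helper; in a real project, prefer the libraries' own override handling.

```python
import ast
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with `dotted.key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node[key]  # KeyError here means the override path is wrong
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        try:
            node[leaf] = ast.literal_eval(raw_value)  # numbers, booleans, etc.
        except (ValueError, SyntaxError):
            node[leaf] = raw_value  # keep as a plain string
    return result


base = {"optimizer": {"type": "adam", "lr": 1e-3}, "training": {"batch_size": 32}}
run_cfg = apply_overrides(base, ["optimizer.lr=1e-4", "training.batch_size=64"])
```

Rejecting unknown keys catches typos such as `optimzer.lr=1e-4`, which would otherwise silently leave the default in place.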

**Config file organization**:
```
configs/
  base.yaml          # Default config for all experiments
  model/
    resnet50.yaml
    vit-b16.yaml
  data/
    imagenet.yaml
    cifar10.yaml
  training/
    fast.yaml        # Low-epoch for debugging
    full.yaml        # Production training
```
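
A layout like this relies on layering: a group file such as `training/fast.yaml` overrides only the keys it names and inherits the rest from `base.yaml`. The sketch below illustrates that merge with plain dicts; `deep_merge` is a hypothetical helper, and the dict contents stand in for file contents that the original does not show.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical contents of configs/base.yaml and configs/training/fast.yaml
base_cfg = {"training": {"seed": 42, "epochs": 100, "batch_size": 32}}
fast_cfg = {"training": {"epochs": 2}}
debug_cfg = deep_merge(base_cfg, fast_cfg)
```

Hydra and OmegaConf provide this kind of composition themselves; the sketch only shows the semantics to expect.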

### Code and Notebook Conventions

**Notebooks are for exploration, not production**:
- Notebooks belong in `notebooks/`, never in `src/`
- Notebook outputs must be cleared before committing (no large outputs committed to git)
- Meaningful results from notebooks are refactored into `src/` modules with tests
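
Clearing outputs can be automated, for example from a pre-commit hook. The sketch below edits the notebook JSON directly and assumes the version 4 notebook format, where code cells carry `outputs` and `execution_count`; `clear_notebook_outputs` is a hypothetical helper, and dedicated tools such as nbstripout handle more cases.

```python
import json
from pathlib import Path


def clear_notebook_outputs(path: Path) -> int:
    """Remove outputs and execution counts from code cells; return cells changed."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    changed = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("outputs") or cell.get("execution_count") is not None:
            cell["outputs"] = []
            cell["execution_count"] = None
            changed += 1
    path.write_text(json.dumps(notebook, indent=1) + "\n", encoding="utf-8")
    return changed
```

A nonzero return value means the notebook was committed-unsafe before the call, which a hook can use to fail the commit and ask for a re-stage.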

**Module structure conventions**:
- `src/data/`: dataset classes, data loaders, preprocessing transforms
- `src/models/`: model architectures (no training logic)
- `src/training/`: training loop, loss functions, callbacks
- `src/evaluation/`: metrics, evaluation runners
- `src/serving/`: inference code, prediction pipelines

**Naming conventions**:
- Files: `snake_case.py`
- Classes: `PascalCase` (e.g., `ResNet50Classifier`, `ChurnDataset`)
- Functions: `snake_case` (e.g., `compute_f1_score`, `load_checkpoint`)
- Constants: `UPPER_SNAKE_CASE` (e.g., `MAX_SEQ_LENGTH = 512`)
- Config keys: `snake_case` in YAML

### Checklist Before Starting a Training Run

```
[ ] Experiment name follows naming convention
[ ] Random seed set and recorded in config
[ ] Config file committed (not just command-line overrides)
[ ] Dataset version recorded
[ ] Experiment tracker (MLflow/W&B) initialized with run metadata
[ ] Code committed to git (note the commit SHA in the experiment)
[ ] Output directory created and named consistently
[ ] Hardware/environment recorded
```
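
Most of this checklist can be turned into a pre-flight check that runs before training starts. A minimal sketch, assuming run metadata is collected in a dict; the field names in `REQUIRED_FIELDS` and the helper `missing_run_metadata` are hypothetical and should be adapted to whatever the experiment tracker records.

```python
# Hypothetical metadata fields, one per checklist item that can be verified from data.
REQUIRED_FIELDS = (
    "experiment_name", "seed", "config_path", "dataset_version",
    "git_sha", "output_dir", "environment",
)


def missing_run_metadata(metadata: dict) -> list[str]:
    """Return checklist fields that are absent or empty in the run metadata."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = metadata.get(name)
        # A seed of 0 is valid, so test for emptiness explicitly instead of truthiness.
        if value is None or value == "" or value == {} or value == []:
            missing.append(name)
    return missing


run_metadata = {
    "experiment_name": "resnet50-imagenet-20240315-baseline",
    "seed": 0,
    "config_path": "configs/train.yaml",
    "dataset_version": "",
    "output_dir": "outputs/resnet50-imagenet-20240315-baseline",
    "environment": {"python": "3.11.4"},
}
gaps = missing_run_metadata(run_metadata)
```

Aborting the run when the returned list is non-empty is cheaper than discovering a missing commit SHA after a multi-day training job.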