@zigrivers/scaffold 3.8.0 → 3.9.1
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +75 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +167 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
@@ -0,0 +1,209 @@
---
name: ml-conventions
description: Experiment naming, model versioning, reproducibility via random seeds, config-as-code patterns, and team conventions for ML projects
topics: [ml, conventions, reproducibility, versioning, config, experiments]
---

ML projects without conventions degenerate into chaos within weeks: unnamed experiments with lost hyperparameters, models named `model_v2_final_FINAL.pkl`, and results that cannot be reproduced. Unlike software engineering, where the compiler enforces structure, ML workflows are loose scripts and notebooks that require disciplined conventions to remain comprehensible. Establish these conventions at project start and encode them in tooling so they are followed by default rather than by willpower.

## Summary

ML conventions cover experiment naming (structured, searchable identifiers), model versioning (semantic or content-addressed), reproducibility (seeding all random sources, recording environment), and config-as-code (no magic numbers in code, all hyperparameters in config files). These conventions are not optional hygiene: they are the infrastructure that makes ML engineering a repeatable discipline rather than a research lottery.

## Deep Guidance

### Experiment Naming

Every training run that produces artifacts or results must have a unique, human-readable identifier. Ad-hoc names like `test`, `v2`, or `new_model` are unusable at scale.

**Recommended format**: `{model_type}-{dataset}-{date}-{purpose}[-{variant}]`

Examples:
- `resnet50-imagenet-20240315-baseline`
- `bert-sst2-20240315-lr-sweep`
- `xgboost-churn-20240320-feature-v3`
- `gpt2-reviews-20240322-dropout-ablation`

**Rules**:
- All lowercase, hyphen-separated (no spaces, no underscores)
- Date in `YYYYMMDD` format (sorts chronologically)
- Purpose is human-readable and specific, not `experiment1` or `test`
- Variant suffix for ablations and sweeps (`-v2`, `-no-dropout`, `-lr-1e-3`)

Many teams use auto-generated experiment IDs (MLflow assigns UUID-based IDs automatically) and rely on tagging/metadata for search. This is fine as a secondary system, but always add a human-readable display name.
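
The naming format and rules above could be enforced in tooling rather than by review. The following is a minimal sketch, not part of any tracking library: the helper names `build_experiment_name` and `is_valid_experiment_name` are hypothetical, and the validator only checks the character rules and the presence of a valid `YYYYMMDD` token.

```python
import re
from datetime import date, datetime
from typing import Optional

# One name part: lowercase letters and digits, optionally hyphen-separated.
_PART = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def build_experiment_name(
    model_type: str,
    dataset: str,
    purpose: str,
    run_date: date,
    variant: Optional[str] = None,
) -> str:
    """Assemble {model_type}-{dataset}-{date}-{purpose}[-{variant}]."""
    parts = [model_type, dataset, run_date.strftime("%Y%m%d"), purpose]
    if variant:
        parts.append(variant)
    for part in parts:
        if not _PART.fullmatch(part):
            raise ValueError(f"invalid name part: {part!r}")
    return "-".join(parts)


def is_valid_experiment_name(name: str) -> bool:
    """Check the character rules and require one valid YYYYMMDD token."""
    if not _PART.fullmatch(name):
        return False
    for token in name.split("-"):
        if re.fullmatch(r"\d{8}", token):
            try:
                datetime.strptime(token, "%Y%m%d")
                return True
            except ValueError:
                continue
    return False
```

Calling `build_experiment_name("resnet50", "imagenet", "baseline", date(2024, 3, 15))` yields `resnet50-imagenet-20240315-baseline`, matching the first example in the list above.
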

### Model Versioning

Model versioning is distinct from experiment tracking. A version is a production artifact; an experiment is a training run.

**Semantic versioning for models**:
- `v{major}.{minor}.{patch}`, consistent with software versioning
- Major: Breaking change in model interface (input/output schema, preprocessing contract)
- Minor: Meaningful accuracy improvement or new feature support
- Patch: Bug fix, minor data update, no interface change

**Content-addressed versioning** (used by DVC; MLflow Model Registry assigns incrementing version numbers instead):
- Models are identified by a hash of their weights + config
- Prevents accidental overwriting
- Enables exact reproducibility — "which weights produced this prediction?"
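
As an illustration of content addressing, a version identifier can be derived from the serialized weights plus the config. This is a sketch under assumptions: `content_version` is a hypothetical helper, the weights are assumed to be available as bytes, and the config is assumed to be JSON-serializable. It is not how DVC computes its hashes.

```python
import hashlib
import json


def content_version(weights: bytes, config: dict, length: int = 12) -> str:
    """Return a short hex identifier derived from weights and config."""
    digest = hashlib.sha256()
    digest.update(weights)
    digest.update(b"\x00")  # separator so weights and config cannot blur together
    # Sort keys so that dict ordering does not change the identifier.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()[:length]
```

Identical weights and config always map to the same identifier, while any change to either produces a different one, which is what lets a prediction be traced back to exact weights.
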

**Registry-based model lifecycle**:
```
Staging → Validation → Production → Archived
```
- Never promote directly to Production; always pass through Staging validation
- Keep at least one previous Production version for instant rollback
- Document promotion reason: "Promoted: +2.3% AUC on Q1 eval set, latency within budget"
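
The lifecycle rules above could be checked in code before a registry call is made. The sketch below is hypothetical and independent of any registry API. The transition table is an assumption: it follows the arrow order shown above and additionally allows archiving from any stage.

```python
# Allowed stage transitions (assumption: archiving is possible from any stage).
ALLOWED_TRANSITIONS = {
    "Staging": {"Validation", "Archived"},
    "Validation": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


def promote(current: str, target: str, reason: str) -> dict:
    """Validate a stage change and return a record of the promotion."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"transition {current} -> {target} is not allowed")
    if not reason.strip():
        raise ValueError("a promotion reason is required")
    return {"from": current, "to": target, "reason": reason}
```
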

### Reproducibility

ML reproducibility means: given the same code, data, and config, the same model is produced. Achieve it through four controls:

**1. Random seed management**

Set all random sources before any computation:
```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # For full determinism (may impact performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Record the seed in experiment config. Default seed: `42` (or any fixed value — consistency matters more than the value). When running hyperparameter sweeps with multiple seeds, record all seeds and report mean ± std.
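
Reporting mean ± std across seeds can be done with the standard library alone. This is a minimal sketch: `summarize_runs` is a hypothetical helper, and it assumes one scalar metric per seed and uses the sample standard deviation.

```python
import statistics


def summarize_runs(metric_by_seed: dict[int, float]) -> str:
    """Format a metric as mean ± sample std, listing the seeds used."""
    values = list(metric_by_seed.values())
    mean = statistics.mean(values)
    # Sample standard deviation is undefined for a single run.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    seeds = ", ".join(str(seed) for seed in sorted(metric_by_seed))
    return f"{mean:.4f} ± {std:.4f} (n={len(values)}, seeds: {seeds})"
```

For example, `summarize_runs({0: 0.90, 1: 0.92, 2: 0.94})` returns `0.9200 ± 0.0200 (n=3, seeds: 0, 1, 2)`.
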

**2. Dependency pinning**

Pin all dependencies to exact versions:
```toml
# pyproject.toml (Poetry)
[tool.poetry.dependencies]
python = "3.11.4"
torch = "2.1.0"
transformers = "4.35.2"
numpy = "1.26.0"
```

Use `poetry.lock` or a `requirements.txt` generated by `pip freeze`. Never use unpinned dependencies (`torch>=2.0`) in a training environment.

**3. Data versioning**

Record the exact dataset version used for each training run:
- DVC: content-addressed data with `dvc add` and `.dvc` pointers
- Dataset registry: log dataset name + version + hash in experiment metadata
- SQL-based datasets: log the query hash and execution timestamp
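
For SQL-based datasets, the query hash and execution timestamp can be produced with a few lines of standard-library code. This is a sketch: `query_fingerprint` is a hypothetical helper, and collapsing whitespace before hashing is an assumption made so that reformatting a query does not change its hash.

```python
import hashlib
from datetime import datetime, timezone


def query_fingerprint(sql: str) -> dict:
    """Return metadata identifying a dataset-producing SQL query."""
    # Collapse runs of whitespace so formatting changes keep the same hash.
    normalized = " ".join(sql.split())
    return {
        "query_sha256": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }
```

The resulting dictionary can be logged alongside the other run metadata in whichever experiment tracker the project uses.
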

**4. Environment reproducibility**

Capture the full environment:
```bash
# Save environment
conda env export > environment.yml
pip freeze > requirements-frozen.txt

# Record GPU driver and CUDA version
nvidia-smi --query-gpu=driver_version,name --format=csv
nvcc --version
```

For full environment isolation, use Docker. The Dockerfile is the environment specification.
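
The shell commands above capture the environment from outside the run. Similar details can also be recorded from inside the training process, as in the sketch below. The function names and the shape of the snapshot are assumptions, and the git lookup falls back to `None` when git or a repository is unavailable.

```python
import json
import platform
import subprocess
import sys
from importlib import metadata
from typing import Optional


def git_commit() -> Optional[str]:
    """Return the current commit SHA, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def environment_snapshot() -> dict:
    """Collect interpreter, OS, installed packages, and commit SHA."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            packages.add(f"{name}=={dist.version}")
    return {
        "python": platform.python_version(),
        "executable": sys.executable,
        "platform": platform.platform(),
        "git_commit": git_commit(),
        "packages": sorted(packages),
    }


if __name__ == "__main__":
    print(json.dumps(environment_snapshot(), indent=2))
```
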

### Config-as-Code

No magic numbers in code. Every hyperparameter, data path, and training setting belongs in a config file.

**Bad** (magic numbers scattered in code):
```python
optimizer = Adam(model.parameters(), lr=0.001)
scheduler = CosineAnnealingLR(optimizer, T_max=100)
train_loader = DataLoader(dataset, batch_size=32, num_workers=4)
```

**Good** (config-driven):
```yaml
# configs/train.yaml
training:
  seed: 42
  epochs: 100
  batch_size: 32
  num_workers: 4

optimizer:
  type: adam
  lr: 1.0e-3
  weight_decay: 1.0e-4

scheduler:
  type: cosine_annealing
  t_max: 100
```

```python
# src/training/train.py
from omegaconf import DictConfig

def train(cfg: DictConfig) -> None:
    set_seed(cfg.training.seed)
    # `model`, `build_optimizer`, and `build_scheduler` are project-specific
    # and assumed to be defined elsewhere in the codebase.
    optimizer = build_optimizer(model, cfg.optimizer)
    scheduler = build_scheduler(optimizer, cfg.scheduler)
```

Use **Hydra** (Meta) or **OmegaConf** for hierarchical config management with CLI override support:
```bash
# Override from CLI without changing config files
python train.py optimizer.lr=1e-4 training.batch_size=64
```
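
To make the override mechanism concrete, here is a standard-library sketch of how dotted `key=value` overrides can be applied to a nested config. It only illustrates the idea and is not how Hydra or OmegaConf are implemented; `apply_overrides` is a hypothetical helper.

```python
import ast
from copy import deepcopy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with dotted key=value overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        try:
            # Parse numbers, booleans, and lists; keep anything else as a string.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value
        node[leaf] = value
    return result
```

With a base config of `{"optimizer": {"lr": 1e-3}, "training": {"batch_size": 32}}`, the two overrides from the command above set `optimizer.lr` to `0.0001` and `training.batch_size` to `64`, leaving the base config untouched.
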

**Config file organization**:
```
configs/
  base.yaml              # Default config for all experiments
  model/
    resnet50.yaml
    vit-b16.yaml
  data/
    imagenet.yaml
    cifar10.yaml
  training/
    fast.yaml            # Low-epoch for debugging
    full.yaml            # Production training
```

### Code and Notebook Conventions

**Notebooks are for exploration, not production**:
- Notebooks belong in `notebooks/`, never in `src/`
- Notebooks must be cleared before committing (no large outputs committed to git)
- Meaningful results from notebooks are refactored into `src/` modules with tests

**Module structure conventions**:
- `src/data/`: dataset classes, data loaders, preprocessing transforms
- `src/models/`: model architectures (no training logic)
- `src/training/`: training loop, loss functions, callbacks
- `src/evaluation/`: metrics, evaluation runners
- `src/serving/`: inference code, prediction pipelines

**Naming conventions**:
- Files: `snake_case.py`
- Classes: `PascalCase` (e.g., `ResNet50Classifier`, `ChurnDataset`)
- Functions: `snake_case` (e.g., `compute_f1_score`, `load_checkpoint`)
- Constants: `UPPER_SNAKE_CASE` (e.g., `MAX_SEQ_LENGTH = 512`)
- Config keys: `snake_case` in YAML

### Checklist Before Starting a Training Run

```
[ ] Experiment name follows naming convention
[ ] Random seed set and recorded in config
[ ] Config file committed (not just command-line overrides)
[ ] Dataset version recorded
[ ] Experiment tracker (MLflow/W&B) initialized with run metadata
[ ] Code committed to git (note the commit SHA in the experiment)
[ ] Output directory created and named consistently
[ ] Hardware/environment recorded
```
@@ -0,0 +1,299 @@
---
name: ml-dev-environment
description: Conda/Poetry environment setup, Jupyter integration, GPU detection and configuration, and Docker for reproducible ML development
topics: [ml, dev-environment, conda, poetry, jupyter, gpu, docker, reproducibility]
---

ML development environments have more complexity than typical software projects: GPU drivers, CUDA toolkits, Python packages with native extensions, and Jupyter notebook infrastructure all need to align. A broken environment costs hours and blocks the whole team. Invest in environment standardisation upfront: the payoff is that every team member can reproduce results and that CI pipelines match local runs.

## Summary

Prefer Conda for ML projects when GPU and CUDA management is required; use Poetry for pure-Python projects or as the Python dependency manager on top of Conda. Configure Jupyter as a managed service rather than ad-hoc invocations. Detect GPU availability programmatically and handle CPU fallback gracefully. Use Docker to capture the full environment for reproducible training runs and production serving.

## Deep Guidance

### Conda vs. Poetry: When to Use Each

**Conda** is the right choice when:
- You manage CUDA toolkit and cuDNN versions per project (Conda can install the CUDA toolkit without root; the NVIDIA driver itself still has to be installed on the host)
- You work with packages that have complex native dependencies (PyTorch, TensorFlow, OpenCV)
- You need to isolate the Python version itself (not just packages)
- The team uses multiple ML frameworks with conflicting dependencies

**Poetry** is the right choice when:
- The project is pure Python, or all native dependencies are available via pip
- You need strict dependency locking and reproducible installs
- You are publishing a library (Poetry handles packaging well)
- You already use a Conda environment for CUDA and want finer control over Python packages

**Common hybrid pattern**: Conda manages Python version and CUDA; Poetry manages Python package dependencies inside the Conda environment.

### Conda Environment Setup

```yaml
# environment.yml (commit to git)
name: myproject
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - cuda-toolkit=12.1
  - cudnn=8.9
  - pip>=23.0
  - pip:
    # The +cu121 wheels are served from the PyTorch package index, not PyPI
    - --extra-index-url https://download.pytorch.org/whl/cu121
    - torch==2.1.0+cu121
    - torchvision==0.16.0+cu121
    - -r requirements.txt  # or use pyproject.toml
```

```bash
# Create and activate
conda env create -f environment.yml
conda activate myproject

# Update after environment.yml changes
conda env update -f environment.yml --prune

# Export current state (for exact reproducibility audit)
conda env export > environment-lock.yml
```

**Critical**: Pin exact versions in `environment.yml`. `pytorch>=2.0` is not a reproducible spec.

### Poetry Setup (Python Dependencies)

```bash
# Initialize
poetry init

# Add dependencies
poetry add torch==2.1.0 transformers==4.35.2
poetry add --group dev pytest black mypy

# Install (creates a virtualenv; set virtualenvs.in-project to true to place it in .venv)
poetry install

# Run in the managed venv
poetry run python train.py
poetry run pytest
```

`pyproject.toml` example:
```toml
[tool.poetry]
name = "myproject"
version = "0.1.0"
description = "ML project"

[tool.poetry.dependencies]
python = "^3.11"
torch = "2.1.0"
transformers = "4.35.2"
hydra-core = "1.3.2"
mlflow = "2.9.2"

[tool.poetry.group.dev.dependencies]
pytest = "7.4.3"
black = "23.11.0"
mypy = "1.7.0"
nbstripout = "0.6.1"
```

### GPU Detection and Configuration

Always detect GPU availability at runtime and handle CPU fallback:

```python
# src/utils/device.py
import torch
import logging

logger = logging.getLogger(__name__)

def get_device(prefer_gpu: bool = True) -> torch.device:
    """Return the best available device with logging."""
    if prefer_gpu and torch.cuda.is_available():
        device = torch.device("cuda")
        gpu_name = torch.cuda.get_device_name(0)
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
        logger.info(f"Using GPU: {gpu_name} ({gpu_memory:.1f} GB)")
    elif prefer_gpu and torch.backends.mps.is_available():
        # Apple Silicon
        device = torch.device("mps")
        logger.info("Using Apple MPS device")
    else:
        device = torch.device("cpu")
        logger.info("Using CPU (GPU not available or not requested)")
    return device

def log_gpu_memory() -> None:
    """Log current GPU memory usage."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        logger.debug(f"GPU memory: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")
```

**CUDA version compatibility**: PyTorch packages are built against specific CUDA versions. Always match:

| PyTorch | CUDA | cuDNN |
|---------|------|-------|
| 2.1.x | 12.1, 11.8 | 8.x |
| 2.0.x | 11.7, 11.8 | 8.x |

Check compatibility at pytorch.org before pinning.
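
The compatibility table above can be turned into a small startup check. This is a sketch: `SUPPORTED_CUDA` only mirrors the two rows listed in this document (it is not an authoritative matrix), and `cuda_is_supported` is a hypothetical helper.

```python
# Mirrors the table in this document; extend it after checking pytorch.org.
SUPPORTED_CUDA = {
    "2.1": {"12.1", "11.8"},
    "2.0": {"11.7", "11.8"},
}


def cuda_is_supported(torch_version: str, cuda_version: str) -> bool:
    """Check a PyTorch/CUDA pair against the table above."""
    # Drop any local build suffix such as "+cu121", then keep major.minor.
    base = torch_version.split("+")[0]
    major_minor = ".".join(base.split(".")[:2])
    return cuda_version in SUPPORTED_CUDA.get(major_minor, set())
```

With PyTorch installed, the inputs would typically come from `torch.__version__` and `torch.version.cuda`, for example `cuda_is_supported(torch.__version__, torch.version.cuda)` on a CUDA build.
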

**Multi-GPU setup** (training only, not for development):
```python
import torch

# `model` is assumed to be an existing torch.nn.Module
# Detect available GPUs
n_gpus = torch.cuda.device_count()
if n_gpus > 1:
    model = torch.nn.DataParallel(model)  # Simple, for research
    # Or for production: use DistributedDataParallel (see ml-training-patterns)
```

### Jupyter Integration

Run Jupyter as a managed kernel rather than an ad-hoc server:

```bash
# Install Jupyter in the project environment
poetry add --group dev jupyter jupyterlab ipykernel

# Register the project venv as a named Jupyter kernel
poetry run python -m ipykernel install --user --name myproject --display-name "MyProject (Python 3.11)"

# Launch JupyterLab
poetry run jupyter lab
```

Now all project notebooks run in the same environment as the source code.
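
A quick way to confirm that a notebook is really using the project kernel is to compare the interpreter prefix with the expected environment directory. The sketch below uses a hypothetical helper, `running_in_env`; the `.venv` path in the usage note is an assumption about where the project environment lives.

```python
import sys
from pathlib import Path


def running_in_env(env_dir: str) -> bool:
    """Return True if the current interpreter lives inside env_dir."""
    env_path = Path(env_dir).resolve()
    prefix = Path(sys.prefix).resolve()
    return prefix == env_path or env_path in prefix.parents
```

Running `running_in_env(".venv")` in the first notebook cell (from the project root) makes a wrong kernel selection visible before any results are produced.
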

**Recommended Jupyter extensions**:
- `nbstripout`: strips outputs before git commit
- `jupyterlab-git`: git integration in the UI
- `jupyterlab-lsp`: language server (autocomplete, type hints)

**VS Code Jupyter integration** (recommended over browser-based):
```json
// .vscode/settings.json
{
  "jupyter.kernels.filter": [
    {"path": "${workspaceFolder}/.venv/bin/python", "type": "pythonEnvironment"}
  ],
  "jupyter.notebookFileRoot": "${workspaceFolder}",
  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python"
}
```

### Docker for Reproducibility

Docker captures the entire environment: OS, CUDA, Python, and packages. Use it for:
- CI training runs
- Sharing experiments with collaborators who have different local setups
- Production serving (identical environment to training)

**Base `Dockerfile` for ML training**:
```dockerfile
# Use NVIDIA's official CUDA base image
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

# Set Python version
ENV PYTHON_VERSION=3.11
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    python${PYTHON_VERSION} \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python

# Install Poetry and keep the virtualenv inside the project directory
RUN pip install poetry==1.7.1
ENV POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1

WORKDIR /app

# Install dependencies (cached layer)
COPY pyproject.toml poetry.lock ./
RUN poetry install --no-root --without dev

# Copy source
COPY src/ ./src/
COPY configs/ ./configs/

# Install the project itself
RUN poetry install --without dev

ENTRYPOINT ["poetry", "run", "python", "-m", "src.training.train"]
```

**Docker Compose for development**:
```yaml
# docker-compose.yml
services:
  train:
    build: .
    volumes:
      - ./data:/app/data
      - ./models:/app/models
      - ./configs:/app/configs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.9.2
    # Start the tracking server explicitly and store runs in the mounted volume
    command: mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri /mlflow/mlruns
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlflow/mlruns
```

### Makefile Task Runner

Encode common tasks in a `Makefile` to eliminate "how do I run this?" questions:

```makefile
.PHONY: env train eval test lint clean

env:
	conda env create -f environment.yml || conda env update -f environment.yml --prune

train:
	poetry run python -m src.training.train $(ARGS)

eval:
	poetry run python -m src.evaluation.evaluator $(ARGS)

test:
	poetry run pytest tests/ -v

lint:
	poetry run black --check src/ tests/
	poetry run mypy src/

clean:
	find . -type f -name "*.pyc" -delete
	find . -type d -name "__pycache__" -delete
	rm -rf .pytest_cache/
```

Usage:
```bash
make env          # Set up environment
make train ARGS="optimizer.lr=1e-4"
make test
```