ato-2.0.4-py3-none-any.whl → ato-2.1.4-py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- ato/__init__.py +1 -1
- ato/scope.py +53 -23
- ato/trace.py +27 -0
- ato-2.1.4.dist-info/METADATA +1134 -0
- {ato-2.0.4.dist-info → ato-2.1.4.dist-info}/RECORD +8 -7
- ato-2.0.4.dist-info/METADATA +0 -978
- {ato-2.0.4.dist-info → ato-2.1.4.dist-info}/WHEEL +0 -0
- {ato-2.0.4.dist-info → ato-2.1.4.dist-info}/licenses/LICENSE +0 -0
- {ato-2.0.4.dist-info → ato-2.1.4.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,1134 @@
Metadata-Version: 2.4
Name: ato
Version: 2.1.4
Summary: Configuration, experimentation, and hyperparameter optimization for Python. No runtime magic. No launcher. Just Python modules you compose.
Author: ato contributors
License: MIT
Project-URL: Homepage, https://github.com/Dirac-Robot/ato
Project-URL: Repository, https://github.com/Dirac-Robot/ato
Project-URL: Documentation, https://github.com/Dirac-Robot/ato#readme
Project-URL: Issues, https://github.com/Dirac-Robot/ato/issues
Keywords: config management,experiment tracking,hyperparameter optimization,lightweight,composable,namespace isolation,machine learning
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=6.0
Requires-Dist: toml>=0.10.2
Requires-Dist: sqlalchemy>=2.0
Requires-Dist: numpy>=1.19.0
Provides-Extra: distributed
Requires-Dist: torch>=1.8.0; extra == "distributed"
Dynamic: license-file

# Ato: A Thin Operating Layer

**Minimal reproducibility for ML.**
Tracks config structure, code, and runtime so you can explain why runs differ — without a platform.

```bash
pip install ato
```

```bash
# Your experiment breaks. Here's how to debug it:
python train.py manual    # See exactly how configs merged
```
```python
finder.get_trace_statistics('my_project', 'train_step')  # See which code versions ran
finder.find_similar_runs(run_id=123)                     # Find experiments with same structure
```

**One question:** "Why did this result change?"
Ato fingerprints config structure, function logic, and runtime outputs to answer it.

---

## What Ato Is

Ato is a thin layer that fingerprints your config structure, function logic, and runtime outputs.

It doesn't replace your stack; it sits beside it to answer one question: **"Why did this result change?"**

Three pieces, zero coupling:

1. **ADict** — Structural hashing for configs (tracks architecture changes, not just values)
2. **Scope** — Priority-based config merging with reasoning and code fingerprinting
3. **SQLTracker** — Local-first experiment tracking in SQLite (zero setup, zero servers)

Each works alone. Together, they explain why experiments diverge.

**Config is not logging — it's reasoning.**
Ato makes config merge order, priority, and causality visible.

---

## Config Superpowers (That Make Reproducibility Real)

These aren't features. They're how Ato is built:

| Capability | What It Does | Why It Matters |
|------------|--------------|----------------|
| **Structural hashing** | Hash based on keys + types, not values | Detect when experiment **architecture** changes, not just hyperparameters |
| **Priority/merge reasoning** | Explicit merge order with `manual` inspection | See **why** a config value won — trace the entire merge path |
| **Namespace isolation** | Each scope owns its keys | Team/module collisions are impossible — no need for `model_lr` vs `data_lr` prefixes |
| **Code fingerprinting** | SHA256 of function bytecode, not git commits | Track **logic changes** automatically — refactoring doesn't create false versions |
| **Runtime fingerprinting** | SHA256 of actual outputs | Detect silent failures when code is unchanged but behavior differs |

**No dashboards. No servers. No ecosystems.**
Just fingerprints and SQLite.

---

## Works With Your Stack (Keep Hydra, MLflow, W&B, ...)

Ato doesn't compete with your config system or tracking platform.
It **observes and fingerprints** what you already use.

**Compose configs however you like:**
- Load Hydra/OmegaConf configs → Ato fingerprints the final merged structure (see the sketch after this list)
- Use argparse → Ato observes and integrates seamlessly
- Import OpenMMLab configs → Ato handles `_base_` inheritance automatically
- Mix YAML/JSON/TOML → Ato is format-agnostic
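
A minimal sketch of the Hydra/OmegaConf case (assumes `omegaconf` is installed; `merged.yaml` is a stand-in for your composed config, and whether `ADict` deep-converts nested dicts is not shown here, so treat this as illustrative):

```python
from omegaconf import OmegaConf
from ato.adict import ADict

# Compose/load with your existing stack, then hand the plain dict to Ato.
hydra_cfg = OmegaConf.load('merged.yaml')
config = ADict(**OmegaConf.to_container(hydra_cfg, resolve=True))
print(config.get_structural_hash())  # fingerprint of the final merged structure
```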

**Track experiments however you like:**
- Log to MLflow/W&B for dashboards → Ato tracks causality in local SQLite
- Use both together → Cloud tracking for metrics, Ato for "why did this change?"
- Or just use Ato → Zero-setup local tracking with full history

**Ato is a complement, not a replacement.**
No migration required. No lock-in. Add it incrementally.

---

## When to Use Ato

Use Ato when:

- **Experiments diverge occasionally** and you need to narrow down the cause
- **Config include/override order** changes results in unexpected ways
- **"I didn't change the code but results differ"** happens repeatedly (dependency/environment/bytecode drift)
- **Multiple people modify configs** and you need to trace who set what and why
- **You're debugging non-determinism** and need runtime fingerprints to catch silent failures

**Ato is for causality, not compliance.**
If you need audit trails or dashboards, keep using your existing tracking platform.

---

## Non-Goals

Ato is **not**:

- A pipeline orchestrator (use Airflow, Prefect, Luigi, ...)
- A hyperparameter scheduler (use Optuna, Ray Tune, ...)
- A model registry (use MLflow Model Registry, ...)
- An experiment dashboard (use MLflow, W&B, TensorBoard, ...)
- A dataset versioner (use DVC, Pachyderm, ...)

**Ato has one job:** Explain why results changed.
Everything else belongs in specialized tools.

---

## Incremental Adoption (No Migration Required)

You don't need to replace anything. Add Ato in steps:

**Step 1: Fingerprint config structure (zero code changes)**
```python
from ato.adict import ADict

config = ADict(lr=0.001, batch_size=32, model='resnet50')
print(config.get_structural_hash())  # Tracks structure, not values
```

**Step 2: Add code fingerprinting to key functions**
```python
from ato.scope import Scope

scope = Scope()

@scope.trace(trace_id='train_step')
@scope
def train_epoch(config):
    # Your training code
    pass
```

**Step 3: Add runtime fingerprinting to outputs**
```python
import numpy as np

@scope.runtime_trace(
    trace_id='predictions',
    init_fn=lambda: np.random.seed(42),   # Fix randomness
    inspect_fn=lambda preds: preds[:100]  # Track first 100
)
def evaluate(model, data):
    return model.predict(data)
```

**Step 4: Inspect config merge order**
```bash
python train.py manual  # See exactly how configs merged
```

**Step 5: Track experiments locally**
```python
from ato.db_routers.sql.manager import SQLLogger

logger = SQLLogger(config)
run_id = logger.run(tags=['baseline'])
# Your training loop
logger.log_metric('loss', loss, step=epoch)
logger.finish(status='completed')
```

---

## Quick Start

**Three lines to tracked experiments:**

```python
from ato.scope import Scope

scope = Scope()

@scope.observe(default=True)
def config(config):
    config.lr = 0.001
    config.batch_size = 32
    config.model = 'resnet50'

@scope
def train(config):
    print(f"Training {config.model} with lr={config.lr}")

if __name__ == '__main__':
    train()
```

**Run it:**
```bash
python train.py          # Uses defaults
python train.py lr=0.01  # Override from CLI
python train.py manual   # See config merge order
```
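
With the defaults above, the first command should print something like:

```
Training resnet50 with lr=0.001
```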

---

## Table of Contents

- [ADict: Structural Hashing](#adict-structural-hashing)
- [Scope: Config Reasoning](#scope-config-reasoning)
  - [Priority-based Merging](#priority-based-merging)
  - [Config Chaining](#config-chaining)
  - [Lazy Evaluation](#lazy-evaluation)
  - [MultiScope: Namespace Isolation](#multiscope-namespace-isolation)
  - [Config Documentation & Debugging](#config-documentation--debugging)
  - [Code Fingerprinting](#code-fingerprinting)
  - [Runtime Fingerprinting](#runtime-fingerprinting)
- [SQL Tracker: Local Experiment Tracking](#sql-tracker-local-experiment-tracking)
- [Hyperparameter Optimization](#hyperparameter-optimization)
- [Best Practices](#best-practices)
- [FAQ](#faq)
- [Quality Signals](#quality-signals)
- [Contributing](#contributing)

---

## ADict: Structural Hashing

`ADict` tracks when experiment **architecture** changes, not just hyperparameter values.

### Core Capabilities

| Feature | Description |
|---------|-------------|
| **Structural Hashing** | Hash based on keys + types → detect architecture changes |
| **Nested Access** | Dot notation: `config.model.lr` instead of `config['model']['lr']` |
| **Format Agnostic** | Load/save JSON, YAML, TOML |
| **Safe Updates** | `update_if_absent()` → merge without overwrites |
| **Auto-nested** | `ADict.auto()` → `config.a.b.c = 1` just works |

### Examples

#### Structural Hashing

```python
from ato.adict import ADict

# Same structure, different values
config1 = ADict(lr=0.1, epochs=100, model='resnet50')
config2 = ADict(lr=0.01, epochs=200, model='resnet101')
print(config1.get_structural_hash() == config2.get_structural_hash())  # True

# Different structure (epochs is str!)
config3 = ADict(lr=0.1, epochs='100', model='resnet50')
print(config1.get_structural_hash() == config3.get_structural_hash())  # False
```

**Why this matters:**
When results differ, you need to know if the experiment **architecture** changed or just the values.

#### Auto-nested Configs

```python
# ❌ Traditional way
config = ADict()
config.model = ADict()
config.model.backbone = ADict()
config.model.backbone.layers = [64, 128, 256]

# ✅ With ADict.auto()
config = ADict.auto()
config.model.backbone.layers = [64, 128, 256]  # Just works!
```

#### Format Agnostic

```python
# Load/save any format
config = ADict.from_file('config.json')
config.dump('config.yaml')

# Safe updates
config.update_if_absent(lr=0.01, scheduler='cosine')  # Only adds scheduler
```

---

## Scope: Config Reasoning

Scope manages configuration through **priority-based merging** with **full reasoning**.

**Config is not logging — it's reasoning.**
Scope makes merge order, priority, and causality visible.

### Priority-based Merging

```
Default Configs (priority=0)
        ↓
Named Configs (priority=0+)
        ↓
CLI Arguments (highest priority)
        ↓
Lazy Configs (computed after CLI)
```

#### Example

```python
from ato.scope import Scope

scope = Scope()

@scope.observe(default=True)  # Always applied
def defaults(config):
    config.lr = 0.001
    config.epochs = 100

@scope.observe(priority=1)  # Applied after defaults
def high_lr(config):
    config.lr = 0.01

@scope.observe(priority=2)  # Applied last
def long_training(config):
    config.epochs = 300
```

```bash
python train.py                        # lr=0.001, epochs=100
python train.py high_lr                # lr=0.01, epochs=100
python train.py high_lr long_training  # lr=0.01, epochs=300
```

### Config Chaining

Chain configs with dependencies:

```python
@scope.observe()
def base_setup(config):
    config.project_name = 'my_project'
    config.data_dir = '/data'

@scope.observe(chain_with='base_setup')  # Automatically applies base_setup first
def advanced_training(config):
    config.distributed = True
    config.mixed_precision = True

@scope.observe(chain_with=['base_setup', 'gpu_setup'])  # Multiple dependencies
def multi_node_training(config):
    config.nodes = 4
    config.world_size = 16
```

```bash
# Calling advanced_training automatically applies base_setup first
python train.py advanced_training
# Results in: base_setup → advanced_training
```

### Lazy Evaluation

**Note:** Lazy evaluation requires Python 3.8 or higher.

Compute configs **after** CLI args are applied:

```python
@scope.observe()
def base_config(config):
    config.model = 'resnet50'
    config.dataset = 'imagenet'

@scope.observe(lazy=True)  # Evaluated AFTER CLI args
def computed_config(config):
    # Adjust based on dataset
    if config.dataset == 'imagenet':
        config.num_classes = 1000
        config.image_size = 224
    elif config.dataset == 'cifar10':
        config.num_classes = 10
        config.image_size = 32
```

```bash
python train.py dataset=%cifar10% computed_config
# Results in: num_classes=10, image_size=32
```

**Python 3.11+ Context Manager:**

```python
@scope.observe()
def my_config(config):
    config.model = 'resnet50'
    config.num_layers = 50

    with Scope.lazy():  # Evaluated after CLI
        if config.model == 'resnet101':
            config.num_layers = 101
```

### MultiScope: Namespace Isolation

Manage completely separate configuration namespaces with independent priority systems.

**Use case:** Different teams own different scopes without key collisions.

```python
from ato.scope import Scope, MultiScope

model_scope = Scope(name='model')
data_scope = Scope(name='data')
scope = MultiScope(model_scope, data_scope)

@model_scope.observe(default=True)
def model_config(model):
    model.backbone = 'resnet50'
    model.lr = 0.1  # Model-specific learning rate

@data_scope.observe(default=True)
def data_config(data):
    data.dataset = 'cifar10'
    data.lr = 0.001  # Data augmentation learning rate (no conflict!)

@scope
def train(model, data):  # Named parameters match scope names
    # Both have 'lr' but in separate namespaces!
    print(f"Model LR: {model.lr}, Data LR: {data.lr}")
```

**Key advantage:** `model.lr` and `data.lr` are completely independent. No naming prefixes needed.

**CLI with MultiScope:**

```bash
# Override model scope only
python train.py model.backbone=%resnet101%

# Override both
python train.py model.backbone=%resnet101% data.dataset=%imagenet%
```

### Config Documentation & Debugging

**The `manual` command** visualizes the exact order of configuration application.

```python
@scope.observe(default=True)
def config(config):
    config.lr = 0.001
    config.batch_size = 32
    config.model = 'resnet50'

@scope.manual
def config_docs(config):
    config.lr = 'Learning rate for optimizer'
    config.batch_size = 'Number of samples per batch'
    config.model = 'Model architecture (resnet50, resnet101, etc.)'
```

```bash
python train.py manual
```

**Output:**
```
--------------------------------------------------
[Scope "config"]
(The Applying Order of Views)
config → (CLI Inputs)

(User Manuals)
lr: Learning rate for optimizer
batch_size: Number of samples per batch
model: Model architecture (resnet50, resnet101, etc.)
--------------------------------------------------
```

**Why this matters:**
When debugging "why is this config value not what I expect?", you see **exactly** which function set it and in what order.

**Complex example:**

```python
@scope.observe(default=True)
def defaults(config):
    config.lr = 0.001

@scope.observe(priority=1)
def experiment_config(config):
    config.lr = 0.01

@scope.observe(priority=2)
def another_config(config):
    config.lr = 0.1

@scope.observe(lazy=True)
def adaptive_lr(config):
    if config.batch_size > 64:
        config.lr = config.lr * 2
```

When you run `python train.py manual`, you see:
```
(The Applying Order of Views)
defaults → experiment_config → another_config → (CLI Inputs) → adaptive_lr
```

Now it's **crystal clear** why `lr=0.1` (from `another_config`) and not `0.01`!

### Code Fingerprinting

Track **logic changes** automatically, ignoring cosmetic edits.

#### Static Tracing (`@scope.trace`)

Generates a fingerprint of the function's **logic**, not its name or formatting:

```python
# These three functions have IDENTICAL fingerprints
@scope.trace(trace_id='train_step')
@scope
def train_v1(config):
    loss = model(data)
    return loss

@scope.trace(trace_id='train_step')
@scope
def train_v2(config):
    # Added comments
    loss = model(data)  # Compute loss
    return loss

@scope.trace(trace_id='train_step')
@scope
def completely_different_name(config):
    loss=model(data)  # Different whitespace
    return loss
```

All three produce the **same fingerprint** because the underlying logic is identical.

**When fingerprints change:**

```python
@scope.trace(trace_id='train_step')
@scope
def train_v1(config):
    loss = model(data)
    return loss

@scope.trace(trace_id='train_step')
@scope
def train_v2(config):
    loss = model(data) * 2  # ← Logic changed!
    return loss
```

Now fingerprints differ — you've changed the actual computation.

**Example: Catching refactoring bugs**

```python
# Original implementation
@scope.trace(trace_id='forward_pass')
@scope
def forward(model, x):
    out = model(x)
    return out

# Safe refactoring: Added comments, changed variable name, different whitespace
@scope.trace(trace_id='forward_pass')
@scope
def forward(model,x):
    # Forward pass through model
    result=model(x)  # No spaces
    return result
```

These have **the same fingerprint** because the underlying logic is identical — only cosmetic changes.

```python
# Unsafe refactoring: Logic changed
@scope.trace(trace_id='forward_pass')
@scope
def forward(model, x):
    features = model.backbone(x)  # Now calling backbone + head separately!
    logits = model.head(features)
    return logits
```

This has a **different fingerprint** — the logic changed. If you expected them to be equivalent but they have different fingerprints, you've caught a refactoring bug.

### Runtime Fingerprinting

Track what the function **produces**, not what it does.

#### Runtime Tracing (`@scope.runtime_trace`)

```python
import numpy as np

# Basic: Track full output
@scope.runtime_trace(trace_id='predictions')
@scope
def evaluate(model, data):
    return model.predict(data)

# With init_fn: Fix randomness for reproducibility
@scope.runtime_trace(
    trace_id='predictions',
    init_fn=lambda: np.random.seed(42)  # Initialize before execution
)
@scope
def evaluate_with_dropout(model, data):
    return model.predict(data)  # Now deterministic

# With inspect_fn: Track specific parts of output
@scope.runtime_trace(
    trace_id='predictions',
    inspect_fn=lambda preds: preds[:100]  # Only hash first 100 predictions
)
@scope
def evaluate_large_output(model, data):
    return model.predict(data)

# Advanced: Type-only checking (ignore values)
@scope.runtime_trace(
    trace_id='predictions',
    inspect_fn=lambda preds: type(preds).__name__  # Track output type only
)
@scope
def evaluate_structure(model, data):
    return model.predict(data)
```

**Parameters:**
- `init_fn`: Optional function called before execution (e.g., seed fixing, device setup)
- `inspect_fn`: Optional function to extract/filter what to track (e.g., first N items, specific fields, types only)

Even if code hasn't changed, if predictions differ, the runtime fingerprint changes.

#### When to Use Each

**Use `@scope.trace()` when:**
- You want to track code changes automatically
- You're refactoring and want to isolate performance impact
- You need to audit "which code produced this result?"
- You want to ignore cosmetic changes (comments, whitespace, renaming)

**Use `@scope.runtime_trace()` when:**
- You want to detect **silent failures** (code unchanged, output wrong)
- You're debugging non-determinism
- You need to verify model behavior across versions
- You care about what the function produces, not how it's written

**Use both when:**
- Building production ML systems
- Running long-term research experiments
- Multiple people modifying the same codebase

---

## SQL Tracker: Local Experiment Tracking

Lightweight experiment tracking using SQLite.

### Why SQL Tracker?

- **Zero Setup**: Just a SQLite file, no servers
- **Full History**: Track all runs, metrics, and artifacts
- **Smart Search**: Find similar experiments by config structure
- **Code Versioning**: Track code changes via fingerprints
- **Offline-first**: No network required

### Database Schema

```
Project (my_ml_project)
├── Experiment (run_1)
│   ├── config: {...}
│   ├── structural_hash: "abc123..."
│   ├── Metrics: [loss, accuracy, ...]
│   ├── Artifacts: [model.pt, plots/*, ...]
│   └── Fingerprints: [model_forward, train_step, ...]
├── Experiment (run_2)
└── ...
```

### Usage

#### Logging Experiments

```python
from ato.db_routers.sql.manager import SQLLogger
from ato.adict import ADict

# Setup config
config = ADict(
    experiment=ADict(
        project_name='image_classification',
        sql=ADict(db_path='sqlite:///experiments.db')
    ),
    # Your hyperparameters
    lr=0.001,
    batch_size=32,
    model='resnet50'
)

# Create logger
logger = SQLLogger(config)

# Start experiment run
run_id = logger.run(tags=['baseline', 'resnet50', 'cifar10'])

# Training loop
for epoch in range(100):
    # Your training code
    train_loss = train_one_epoch()
    val_acc = validate()

    # Log metrics
    logger.log_metric('train_loss', train_loss, step=epoch)
    logger.log_metric('val_accuracy', val_acc, step=epoch)

# Log artifacts
logger.log_artifact(run_id, 'checkpoints/model_best.pt',
                    data_type='model',
                    metadata={'epoch': best_epoch})

# Finish run
logger.finish(status='completed')
```

#### Querying Experiments

```python
from ato.db_routers.sql.manager import SQLFinder

finder = SQLFinder(config)

# Get all runs in project
runs = finder.get_runs_in_project('image_classification')
for run in runs:
    print(f"Run {run.id}: {run.config.model} - {run.status}")

# Find best performing run
best_run = finder.find_best_run(
    project_name='image_classification',
    metric_key='val_accuracy',
    mode='max'  # or 'min' for loss
)
print(f"Best config: {best_run.config}")

# Find similar experiments (same config structure)
similar = finder.find_similar_runs(run_id=123)
print(f"Found {len(similar)} runs with similar config structure")

# Trace statistics (code fingerprints)
stats = finder.get_trace_statistics('image_classification', trace_id='model_forward')
print(f"Model forward pass has {stats['static_trace_versions']} versions")
```

### Features

| Feature | Description |
|---------|-------------|
| **Structural Hash** | Auto-track config structure changes |
| **Metric Logging** | Time-series metrics with step tracking |
| **Artifact Management** | Track model checkpoints, plots, data files |
| **Fingerprint Tracking** | Version control for code (static & runtime) |
| **Smart Search** | Find similar configs, best runs, statistics |

---

## Hyperparameter Optimization

Built-in **Hyperband** algorithm for efficient hyperparameter search with early stopping.

### How Hyperband Works

Hyperband uses successive halving (a worked sketch follows the list):
1. Start with many configs, train briefly
2. Keep top performers, discard poor ones
3. Train survivors longer
4. Repeat until one winner remains
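
A back-of-the-envelope sketch of the schedule (assumed here: each round keeps roughly `halving_rate` of the surviving configs and stops at `num_min_samples`; the exact rounding ato applies may differ):

```python
import math

# 20 sampled configs, keep the top 30% each round, stop once <= 3 remain.
n, halving_rate, num_min_samples = 20, 0.3, 3
rounds = [n]
while n > num_min_samples:
    n = math.ceil(n * halving_rate)  # survivors of this round
    rounds.append(n)
print(rounds)  # [20, 6, 2]; each surviving round earns a longer training budget
```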

### Basic Usage

```python
from ato.adict import ADict
from ato.hyperopt.hyperband import HyperBand
from ato.scope import Scope

scope = Scope()

# Define search space
search_spaces = ADict(
    lr=ADict(
        param_type='FLOAT',
        param_range=(1e-5, 1e-1),
        num_samples=20,
        space_type='LOG'  # Logarithmic spacing
    ),
    batch_size=ADict(
        param_type='INTEGER',
        param_range=(16, 128),
        num_samples=5,
        space_type='LOG'
    ),
    model=ADict(
        param_type='CATEGORY',
        categories=['resnet50', 'resnet101', 'efficientnet_b0']
    )
)

# Create Hyperband optimizer
hyperband = HyperBand(
    scope,
    search_spaces,
    halving_rate=0.3,   # Keep top 30% each round
    num_min_samples=3,  # Stop when <= 3 configs remain
    mode='max'          # Maximize metric (use 'min' for loss)
)

@hyperband.main
def train(config):
    # Your training code
    model = create_model(config.model)
    optimizer = Adam(lr=config.lr)

    # Use __num_halved__ for early stopping
    num_epochs = compute_epochs(config.__num_halved__)

    # Train and return metric
    val_acc = train_and_evaluate(model, optimizer, num_epochs)
    return val_acc

if __name__ == '__main__':
    # Run hyperparameter search
    best_result = train()
    print(f"Best config: {best_result.config}")
    print(f"Best metric: {best_result.metric}")
```
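
`compute_epochs` above is user-supplied. One possible shape, assuming `__num_halved__` counts completed halving rounds starting at 0 (this helper is hypothetical, not part of ato):

```python
def compute_epochs(num_halved, base_epochs=5, growth=3):
    # Train briefly in early rounds; give each surviving round a larger budget.
    return base_epochs * growth ** num_halved  # 5, 15, 45, ...
```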

### Parameter Types

| Type | Description | Example |
|------|-------------|---------|
| `FLOAT` | Continuous values | Learning rate, dropout |
| `INTEGER` | Discrete integers | Batch size, num layers |
| `CATEGORY` | Categorical choices | Model type, optimizer |

Space types (compared below):
- `LOG`: Logarithmic spacing (good for learning rates)
- `LINEAR`: Linear spacing (default)
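
The difference matters for quantities that vary over orders of magnitude. A quick numpy illustration (ato's internal sampling may differ in detail):

```python
import numpy as np

# 5 samples across (1e-5, 1e-1)
print(np.logspace(-5, -1, num=5))  # [1.e-05 1.e-04 1.e-03 1.e-02 1.e-01]
print(np.linspace(1e-5, 1e-1, 5))  # ≈ [1.0e-05 2.5e-02 5.0e-02 7.5e-02 1.0e-01]
```

Linear spacing wastes most samples near the top of the range; log spacing covers every decade.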

### Distributed Search
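
Distributed search relies on the optional `distributed` extra (it pulls in `torch>=1.8.0`, per the package metadata):

```bash
pip install "ato[distributed]"
```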

```python
from ato.hyperopt.hyperband import DistributedHyperBand
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed training
dist.init_process_group(backend='nccl')
rank = dist.get_rank()
world_size = dist.get_world_size()

# Create distributed hyperband
hyperband = DistributedHyperBand(
    scope,
    search_spaces,
    halving_rate=0.3,
    num_min_samples=3,
    mode='max',
    rank=rank,
    world_size=world_size,
    backend='pytorch'
)

@hyperband.main
def train(config):
    # Your distributed training code
    model = create_model(config)
    model = DDP(model, device_ids=[rank])
    metric = train_and_evaluate(model)
    return metric

if __name__ == '__main__':
    result = train()
    if rank == 0:
        print(f"Best config: {result.config}")
```

---

## Best Practices

### 1. Project Structure

```
my_project/
├── configs/
│   ├── default.py      # Default config with @scope.observe(default=True)
│   ├── models.py       # Model-specific configs
│   └── datasets.py     # Dataset configs
├── train.py            # Main training script
├── experiments.db      # SQLite experiment tracking
└── experiments/
    ├── run_001/
    │   ├── checkpoints/
    │   └── logs/
    └── run_002/
```

### 2. Config Organization

```python
# configs/default.py
from ato.scope import Scope
from ato.adict import ADict

scope = Scope()

@scope.observe(default=True)
def defaults(config):
    # Data
    config.data = ADict(
        dataset='cifar10',
        batch_size=32,
        num_workers=4
    )

    # Model
    config.model = ADict(
        backbone='resnet50',
        pretrained=True
    )

    # Training
    config.train = ADict(
        lr=0.001,
        epochs=100,
        optimizer='adam'
    )

    # Experiment tracking
    config.experiment = ADict(
        project_name='my_project',
        sql=ADict(db_path='sqlite:///experiments.db')
    )
```

### 3. Combined Workflow

```python
from ato.scope import Scope
from ato.db_routers.sql.manager import SQLLogger
from configs.default import scope

@scope
def train(config):
    # Setup experiment tracking
    logger = SQLLogger(config)
    run_id = logger.run(tags=[config.model.backbone, config.data.dataset])

    try:
        # Training loop
        for epoch in range(config.train.epochs):
            loss = train_epoch()
            acc = validate()

            logger.log_metric('loss', loss, epoch)
            logger.log_metric('accuracy', acc, epoch)

        logger.finish(status='completed')

    except Exception as e:
        logger.finish(status='failed')
        raise e

if __name__ == '__main__':
    train()
```

### 4. Reproducibility Checklist

- ✅ Use structural hashing to track config changes
- ✅ Log all hyperparameters to SQLLogger
- ✅ Tag experiments with meaningful labels
- ✅ Track artifacts (checkpoints, plots)
- ✅ Use lazy configs for derived parameters
- ✅ Document configs with `@scope.manual`
- ✅ Add code fingerprinting to key functions
- ✅ Add runtime fingerprinting to critical outputs

---

## FAQ

### Does Ato replace Hydra?

No. Hydra is excellent at config composition.
Ato is a layer that explains **why** results differ — it observes and fingerprints the final merged config.

Use them together: Hydra for composition, Ato for causality.

### Does Ato conflict with MLflow/W&B?

No. MLflow/W&B provide dashboards and cloud tracking.
Ato provides local causality tracking (config reasoning + code fingerprinting).

Use them together: MLflow/W&B for metrics/dashboards, Ato for "why did this change?"

### Do I need a server?

No. Ato uses local SQLite. Zero setup, zero network calls.

### Can I use Ato with my existing config files?

Yes. Ato is format-agnostic:
- Load YAML/JSON/TOML → Ato fingerprints the result
- Import OpenMMLab configs → Ato handles `_base_` inheritance
- Use argparse → Ato integrates seamlessly

### What if I already have experiment tracking?

Keep it. Ato complements existing tracking:
- Your tracking: metrics, artifacts, dashboards
- Ato: config reasoning, code fingerprinting, causality

No migration required.

### Is Ato production-ready?

Yes. Ato has ~100 unit tests that pass on every release.
The Python codebase is ~10 files — small, readable, auditable.

### What's the performance overhead?

Minimal:
- Config fingerprinting: microseconds
- Code fingerprinting: happens once at decoration time
- Runtime fingerprinting: depends on `inspect_fn` complexity
- SQLite logging: milliseconds per metric

### Can I self-host?

Ato runs entirely locally. There's nothing to host.
If you need centralized tracking, use MLflow/W&B alongside Ato.

---

## Quality Signals

**Every release passes 100+ unit tests.**
No unchecked code. No silent failures.

This isn't a feature. It's a commitment.

When you fingerprint experiments, you're trusting the fingerprints are correct.
When you merge configs, you're trusting the merge order is deterministic.
When you trace code, you're trusting the bytecode hashing is stable.

Ato has zero tolerance for regressions.

Tests cover every module — ADict, Scope, MultiScope, SQLTracker, HyperBand — and every edge case we've encountered in production use.

```bash
python -m pytest unit_tests/  # Run locally. Always passes.
```

**If a test fails, the release doesn't ship. Period.**

**Codebase size:** ~10 Python files
Small, readable, auditable. No magic, no metaprogramming.

---

## Requirements

- Python >= 3.7 (Python >= 3.8 required for lazy evaluation features)
- SQLAlchemy (for SQL Tracker)
- PyYAML, toml (for config serialization)

See `pyproject.toml` for full dependencies.

---

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

### Development Setup

```bash
git clone https://github.com/Dirac-Robot/ato.git
cd ato
pip install -e .
```

### Running Tests

```bash
python -m pytest unit_tests/
```

---

## License

MIT License