@synsci/cli-darwin-x64 1.1.80 → 1.1.82
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -65,9 +65,44 @@ If Modal CLI isn't installed: `pip install modal` (no `modal setup` needed — e
 
 | Command | When to Use |
 |---------|-------------|
-| `modal run script.py` |
+| `modal run script.py` | Quick jobs that complete in minutes |
 | `modal serve script.py` | Development: live reload on code changes, test endpoints locally |
 | `modal deploy script.py` | Production: persistent endpoints, scheduled jobs, always-on services |
+| `modal deploy` + `Function.lookup().spawn()` | **Long-running training (disconnect-safe)** |
+
+### CRITICAL: Long-Running Training Must Be Disconnect-Safe
+
+**NEVER use `.spawn()` or `.remote()` from `@app.local_entrypoint()` for training runs that take more than a few minutes.** If the local process dies (laptop battery, closed terminal, SSH disconnect), Modal tears down the app context and kills the spawned function.
+
+**The disconnect-safe pattern:**
+
+```python
+# train.py — Step 1: Define your training function
+import modal
+
+app = modal.App("my-training")
+volume = modal.Volume.from_name("training-data", create_if_missing=True)
+image = modal.Image.debian_slim(python_version="3.11").uv_pip_install("torch", "transformers")
+
+@app.function(gpu="H100", image=image, volumes={"/data": volume}, timeout=86400)
+def train():
+    # Your training code here — runs entirely in Modal's cloud
+    ...
+    volume.commit()  # Save checkpoints
+```
+
+```bash
+# Step 2: Deploy the app (persists independently of your machine)
+modal deploy train.py
+
+# Step 3: Trigger training (fire-and-forget, survives local disconnect)
+python -c "import modal; modal.Function.lookup('my-training', 'train').spawn()"
+
+# Step 4: Monitor from anywhere (even a different machine)
+modal app logs my-training
+```
+
+This pattern ensures training runs are completely decoupled from your local machine. The function runs on Modal's infrastructure and persists even if you close your laptop, lose internet, or reboot.
 
 ## GPU Selection Guide
 
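(Editorial aside on the fire-and-forget one-liner added in the hunk above: it discards the `FunctionCall` handle, so the call ID needed to retrieve results later is lost. A minimal sketch of the persistence half, runnable without Modal; the `fc-...` ID is a stand-in for `handle.object_id`, and `save_call_id`/`load_call_id` are hypothetical helpers, not Modal API:)

```python
# Editorial sketch: persist a spawned call's ID so results can be fetched later.
# The "fc-..." string is a stand-in; in real use you would pass handle.object_id
# returned by modal.Function.lookup(...).spawn(...).
import json
from pathlib import Path

def save_call_id(call_id: str, path: str = "last_call.json") -> None:
    """Write the function call ID to a small JSON file."""
    Path(path).write_text(json.dumps({"call_id": call_id}))

def load_call_id(path: str = "last_call.json") -> str:
    """Read back the saved function call ID."""
    return json.loads(Path(path).read_text())["call_id"]

if __name__ == "__main__":
    save_call_id("fc-0123456789")  # stand-in for handle.object_id
    print(load_call_id())          # prints the saved ID: fc-0123456789
```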
@@ -362,6 +397,7 @@ def my_function():
 
 | Issue | Fix |
 |-------|-----|
+| **Training dies when laptop closes** | **NEVER use `.spawn()`/`.remote()` from `local_entrypoint()` for long jobs. Use `modal deploy` + `Function.lookup().spawn()` pattern (see Execution Modes above)** |
 | Cold start slow | Use `@modal.enter()` for model loading, increase `container_idle_timeout`, use memory snapshots |
 | GPU OOM | Use larger GPU, enable gradient checkpointing, use mixed precision (bf16) |
 | Image build fails | Pin versions, use `uv_pip_install`, use multi-stage builds |
@@ -1,6 +1,103 @@
 # Modal Advanced Patterns
 
-Advanced patterns for Modal including multi-node training, distributed primitives, sandbox workflows, memory snapshots, and integration with synsc/other skills.
+Advanced patterns for Modal including disconnect-safe training, multi-node training, distributed primitives, sandbox workflows, memory snapshots, and integration with synsc/other skills.
+
+## Disconnect-Safe Training (CRITICAL)
+
+**Problem:** When you launch training via `@app.local_entrypoint()` using `.spawn()` or `.remote()`, the training run is tied to your local process. If the local process dies — laptop battery, closed terminal, SSH disconnect, or coding agent crash — Modal tears down the app context and **kills the training function**, even if it was running in Modal's cloud.
+
+**This is the #1 cause of lost training runs.** Always use the disconnect-safe pattern for any job that takes more than a few minutes.
+
+### Anti-Pattern (DO NOT USE for long training)
+
+```python
+# BAD — training dies if local process dies
+@app.local_entrypoint()
+def main():
+    train.spawn()  # DANGER: tied to this process
+    # or: train.remote()  # DANGER: also tied
+```
+
+### Correct Pattern: Deploy + Lookup + Spawn
+
+```python
+# train.py — define training function (NO local_entrypoint needed)
+import modal
+
+app = modal.App("grpo-training")
+vol = modal.Volume.from_name("training-checkpoints", create_if_missing=True)
+image = modal.Image.debian_slim(python_version="3.11").uv_pip_install(
+    "torch", "transformers", "trl", "datasets"
+)
+
+@app.function(gpu="H100:4", image=image, volumes={"/checkpoints": vol}, timeout=86400)
+def train(config: dict = None):
+    """Training function — runs entirely in Modal cloud, no local dependency."""
+    config = config or {}
+    # ... training code ...
+    vol.commit()  # Save checkpoints
+    return {"status": "complete", "checkpoints": "/checkpoints/final"}
+```
+
+```bash
+# Step 1: Deploy the app (one-time, persists independently)
+modal deploy train.py
+
+# Step 2: Fire-and-forget launch (survives any local disconnect)
+python -c "
+import modal
+fn = modal.Function.lookup('grpo-training', 'train')
+handle = fn.spawn({'lr': 1e-5, 'epochs': 3})
+print(f'Launched! Function call ID: {handle.object_id}')
+"
+
+# Step 3: Monitor from anywhere (different terminal, different machine)
+modal app logs grpo-training
+
+# Step 4: Check results later
+python -c "
+import modal
+fn = modal.Function.lookup('grpo-training', 'train')
+# Get result from a specific call if you saved the ID
+# Or just check the Volume for checkpoints
+"
+```
+
+### Quick Helper Script
+
+For convenience, create a `launch.py` alongside your training script:
+
+```python
+# launch.py — fire-and-forget launcher
+import modal, sys, json
+
+app_name = sys.argv[1]  # e.g., "grpo-training"
+fn_name = sys.argv[2]   # e.g., "train"
+config = json.loads(sys.argv[3]) if len(sys.argv) > 3 else {}
+
+fn = modal.Function.lookup(app_name, fn_name)
+handle = fn.spawn(config)
+print(f"Training launched! ID: {handle.object_id}")
+print(f"Monitor: modal app logs {app_name}")
+print("Safe to close this terminal — training runs independently.")
+```
+
+```bash
+# Usage:
+modal deploy train.py
+python launch.py grpo-training train '{"lr": 1e-5}'
+# Now safe to close everything
+```
+
+### When to Use Each Pattern
+
+| Pattern | Safe to Disconnect? | Use For |
+|---------|---------------------|---------|
+| `modal run script.py` | NO | Quick tests (<5 min) |
+| `local_entrypoint()` + `.remote()` | NO | Interactive dev, short jobs |
+| `local_entrypoint()` + `.spawn()` | NO | **NEVER for training** |
+| `modal deploy` + `Function.lookup().spawn()` | **YES** | Training, batch jobs, anything >5 min |
+| `modal deploy` + REST API trigger | **YES** | CI/CD, automated pipelines |
 
 ## Multi-Node Training (Beta)
 

package/bin/synsc  CHANGED
Binary file
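(Editorial aside on the `launch.py` helper added in the last hunk: its argument handling can be factored into a pure function that is testable without Modal installed or deployed. `parse_launch_args` is a hypothetical name introduced here for illustration, not part of Modal's API:)

```python
# Editorial sketch: launch.py's argument handling as a pure, testable function.
# parse_launch_args is a hypothetical helper mirroring the sys.argv logic above.
import json

def parse_launch_args(argv: list) -> tuple:
    """Split argv (after the script name) into (app_name, function_name, config dict)."""
    app_name, fn_name = argv[0], argv[1]
    config = json.loads(argv[2]) if len(argv) > 2 else {}
    return app_name, fn_name, config

print(parse_launch_args(["grpo-training", "train", '{"lr": 1e-5}']))
# -> ('grpo-training', 'train', {'lr': 1e-05})
```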