@synsci/cli-darwin-x64 1.1.80 → 1.1.81

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -65,9 +65,44 @@ If Modal CLI isn't installed: `pip install modal` (no `modal setup` needed — e
 
  | Command | When to Use |
  |---------|-------------|
- | `modal run script.py` | One-off jobs: training, batch processing, data pipelines |
+ | `modal run script.py` | Quick jobs that complete in minutes |
  | `modal serve script.py` | Development: live reload on code changes, test endpoints locally |
  | `modal deploy script.py` | Production: persistent endpoints, scheduled jobs, always-on services |
+ | `modal deploy` + `Function.lookup().spawn()` | **Long-running training (disconnect-safe)** |
+
+ ### CRITICAL: Long-Running Training Must Be Disconnect-Safe
+
+ **NEVER use `.spawn()` or `.remote()` from `@app.local_entrypoint()` for training runs that take more than a few minutes.** If the local process dies (laptop battery, closed terminal, SSH disconnect), Modal tears down the app context and kills the spawned function.
+
+ **The disconnect-safe pattern:**
+
+ ```python
+ # train.py — Step 1: Define your training function
+ import modal
+
+ app = modal.App("my-training")
+ volume = modal.Volume.from_name("training-data", create_if_missing=True)
+ image = modal.Image.debian_slim(python_version="3.11").uv_pip_install("torch", "transformers")
+
+ @app.function(gpu="H100", image=image, volumes={"/data": volume}, timeout=86400)
+ def train():
+     # Your training code here — runs entirely in Modal's cloud
+     ...
+     volume.commit()  # Save checkpoints
+ ```
+
+ ```bash
+ # Step 2: Deploy the app (persists independently of your machine)
+ modal deploy train.py
+
+ # Step 3: Trigger training (fire-and-forget, survives local disconnect)
+ python -c "import modal; modal.Function.lookup('my-training', 'train').spawn()"
+
+ # Step 4: Monitor from anywhere (even a different machine)
+ modal app logs my-training
+ ```
+
+ This pattern ensures training runs are completely decoupled from your local machine. The function runs on Modal's infrastructure and persists even if you close your laptop, lose internet, or reboot.
 
  ## GPU Selection Guide
 
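Editor's note: the fire-and-forget trigger above discards the spawn handle. Persisting the call ID lets a later session reattach to the run. A minimal sketch — the helper names `save_call_id`/`load_call_id` are hypothetical (not part of this package), and the reattach path via `modal.FunctionCall.from_id(...).get()` is shown only in comments:

```python
# Hypothetical helper: persist the spawn handle's ID so a later session,
# possibly on another machine, can reattach to the running call.
import json
import pathlib

def save_call_id(call_id: str, path: str = "last_call.json") -> None:
    # Record the ID returned by fn.spawn(...).object_id
    pathlib.Path(path).write_text(json.dumps({"call_id": call_id}))

def load_call_id(path: str = "last_call.json") -> str:
    # Later, with Modal credentials configured:
    #   fc = modal.FunctionCall.from_id(load_call_id())
    #   result = fc.get()  # blocks until the training function returns
    return json.loads(pathlib.Path(path).read_text())["call_id"]
```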
@@ -362,6 +397,7 @@ def my_function():
 
  | Issue | Fix |
  |-------|-----|
+ | **Training dies when laptop closes** | **NEVER use `.spawn()`/`.remote()` from `local_entrypoint()` for long jobs. Use the `modal deploy` + `Function.lookup().spawn()` pattern (see Execution Modes above)** |
  | Cold start slow | Use `@modal.enter()` for model loading, increase `container_idle_timeout`, use memory snapshots |
  | GPU OOM | Use larger GPU, enable gradient checkpointing, use mixed precision (bf16) |
  | Image build fails | Pin versions, use `uv_pip_install`, use multi-stage builds |
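Editor's note: for the "Image build fails" row, a minimal sketch of a pinned, reproducible image built with `uv_pip_install` (the exact version pins below are illustrative assumptions, not tested recommendations):

```python
import modal

# Pin exact versions so image rebuilds are reproducible.
# (Versions shown are illustrative placeholders, not recommendations.)
image = (
    modal.Image.debian_slim(python_version="3.11")
    .uv_pip_install("torch==2.4.0", "transformers==4.44.0")
)
```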
@@ -1,6 +1,103 @@
  # Modal Advanced Patterns
 
- Advanced patterns for Modal including multi-node training, distributed primitives, sandbox workflows, memory snapshots, and integration with synsc/other skills.
+ Advanced patterns for Modal including disconnect-safe training, multi-node training, distributed primitives, sandbox workflows, memory snapshots, and integration with synsc/other skills.
+
+ ## Disconnect-Safe Training (CRITICAL)
+
+ **Problem:** When you launch training via `@app.local_entrypoint()` using `.spawn()` or `.remote()`, the training run is tied to your local process. If the local process dies — laptop battery, closed terminal, SSH disconnect, or coding agent crash — Modal tears down the app context and **kills the training function**, even if it was running in Modal's cloud.
+
+ **This is the #1 cause of lost training runs.** Always use the disconnect-safe pattern for any job that takes more than a few minutes.
+
+ ### Anti-Pattern (DO NOT USE for long training)
+
+ ```python
+ # BAD — training dies if the local process dies
+ @app.local_entrypoint()
+ def main():
+     train.spawn()   # DANGER: tied to this process
+     # or: train.remote()  # DANGER: also tied
+ ```
+
+ ### Correct Pattern: Deploy + Lookup + Spawn
+
+ ```python
+ # train.py — define training function (NO local_entrypoint needed)
+ import modal
+
+ app = modal.App("grpo-training")
+ vol = modal.Volume.from_name("training-checkpoints", create_if_missing=True)
+ image = modal.Image.debian_slim(python_version="3.11").uv_pip_install(
+     "torch", "transformers", "trl", "datasets"
+ )
+
+ @app.function(gpu="H100:4", image=image, volumes={"/checkpoints": vol}, timeout=86400)
+ def train(config: dict | None = None):
+     """Training function — runs entirely in Modal cloud, no local dependency."""
+     config = config or {}
+     # ... training code ...
+     vol.commit()  # Save checkpoints
+     return {"status": "complete", "checkpoints": "/checkpoints/final"}
+ ```
+
+ ```bash
+ # Step 1: Deploy the app (one-time, persists independently)
+ modal deploy train.py
+
+ # Step 2: Fire-and-forget launch (survives any local disconnect)
+ python -c "
+ import modal
+ fn = modal.Function.lookup('grpo-training', 'train')
+ handle = fn.spawn({'lr': 1e-5, 'epochs': 3})
+ print(f'Launched! Function call ID: {handle.object_id}')
+ "
+
+ # Step 3: Monitor from anywhere (different terminal, different machine)
+ modal app logs grpo-training
+
+ # Step 4: Check results later
+ python -c "
+ import modal
+ fn = modal.Function.lookup('grpo-training', 'train')
+ # Get the result from a specific call if you saved the ID,
+ # or just check the Volume for checkpoints
+ "
+ ```
+
+ ### Quick Helper Script
+
+ For convenience, create a `launch.py` alongside your training script:
+
+ ```python
+ # launch.py — fire-and-forget launcher
+ import json
+ import sys
+
+ import modal
+
+ app_name = sys.argv[1]  # e.g., "grpo-training"
+ fn_name = sys.argv[2]   # e.g., "train"
+ config = json.loads(sys.argv[3]) if len(sys.argv) > 3 else {}
+
+ fn = modal.Function.lookup(app_name, fn_name)
+ handle = fn.spawn(config)
+ print(f"Training launched! ID: {handle.object_id}")
+ print(f"Monitor: modal app logs {app_name}")
+ print("Safe to close this terminal — training runs independently.")
+ ```
+
+ ```bash
+ # Usage:
+ modal deploy train.py
+ python launch.py grpo-training train '{"lr": 1e-5}'
+ # Now safe to close everything
+ ```
+
+ ### When to Use Each Pattern
+
+ | Pattern | Safe to Disconnect? | Use For |
+ |---------|---------------------|---------|
+ | `modal run script.py` | NO | Quick tests (<5 min) |
+ | `local_entrypoint()` + `.remote()` | NO | Interactive dev, short jobs |
+ | `local_entrypoint()` + `.spawn()` | NO | **NEVER for training** |
+ | `modal deploy` + `Function.lookup().spawn()` | **YES** | Training, batch jobs, anything >5 min |
+ | `modal deploy` + REST API trigger | **YES** | CI/CD, automated pipelines |
 
  ## Multi-Node Training (Beta)
 
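Editor's note: the "Check results later" step above can be wrapped in a polling loop so you reconnect periodically instead of holding one long connection. A sketch under stated assumptions: `wait_for_result` is a hypothetical helper name, and it assumes `modal.FunctionCall.get(timeout=0)` raises `TimeoutError` while the call is still running; the Modal call itself appears only in the comment so the helper stays testable without a deployed app:

```python
# Hypothetical polling helper: wait for a spawned call's result with periodic
# reconnects. `get` is any zero-argument callable that raises TimeoutError while
# the job is still running -- for Modal, something like:
#     fc = modal.FunctionCall.from_id(call_id)
#     wait_for_result(lambda: fc.get(timeout=0))
import time

def wait_for_result(get, poll_seconds: float = 30.0, max_seconds: float = 86400.0):
    deadline = time.monotonic() + max_seconds
    while True:
        try:
            return get()  # returns the training function's return value when done
        except TimeoutError:
            if time.monotonic() >= deadline:
                raise  # give up after max_seconds
            time.sleep(poll_seconds)
```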
package/bin/synsc CHANGED
Binary file
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@synsci/cli-darwin-x64",
-   "version": "1.1.80",
+   "version": "1.1.81",
    "os": [
      "darwin"
    ],