@synsci/cli-darwin-x64 1.1.59 → 1.1.60
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/grpo-rl-training/README.md +1 -1
- package/bin/skills/hugging-face-evaluation/examples/.env.example +7 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +0 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +0 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +0 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +0 -0
- package/bin/skills/hugging-face-trackio/.claude-plugin/plugin.json +19 -0
- package/bin/skills/modal/SKILL.md +316 -275
- package/bin/skills/modal/references/advanced-patterns.md +598 -0
- package/bin/skills/modal/references/examples-catalog.md +423 -0
- package/bin/skills/prime-intellect-lab/README.md +69 -0
- package/bin/skills/prime-intellect-lab/SKILL.md +598 -0
- package/bin/skills/prime-intellect-lab/templates/basic_rl_training.toml +82 -0
- package/bin/skills/tensorpool/SKILL.md +519 -0
- package/bin/synsc +0 -0
- package/package.json +1 -1
- package/bin/skills/modal/references/advanced-usage.md +0 -503
@@ -1,341 +1,382 @@
 ---
 name: modal-serverless-gpu
-description: Serverless GPU cloud platform for
-version:
+description: Serverless GPU cloud platform for ML workloads — inference serving, batch processing, training, web endpoints, sandboxes, and scheduled jobs. Use when you need on-demand GPUs without infrastructure management, deploying models as auto-scaling APIs, running batch jobs, or executing untrusted code in sandboxes.
+version: 2.0.0
 author: Synthetic Sciences
 license: MIT
-tags: [Infrastructure, Serverless, GPU, Cloud, Deployment, Modal]
-dependencies: [modal>=0.
+tags: [Infrastructure, Serverless, GPU, Cloud, Deployment, Modal, Inference, Training, Sandboxes, Web Endpoints]
+dependencies: [modal>=0.73.0]
 ---
 
 # Modal Serverless GPU
 
-
+Modal is a serverless GPU cloud platform. Everything is Python code — no YAML, no Docker, no Kubernetes. Pay per second, scale to zero, scale to hundreds of GPUs instantly.
 
-
+This skill provides the decision framework, GPU guide, API reference, and a catalog of 50+ production-ready examples from Modal's official library. For detailed implementations, refer to the example catalog and reference docs below.
 
-
-- Running GPU-intensive ML workloads without managing infrastructure
-- Deploying ML models as auto-scaling APIs
-- Running batch processing jobs (training, inference, data processing)
-- Need pay-per-second GPU pricing without idle costs
-- Prototyping ML applications quickly
-- Running scheduled jobs (cron-like workloads)
+## When to Use Modal
 
-**
-
-
-
-
-
-
+**Modal is the RIGHT choice for:**
+
+| Workload | Why Modal |
+|----------|-----------|
+| **Inference serving** | Auto-scaling endpoints, zero-downtime deploys, sub-second cold starts |
+| **Batch processing** | Fan out to 100+ GPUs with `.map()`, pay only for compute time |
+| **Web endpoints / APIs** | FastAPI/ASGI/WSGI support, custom domains, streaming |
+| **Sandbox execution** | Run untrusted code safely, build coding agents, code interpreters |
+| **Scheduled jobs** | Cron/periodic with `modal.Cron` and `modal.Period` |
+| **Full-parameter training** | Multi-GPU (up to 8), multi-node clusters (beta) |
+| **Custom architectures** | Full control over container images, any framework |
+| **Data pipelines** | Parallel processing, S3 mounts, Volume storage |
 
 **Use alternatives instead:**
-- **RunPod**: For longer-running pods with persistent state
-- **Lambda Labs**: For reserved GPU instances
-- **SkyPilot**: For multi-cloud orchestration and cost optimization
-- **Kubernetes**: For complex multi-service architectures
 
-
+| Need | Use Instead |
+|------|-------------|
+| Managed LoRA fine-tuning (no infra) | **Tinker** |
+| Hosted RL / agentic post-training | **Prime Intellect Lab** |
+| Reserved dedicated instances | **Lambda Labs** |
+| Multi-cloud cost optimization | **SkyPilot** |
+| Long-running persistent pods | **RunPod** |
+
+## Credential Setup (synsc)
 
-
+Credentials are auto-injected via SynSci. Verify before running Modal workloads:
 
 ```bash
-
-
+# Check credentials are set (NEVER echo the actual values)
+[ -n "$MODAL_TOKEN_ID" ] && echo "MODAL_TOKEN_ID set" || echo "NOT SET"
+[ -n "$MODAL_TOKEN_SECRET" ] && echo "MODAL_TOKEN_SECRET set" || echo "NOT SET"
 ```
 
-
+**IMPORTANT**: Always rely on `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` env vars. Do NOT read from `~/.modal.toml`.
 
-
-import modal
+If Modal CLI isn't installed: `pip install modal` (no `modal setup` needed — env vars handle auth).
 
-
+## Quick Reference
 
-
-
-
-
+| Topic | Reference |
+|-------|-----------|
+| Example Catalog (50+ examples) | [Examples Catalog](references/examples-catalog.md) |
+| Advanced Patterns & synsc Integration | [Advanced Patterns](references/advanced-patterns.md) |
+| Troubleshooting | [Troubleshooting](references/troubleshooting.md) |
 
-
-def main():
-    print(gpu_info.remote())
-```
-
-Run: `modal run hello_gpu.py`
-
-### Basic inference endpoint
+## Execution Modes
 
+| Command | When to Use |
+|---------|-------------|
+| `modal run script.py` | One-off jobs: training, batch processing, data pipelines |
+| `modal serve script.py` | Development: live reload on code changes, test endpoints locally |
+| `modal deploy script.py` | Production: persistent endpoints, scheduled jobs, always-on services |
+
+## GPU Selection Guide
+
+| GPU | VRAM | $/hr (approx) | Best For |
+|-----|------|----------------|----------|
+| `T4` | 16GB | ~$0.59 | Budget inference, small models (<7B quantized) |
+| `L4` | 24GB | ~$0.73 | Inference, Ada Lovelace architecture |
+| `A10G` | 24GB | ~$1.10 | Training/inference, 3.3x faster than T4 |
+| `L40S` | 48GB | ~$1.65 | **Best cost/perf for inference** (7B-13B FP16) |
+| `A100-40GB` | 40GB | ~$3.15 | Large model training |
+| `A100-80GB` | 80GB | ~$4.05 | Very large models, DeepSpeed |
+| `H100` | 80GB | ~$4.25 | Fastest training, FP8 + Transformer Engine |
+| `H200` | 141GB | ~$4.95 | Largest VRAM, 4.8TB/s bandwidth |
+| `B200` | 192GB | Latest | Blackwell architecture, newest |
+
+**GPU specification patterns:**
 ```python
-
-
-app
-
-
-@app.cls(gpu="A10G", image=image)
-class TextGenerator:
-    @modal.enter()
-    def load_model(self):
-        from transformers import pipeline
-        self.pipe = pipeline("text-generation", model="gpt2", device=0)
-
-    @modal.method()
-    def generate(self, prompt: str) -> str:
-        return self.pipe(prompt, max_length=100)[0]["generated_text"]
-
-@app.local_entrypoint()
-def main():
-    print(TextGenerator().generate.remote("Hello, world"))
+@app.function(gpu="A100")            # Single GPU
+@app.function(gpu="A100-80GB")       # Specific memory variant
+@app.function(gpu="H100:4")          # Multi-GPU (up to 8)
+@app.function(gpu=["H100", "A100"])  # Fallback chain (try in order)
+@app.function(gpu="any")             # Any available GPU
 ```
 
-
-
-
-
-
+**Recommendations by task:**
+
+| Task | GPU | Config |
+|------|-----|--------|
+| Serve 7B model (FP16) | `L40S` or `A10G` | Single GPU |
+| Serve 70B model (AWQ/GPTQ) | `A100-80GB` or `H100` | Single GPU |
+| Serve 70B model (FP16) | `H100:4` or `A100-80GB:4` | Multi-GPU |
+| LoRA fine-tune 7B | `A100-40GB` | Single GPU |
+| Full fine-tune 7B | `A100-80GB:4` | Multi-GPU |
+| Full fine-tune 70B | `H100:8` or multi-node | Multi-GPU/node |
+| Batch inference | `L40S` or `A100` | `.map()` fan-out |
+| Embedding generation | `T4` or `L4` | `.map()` fan-out |
+
+## Core API Quick Reference
+
+### Key Classes
+
+| Class | Purpose | Key Methods |
+|-------|---------|-------------|
+| `modal.App` | Container for functions/resources | `.function()`, `.cls()`, `.local_entrypoint()` |
+| `modal.Image` | Container image definition | `.debian_slim()`, `.uv_pip_install()`, `.pip_install()`, `.from_registry()`, `.add_local_dir()`, `.run_commands()`, `.env()` |
+| `modal.Volume` | Persistent distributed filesystem (2.5 GB/s) | `.from_name()`, `.commit()`, `.reload()` |
+| `modal.Secret` | Secure credential injection | `.from_name()`, `.from_dict()`, `.from_dotenv()` |
+| `modal.Dict` | Distributed key-value store | `.from_name()`, `.put()`, `.get()`, `.pop()` |
+| `modal.Queue` | Distributed FIFO queue | `.from_name()`, `.put()`, `.get()` |
+| `modal.Sandbox` | Isolated code execution container | `.create()`, `.exec()`, `.terminate()`, snapshot support |
+| `modal.Cls` | Class-based serverless functions | Used via `@app.cls()` decorator |
+| `modal.Function` | Serverless function handle | `.remote()`, `.local()`, `.map()`, `.starmap()`, `.lookup()` |
+| `modal.CloudBucketMount` | Mount S3/GCS buckets as filesystem | Direct bucket access |
+| `modal.Tunnel` | Network tunnel to containers | SSH, HTTP access |
+| `modal.Proxy` | Network proxy (beta) | Custom networking |
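
For orientation on the `modal.Dict` and `modal.Queue` rows in the table above, here is a minimal illustrative sketch of the named-object API. It is not part of the packaged skill; names are hypothetical and it assumes the standard Modal client methods listed in the table:

```python
import modal

app = modal.App("shared-state-demo")

# Named, persistent objects shared across containers (assumed standard Modal API)
results = modal.Dict.from_name("run-results", create_if_missing=True)
work = modal.Queue.from_name("work-items", create_if_missing=True)

@app.function()
def producer(n: int):
    for i in range(n):
        work.put({"item_id": i})          # enqueue work for other containers

@app.function()
def consumer():
    item = work.get()                     # blocks until an item is available
    results.put(item["item_id"], "done")  # record completion in the key-value store
    return item["item_id"]
```
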
+
+### Key Decorators
+
+| Decorator | Purpose |
 |-----------|---------|
-| `
-| `
-| `
-| `
-| `
-| `
-
-
-
-|
-
-|
-
-
-
-
-
-
-
-
-
-
-
-| `A10G` | 24GB | Training/inference, 3.3x faster than T4 |
-| `L40S` | 48GB | Recommended for inference (best cost/perf) |
-| `A100-40GB` | 40GB | Large model training |
-| `A100-80GB` | 80GB | Very large models |
-| `H100` | 80GB | Fastest, FP8 + Transformer Engine |
-| `H200` | 141GB | Auto-upgrade from H100, 4.8TB/s bandwidth |
-| `B200` | Latest | Blackwell architecture |
-
-### GPU specification patterns
+| `@app.function()` | Define a serverless function |
+| `@app.cls()` | Define a serverless class |
+| `@modal.method()` | Mark class method as remotely callable |
+| `@modal.enter()` | Run once at container startup (model loading) |
+| `@modal.exit()` | Run at container shutdown (cleanup) |
+| `@modal.parameter()` | Typed class parameter |
+| `@modal.fastapi_endpoint()` | Expose function as FastAPI endpoint |
+| `@modal.asgi_app()` | Expose full ASGI app (FastAPI/Starlette) |
+| `@modal.wsgi_app()` | Expose WSGI app (Django/Flask) |
+| `@modal.web_server(port)` | Expose arbitrary HTTP server |
+| `@modal.batched()` | Dynamic input batching |
+| `@modal.concurrent()` | Input concurrency control |
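
To make the web-endpoint decorators above concrete, here is a minimal illustrative sketch (not taken from the packaged skill) of a plain function exposed as an HTTP endpoint. It assumes FastAPI is installed in the image and that the app is published with `modal deploy`:

```python
import modal

app = modal.App("endpoint-demo")
image = modal.Image.debian_slim(python_version="3.11").pip_install("fastapi[standard]")

@app.function(image=image)
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    # Placeholder logic; a real service would call a loaded model here
    return {"text_length": len(text)}
```
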
+
+### Scheduling
+
+| Type | Syntax |
+|------|--------|
+| Cron | `schedule=modal.Cron("0 0 * * *")` (always UTC) |
+| Periodic | `schedule=modal.Period(hours=1)` |
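
A minimal sketch of the schedule syntax in use, close to the scheduling example the previous skill version carried; schedules only fire once the app is published with `modal deploy`, and cron expressions are evaluated in UTC:

```python
import modal

app = modal.App("scheduled-jobs")

@app.function(schedule=modal.Cron("0 0 * * *"))  # daily at midnight UTC
def nightly_cleanup():
    ...

@app.function(schedule=modal.Period(hours=1))    # every hour
def hourly_sync():
    ...
```
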
+
+## Essential Patterns
+
+### Pattern 1: Model Inference Service
 
 ```python
-
-@app.function(gpu="A100")
-
-# Specific memory variant
-@app.function(gpu="A100-80GB")
-
-# Multiple GPUs (up to 8)
-@app.function(gpu="H100:4")
-
-# GPU with fallbacks
-@app.function(gpu=["H100", "A100", "L40S"])
-
-# Any available GPU
-@app.function(gpu="any")
-```
-
-## Container images
+import modal
 
-
-
-
-    "torch==2.1.0", "transformers==4.36.0", "accelerate"
+app = modal.App("inference")
+image = modal.Image.debian_slim(python_version="3.11").uv_pip_install(
+    "torch", "transformers", "accelerate"
 )
-
-# From CUDA base
-image = modal.Image.from_registry(
-    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
-    add_python="3.11"
-).pip_install("torch", "transformers")
-
-# With system packages
-image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("whisper")
-```
-
-## Persistent storage
-
-```python
 volume = modal.Volume.from_name("model-cache", create_if_missing=True)
 
-@app.
-
-
-
-
-
-
-    volume.commit()  # Persist changes
-    return load_from_path(model_path)
-```
-
-## Web endpoints
-
-### FastAPI endpoint decorator
+@app.cls(gpu="L40S", image=image, volumes={"/models": volume},
+         container_idle_timeout=300)
+class InferenceService:
+    @modal.enter()
+    def load(self):
+        from transformers import pipeline
+        self.pipe = pipeline("text-generation", model="/models/my-model", device=0)
 
-
-
-
-def predict(text: str) -> dict:
-    return {"result": model.predict(text)}
+    @modal.method()
+    def generate(self, prompt: str) -> str:
+        return self.pipe(prompt, max_length=512)[0]["generated_text"]
 ```
 
-###
+### Pattern 2: vLLM Deployment (see Modal example: `llm_inference`)
 
 ```python
-
-web_app = FastAPI()
+import modal
 
-
-
-
+app = modal.App("vllm-server")
+image = modal.Image.debian_slim(python_version="3.11").uv_pip_install("vllm")
+volume = modal.Volume.from_name("model-weights", create_if_missing=True)
 
-@app.function(
+@app.function(gpu="H100", image=image, volumes={"/models": volume},
+              container_idle_timeout=600, timeout=3600)
 @modal.asgi_app()
-def
-
+def serve():
+    # See Modal example `llm_inference` for the full implementation
+    ...
 ```
 
-###
-
-| Decorator | Use Case |
-|-----------|----------|
-| `@modal.fastapi_endpoint()` | Simple function → API |
-| `@modal.asgi_app()` | Full FastAPI/Starlette apps |
-| `@modal.wsgi_app()` | Django/Flask apps |
-| `@modal.web_server(port)` | Arbitrary HTTP servers |
-
-## Dynamic batching
+### Pattern 3: Batch Processing with Fan-Out
 
 ```python
-@app.function()
-
-
-    # Inputs automatically batched
-    return model.batch_predict(inputs)
-```
-
-## Secrets management
-
-```bash
-# Create secret
-modal secret create huggingface HF_TOKEN=hf_xxx
-```
-
-```python
-@app.function(secrets=[modal.Secret.from_name("huggingface")])
-def download_model():
-    import os
-    token = os.environ["HF_TOKEN"]
-```
-
-## Scheduling
-
-```python
-@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily midnight
-def daily_job():
-    pass
+@app.function(gpu="T4")
+def process_item(item):
+    return expensive_computation(item)
 
-@app.
-def
-
+@app.local_entrypoint()
+def main():
+    items = list(range(10000))
+    results = list(process_item.map(items))  # Fan out to parallel GPUs
 ```
 
-
-
-### Cold start mitigation
+### Pattern 4: Container Image (use uv for speed)
 
 ```python
-
-
-
+# Prefer uv_pip_install — 10-50x faster than pip_install
+image = (
+    modal.Image.debian_slim(python_version="3.11")
+    .uv_pip_install("torch", "transformers", "accelerate", "vllm")
+    .add_local_dir("./src", "/app/src")  # Add local code
+    .env({"HF_HOME": "/models"})  # Set env vars
 )
-def inference():
-    pass
-```
-
-### Model loading best practices
-
-```python
-@app.cls(gpu="A100")
-class Model:
-    @modal.enter()  # Run once at container start
-    def load(self):
-        self.model = load_model()  # Load during warm-up
-
-    @modal.method()
-    def predict(self, x):
-        return self.model(x)
 ```
 
-
+### Pattern 5: Sandbox (Code Execution)
 
 ```python
-
-
-
-
-@app.function()
-def run_parallel():
-    items = list(range(1000))
-    # Fan out to parallel containers
-    results = list(process_item.map(items))
-    return results
+sandbox = modal.Sandbox.create(app=app, image=image, gpu="T4", timeout=300)
+process = sandbox.exec("python", "-c", "print('Hello from sandbox')")
+print(process.stdout.read())
+sandbox.terminate()
 ```
 
-##
+## Example Catalog (Quick Lookup)
+
+Modal's official example library contains production-ready implementations. Find the right example for your task below, then refer to [Examples Catalog](references/examples-catalog.md) for expanded descriptions and implementation notes.
+
+### LLM Inference & Serving
+
+| Example | Description | Key Features |
+|---------|-------------|--------------|
+| `llm_inference` | Deploy OpenAI-compatible LLM service | vLLM, H100, streaming, OpenAI API |
+| `very_large_models` | Deploy really big LLMs (DeepSeek V3, Kimi-K2) | SGLang, multi-GPU (H200:4-8), 100B+ params |
+| `ministral3_inference` | 10x cold start reduction with snapshots | Memory snapshots, fast startup |
+| `vllm_throughput` | Optimize tokens/sec batch processing | vLLM, ~30K input tok/s per H100 |
+| `sglang_low_latency` | Low-latency inference with SGLang | SGLang, speculative decoding, EAGLE-3 |
+| `llama_cpp` | Run GGUF models with llama.cpp | CPU/GPU inference, quantized models |
+| `trtllm_latency` | Low-latency with TensorRT-LLM | TensorRT optimization |
+| `trtllm_throughput` | High-throughput with TensorRT-LLM | Batch TensorRT inference |
+
+### Training & Fine-Tuning
+
+| Example | Description | Key Features |
+|---------|-------------|--------------|
+| `grpo_verl` | GRPO math training with verl | RL training, math reasoning |
+| `grpo_trl` | GRPO coding training with TRL | RL training, code generation |
+| `unsloth_finetune` | Efficient fine-tuning with Unsloth | LoRA, 2x speed, memory efficient |
+| `hp_sweep_gpt` | Train SLM with hyperparameter search | Grid search, early stopping |
+| `long-training` | Long, resumable training jobs | Checkpointing, Volume, resume |
+| `llm-finetuning` | Full LLM fine-tuning pipeline | End-to-end training |
+| `flan_t5_finetune` | Fine-tune Flan-T5 | Seq2seq fine-tuning |
+| `diffusers_lora_finetune` | Fine-tune Flux with LoRA | Image generation LoRA |
+
+### Multimodal & Vision
+
+| Example | Description | Key Features |
+|---------|-------------|--------------|
+| `flux` | Serve diffusion models with torch.compile | Image generation, compilation |
+| `text_to_image` | Stable Diffusion CLI/API/UI | Text-to-image, Gradio |
+| `image_to_image` | Edit images with Flux Kontext | Image-to-image |
+| `image_to_video` | Bring images to life with LTX-Video | Video generation |
+| `ltx` | Generate video with LTX-Video | Text-to-video |
+| `finetune_yolo` | Fine-tune & serve YOLO | Object detection |
+| `segment_anything` | Segment Anything Model | Image segmentation |
+| `comfyapp` | Run Flux on ComfyUI as API | ComfyUI, workflow API |
+| `blender_video` | 3D render farm with Blender | 3D rendering, parallelism |
+
+### Audio & Speech
+
+| Example | Description | Key Features |
+|---------|-------------|--------------|
+| `llm-voice-chat` | Voice chat with LLMs | Real-time voice, WebSocket |
+| `streaming_kyutai_stt` | Transcribe speech with Kyutai STT | Streaming STT, low latency |
+| `music-video-gen` | Star in custom music videos | Multi-model pipeline |
+| `generate_music` | Make music with ACE-Step | Music generation |
+| `chatterbox_tts` | Generate speech with Chatterbox | TTS |
+| `batched_whisper` | High-throughput Whisper transcription | Batch ASR, Whisper |
+| `fine_tune_asr` | Fine-tune Whisper for new words | ASR fine-tuning |
+
+### Sandboxes & Code Execution
+
+| Example | Description | Key Features |
+|---------|-------------|--------------|
+| `agent` | Sandbox a LangGraph agent's code | LangGraph, secure GPU sandbox |
+| `opencode_server` | Run background coding agent (OpenCode) | Coding agent, sandbox |
+| `modal-vibe` | Deploy vibe coding at scale | React + LLM + Sandboxes |
+| `safe_code_execution` | Run Node.js, Ruby, and more in sandbox | Multi-language, sandbox |
+| `simple_code_interpreter` | Stateful code interpreter | Jupyter-like, sandbox |
+| `jupyter_sandbox` | Sandboxed Jupyter notebook | Jupyter, sandbox |
+| `anthropic_computer_use` | Control computer with LLM | Computer use, sandbox |
+
+### RAG & Embeddings
+
+| Example | Description | Key Features |
+|---------|-------------|--------------|
+| `chat_with_pdf_vision` | RAG Chat with PDFs | PDF Q&A, vision |
+| `amazon_embeddings` | Embed millions of docs with TEI | High-throughput embeddings |
+| `mongodb-search` | Satellite images to vectors + MongoDB | Image embeddings, geo search |
+| `potus_speech_qanda` | RAG Q&A chatbot with OpenAI | RAG, OpenAI |
+
+### Web Apps & Endpoints
+
+| Example | Description | Key Features |
+|---------|-------------|--------------|
+| `basic_web` | Serving web endpoints | FastAPI, ASGI |
+| `serve_streamlit` | Deploy Streamlit apps | Streamlit |
+| `mcp_server_stateless` | Deploy stateless MCP with FastMCP | MCP, tool serving |
+| `webrtc_yolo` | Serverless WebRTC with YOLO | WebRTC, real-time |
+| `fastrtc_flip_webcam` | WebRTC quickstart with FastRTC | FastRTC |
+| `webscraper` | Simple web scraper | Scraping, parallelism |
+
+### Data & Infrastructure
+
+| Example | Description | Key Features |
+|---------|-------------|--------------|
+| `s3_bucket_mount` | Parallel Parquet processing on S3 | CloudBucketMount, S3 |
+| `cloud_bucket_mount_loras` | LoRA playground with S3 + Gradio | LoRA management, S3 |
+| `dbt_duckdb` | Data warehouse with DuckDB + DBT | Analytics, data warehouse |
+| `doc_ocr_jobs` | Document OCR job queue | Job queue, OCR |
+| `doc_ocr_webapp` | Document OCR web app | Web app, OCR |
+| `hackernews_alerts` | Hacker News Slackbot | Scheduled jobs, Slack |
+| `discord_bot` | Deploy a Discord bot | Discord, bot |
+| `db_to_sheet` | Sync DB to Google Sheets | Google Sheets, ETL |
+| `cron_datasette` | Publish data with SQLite + Datasette | Data exploration |
+| `algolia_indexer` | Build docsearch with Algolia | Documentation search |
+
+### Computational Biology
+
+| Example | Description | Key Features |
+|---------|-------------|--------------|
+| `chai1` | Fold proteins with Chai-1 | Protein folding |
+| `boltz_predict` | Fold proteins with Boltz-2 | Protein structure |
+| `esm3` | ESM3 protein model | Protein language model |
+
+## Common Configuration Reference
 
 ```python
 @app.function(
-    gpu="A100",
-    memory=32768,
-    cpu=4,
-    timeout=3600,
-    container_idle_timeout=120
-    retries=3,
-    concurrency_limit=10,
+    gpu="A100",                         # GPU type (see selection guide)
+    memory=32768,                       # RAM in MB
+    cpu=4,                              # CPU cores
+    timeout=3600,                       # Max execution time (seconds)
+    container_idle_timeout=120,         # Keep container warm (seconds)
+    retries=modal.Retries(max_retries=3, backoff_coefficient=2.0),
+    concurrency_limit=10,               # Max concurrent containers
+    allow_concurrent_inputs=20,         # Requests per container
+    keep_warm=1,                        # Min warm containers (costs money)
+    volumes={"/data": volume},          # Mount volumes
+    secrets=[modal.Secret.from_name("my-secret")],
+    image=image,                        # Custom container image
+    schedule=modal.Cron("0 0 * * *"),   # Cron schedule (UTC)
 )
 def my_function():
     pass
 ```
 
-##
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-- **[Advanced Usage](references/advanced-usage.md)** - Multi-GPU, distributed training, cost optimization
-- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions
-
-## Resources
-
-- **Documentation**: https://modal.com/docs
-- **Examples**: https://github.com/modal-labs/modal-examples
-- **Pricing**: https://modal.com/pricing
-- **Discord**: https://discord.gg/modal
+## Common Issues Quick-Fix
+
+| Issue | Fix |
+|-------|-----|
+| Cold start slow | Use `@modal.enter()` for model loading, increase `container_idle_timeout`, use memory snapshots |
+| GPU OOM | Use larger GPU, enable gradient checkpointing, use mixed precision (bf16) |
+| Image build fails | Pin versions, use `uv_pip_install`, use multi-stage builds |
+| Timeout errors | Increase `timeout`, add checkpointing for long jobs |
+| Volume changes lost | Call `volume.commit()` after writes |
+| Stale volume data | Call `volume.reload()` before reads |
+| Cron not firing | Cron is always UTC, must `modal deploy` (not `modal run`) |
+| 502 on endpoint | Increase timeout, check memory, use streaming for long responses |
+| Credentials fail | Verify `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` env vars are set |
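
The two Volume rows above are the easiest to trip over, so here is a minimal illustrative sketch (not from the packaged skill) of the commit/reload discipline, assuming the standard `modal.Volume` API:

```python
import modal

app = modal.App("volume-demo")
volume = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(volumes={"/cache": volume})
def write_artifact(data: bytes):
    with open("/cache/artifact.bin", "wb") as f:
        f.write(data)
    volume.commit()   # persist writes so other containers can see them

@app.function(volumes={"/cache": volume})
def read_artifact() -> int:
    volume.reload()   # pick up the latest committed state before reading
    with open("/cache/artifact.bin", "rb") as f:
        return len(f.read())
```
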
+
+## Implementation Workflow
+
+When implementing a Modal workload:
+1. Check the example catalog above to find the closest matching example
+2. Load the [Examples Catalog](references/examples-catalog.md) for expanded implementation notes
+3. Refer to Modal's docs at https://modal.com/docs/examples for full source code
+4. Adapt for your use case using synsc credentials (`MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET`)
+5. After job completes, report usage via `SynSci.reportUsage()` with `service="modal"`