opencode-skills-antigravity 1.0.39 → 1.0.41
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bundled-skills/.antigravity-install-manifest.json +10 -1
- package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
- package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
- package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
- package/bundled-skills/docs/maintainers/security-findings-triage-2026-03-29-refresh.csv +34 -0
- package/bundled-skills/docs/maintainers/security-findings-triage-2026-03-29-refresh.md +2 -0
- package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
- package/bundled-skills/docs/sources/sources.md +2 -2
- package/bundled-skills/docs/users/bundles.md +1 -1
- package/bundled-skills/docs/users/claude-code-skills.md +1 -1
- package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
- package/bundled-skills/docs/users/getting-started.md +1 -1
- package/bundled-skills/docs/users/kiro-integration.md +1 -1
- package/bundled-skills/docs/users/usage.md +4 -4
- package/bundled-skills/docs/users/visual-guide.md +4 -4
- package/bundled-skills/hugging-face-cli/SKILL.md +192 -195
- package/bundled-skills/hugging-face-community-evals/SKILL.md +213 -0
- package/bundled-skills/hugging-face-community-evals/examples/.env.example +3 -0
- package/bundled-skills/hugging-face-community-evals/examples/USAGE_EXAMPLES.md +101 -0
- package/bundled-skills/hugging-face-community-evals/scripts/inspect_eval_uv.py +104 -0
- package/bundled-skills/hugging-face-community-evals/scripts/inspect_vllm_uv.py +306 -0
- package/bundled-skills/hugging-face-community-evals/scripts/lighteval_vllm_uv.py +297 -0
- package/bundled-skills/hugging-face-dataset-viewer/SKILL.md +120 -120
- package/bundled-skills/hugging-face-gradio/SKILL.md +304 -0
- package/bundled-skills/hugging-face-gradio/examples.md +613 -0
- package/bundled-skills/hugging-face-jobs/SKILL.md +25 -18
- package/bundled-skills/hugging-face-jobs/index.html +216 -0
- package/bundled-skills/hugging-face-jobs/references/hardware_guide.md +336 -0
- package/bundled-skills/hugging-face-jobs/references/hub_saving.md +352 -0
- package/bundled-skills/hugging-face-jobs/references/token_usage.md +570 -0
- package/bundled-skills/hugging-face-jobs/references/troubleshooting.md +475 -0
- package/bundled-skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
- package/bundled-skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
- package/bundled-skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
- package/bundled-skills/hugging-face-model-trainer/SKILL.md +11 -12
- package/bundled-skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- package/bundled-skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- package/bundled-skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- package/bundled-skills/hugging-face-model-trainer/references/local_training_macos.md +231 -0
- package/bundled-skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- package/bundled-skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- package/bundled-skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- package/bundled-skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- package/bundled-skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
- package/bundled-skills/hugging-face-model-trainer/references/unsloth.md +313 -0
- package/bundled-skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
- package/bundled-skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
- package/bundled-skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
- package/bundled-skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
- package/bundled-skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
- package/bundled-skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
- package/bundled-skills/hugging-face-model-trainer/scripts/unsloth_sft_example.py +512 -0
- package/bundled-skills/hugging-face-paper-publisher/SKILL.md +11 -4
- package/bundled-skills/hugging-face-paper-publisher/examples/example_usage.md +326 -0
- package/bundled-skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
- package/bundled-skills/hugging-face-paper-publisher/scripts/paper_manager.py +606 -0
- package/bundled-skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
- package/bundled-skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
- package/bundled-skills/hugging-face-paper-publisher/templates/modern.md +319 -0
- package/bundled-skills/hugging-face-paper-publisher/templates/standard.md +201 -0
- package/bundled-skills/hugging-face-papers/SKILL.md +241 -0
- package/bundled-skills/hugging-face-trackio/.claude-plugin/plugin.json +19 -0
- package/bundled-skills/hugging-face-trackio/SKILL.md +117 -0
- package/bundled-skills/hugging-face-trackio/references/alerts.md +196 -0
- package/bundled-skills/hugging-face-trackio/references/logging_metrics.md +206 -0
- package/bundled-skills/hugging-face-trackio/references/retrieving_metrics.md +251 -0
- package/bundled-skills/hugging-face-vision-trainer/SKILL.md +595 -0
- package/bundled-skills/hugging-face-vision-trainer/references/finetune_sam2_trainer.md +254 -0
- package/bundled-skills/hugging-face-vision-trainer/references/hub_saving.md +618 -0
- package/bundled-skills/hugging-face-vision-trainer/references/image_classification_training_notebook.md +279 -0
- package/bundled-skills/hugging-face-vision-trainer/references/object_detection_training_notebook.md +700 -0
- package/bundled-skills/hugging-face-vision-trainer/references/reliability_principles.md +310 -0
- package/bundled-skills/hugging-face-vision-trainer/references/timm_trainer.md +91 -0
- package/bundled-skills/hugging-face-vision-trainer/scripts/dataset_inspector.py +814 -0
- package/bundled-skills/hugging-face-vision-trainer/scripts/estimate_cost.py +217 -0
- package/bundled-skills/hugging-face-vision-trainer/scripts/image_classification_training.py +383 -0
- package/bundled-skills/hugging-face-vision-trainer/scripts/object_detection_training.py +710 -0
- package/bundled-skills/hugging-face-vision-trainer/scripts/sam_segmentation_training.py +382 -0
- package/bundled-skills/jq/SKILL.md +273 -0
- package/bundled-skills/odoo-edi-connector/SKILL.md +32 -10
- package/bundled-skills/odoo-woocommerce-bridge/SKILL.md +9 -5
- package/bundled-skills/tmux/SKILL.md +370 -0
- package/bundled-skills/transformers-js/SKILL.md +639 -0
- package/bundled-skills/transformers-js/references/CACHE.md +339 -0
- package/bundled-skills/transformers-js/references/CONFIGURATION.md +390 -0
- package/bundled-skills/transformers-js/references/EXAMPLES.md +605 -0
- package/bundled-skills/transformers-js/references/MODEL_ARCHITECTURES.md +167 -0
- package/bundled-skills/transformers-js/references/PIPELINE_OPTIONS.md +545 -0
- package/bundled-skills/transformers-js/references/TEXT_GENERATION.md +315 -0
- package/bundled-skills/viboscope/SKILL.md +64 -0
- package/package.json +1 -1

package/bundled-skills/hugging-face-model-trainer/references/hub_saving.md
@@ -0,0 +1,364 @@
# Saving Training Results to Hugging Face Hub

**⚠️ CRITICAL:** Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.

## Why Hub Push is Required

When running on Hugging Face Jobs:
- Environment is temporary
- All files deleted on job completion
- No local disk persistence
- Cannot access results after job ends

**Without Hub push, training is completely wasted.**

## Required Configuration

### 1. Training Configuration

In your SFTConfig or trainer config:

```python
SFTConfig(
    push_to_hub=True,                    # Enable Hub push
    hub_model_id="username/model-name",  # Target repository
)
```

### 2. Job Configuration

When submitting the job:

```python
hf_jobs("uv", {
    "script": "train.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Provide authentication
})
```

**The `$HF_TOKEN` placeholder is automatically replaced with your Hugging Face token.**

## Complete Example

```python
# train.py
# /// script
# dependencies = ["trl"]
# ///

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure with Hub push
config = SFTConfig(
    output_dir="my-model",
    num_train_epochs=3,

    # ✅ CRITICAL: Hub push configuration
    push_to_hub=True,
    hub_model_id="myusername/my-trained-model",

    # Optional: legacy aliases of the hub_* options above
    push_to_hub_model_id="myusername/my-trained-model",
    push_to_hub_organization=None,
    push_to_hub_token=None,  # Uses environment token
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
)

trainer.train()

# ✅ Push final model
trainer.push_to_hub()

print("✅ Model saved to: https://huggingface.co/myusername/my-trained-model")
```

**Submit with authentication:**

```python
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Required!
})
```

## What Gets Saved

When `push_to_hub=True`:

1. **Model weights** - Final trained parameters
2. **Tokenizer** - Associated tokenizer
3. **Configuration** - Model config (config.json)
4. **Training arguments** - Hyperparameters used
5. **Model card** - Auto-generated documentation
6. **Checkpoints** - If `save_strategy="steps"` enabled
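
To confirm what actually landed in the repository after a push, you can list its files with `huggingface_hub` (a minimal sketch; the repo id is the example name used above):

```python
from huggingface_hub import list_repo_files

# Lists everything the Trainer pushed: weights, tokenizer, config, README, checkpoints.
for path in list_repo_files("username/model-name"):
    print(path)
```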

## Checkpoint Saving

Save intermediate checkpoints during training:

```python
SFTConfig(
    output_dir="my-model",
    push_to_hub=True,
    hub_model_id="username/my-model",

    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,       # Save every 100 steps
    save_total_limit=3,   # Keep only last 3 checkpoints
)
```

**Benefits:**
- Resume training if job fails
- Compare checkpoint performance
- Use intermediate models

**Checkpoints are pushed to:** `username/my-model` (same repo)
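
If a job dies mid-run, the pushed checkpoint can be pulled back down and training resumed from it. A minimal sketch, assuming `hub_strategy="checkpoint"` (which mirrors the latest checkpoint to a `last-checkpoint/` folder in the repo) and an already-constructed `trainer`:

```python
from huggingface_hub import snapshot_download

# Fetch only the mirrored checkpoint folder from the model repo.
local_dir = snapshot_download("username/my-model", allow_patterns="last-checkpoint/*")

# Resume training from the downloaded checkpoint.
trainer.train(resume_from_checkpoint=f"{local_dir}/last-checkpoint")
```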

## Authentication Methods

### Method 1: Automatic Token (Recommended)

```python
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
```

Uses your logged-in Hugging Face token automatically.

### Method 2: Explicit Token

```python
"secrets": {"HF_TOKEN": "hf_abc123..."}
```

Provide token explicitly (not recommended for security).

### Method 3: Environment Variable

```python
"env": {"HF_TOKEN": "hf_abc123..."}
```

Pass as regular environment variable (less secure than secrets).

**Always prefer Method 1** for security and convenience.
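
Inside the training script you can fail fast if the token did not reach the job; a small sketch using `huggingface_hub.whoami()` (it raises if the token is missing or invalid):

```python
from huggingface_hub import whoami

# Fails early with a clear error if HF_TOKEN was not injected or is invalid.
user = whoami()
print(f"✅ Authenticated as: {user['name']}")
```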

## Verification Checklist

Before submitting any training job, verify:

- [ ] `push_to_hub=True` in training config
- [ ] `hub_model_id` is specified (format: `username/model-name`)
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
- [ ] Repository name doesn't conflict with existing repos
- [ ] You have write access to the target namespace

## Repository Setup

### Automatic Creation

If the repository doesn't exist, it's created automatically on the first push.

### Manual Creation

Create the repository before training:

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="username/model-name",
    repo_type="model",
    private=False,  # or True for a private repo
)
```

### Repository Naming

**Valid names:**
- `username/my-model`
- `username/model-name`
- `organization/model-name`

**Invalid names:**
- `model-name` (missing username)
- `username/model name` (spaces not allowed)
- `username/MODEL` (uppercase discouraged)

## Troubleshooting

### Error: 401 Unauthorized

**Cause:** HF_TOKEN not provided or invalid

**Solutions:**
1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
2. Check you're logged in: `hf auth whoami`
3. Re-login: `hf auth login`

### Error: 403 Forbidden

**Cause:** No write access to the repository

**Solutions:**
1. Check the repository namespace matches your username
2. Verify you're a member of the organization (if using an org namespace)
3. Check the repository isn't private (if accessing an org repo)

### Error: Repository not found

**Cause:** Repository doesn't exist and auto-creation failed

**Solutions:**
1. Manually create the repository first
2. Check the repository name format
3. Verify the namespace exists

### Error: Push failed during training

**Cause:** Network issues or Hub unavailable

**What to do:**
1. Training continues even if an intermediate push fails
2. Earlier checkpoints may already be on the Hub
3. Re-run the push manually after the job completes

### Issue: Model saved but not visible

**Possible causes:**
1. Repository is private: check https://huggingface.co/username
2. Wrong namespace: verify `hub_model_id` matches your login
3. Push still in progress: wait a few minutes

## Manual Push After Training

If training completes but the push fails, push manually:

```python
from transformers import AutoModel, AutoTokenizer

# Load from local checkpoint
model = AutoModel.from_pretrained("./output_dir")
tokenizer = AutoTokenizer.from_pretrained("./output_dir")

# Push to Hub
model.push_to_hub("username/model-name", token="hf_abc123...")
tokenizer.push_to_hub("username/model-name", token="hf_abc123...")
```

**Note:** Only possible if the job hasn't completed yet (files still exist).
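
If reloading the model classes isn't practical, another option is to upload the checkpoint directory as-is with `huggingface_hub.upload_folder` (a sketch, assuming the job is still running and `./output_dir` holds the trained files):

```python
from huggingface_hub import upload_folder

# Uploads every file in the local checkpoint directory to the model repo.
upload_folder(
    repo_id="username/model-name",
    folder_path="./output_dir",
    repo_type="model",
    commit_message="Manual upload after failed push",
)
```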

## Best Practices

1. **Always enable `push_to_hub=True`**
2. **Use checkpoint saving** for long training runs
3. **Verify Hub push** in logs before job completes
4. **Set appropriate `save_total_limit`** to avoid excessive checkpoints
5. **Use descriptive repo names** (e.g., `qwen-capybara-sft` not `model1`)
6. **Add model card** with training details
7. **Tag models** with relevant tags (e.g., `text-generation`, `fine-tuned`)
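
Items 6 and 7 can be handled after training by editing the auto-generated card in place. A minimal sketch using `huggingface_hub.ModelCard` (the repo id is illustrative):

```python
from huggingface_hub import ModelCard

# Load the card the Trainer generated, add tags, and push the updated card back.
card = ModelCard.load("username/model-name")
card.data.tags = (card.data.tags or []) + ["text-generation", "fine-tuned"]
card.push_to_hub("username/model-name")
```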

## Monitoring Push Progress

Check logs for push progress:

```python
hf_jobs("logs", {"job_id": "your-job-id"})
```

**Look for:**
```
Pushing model to username/model-name...
Upload file pytorch_model.bin: 100%
✅ Model pushed successfully
```

## Example: Full Production Setup

```python
# production_train.py
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0"]
# ///

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
import os

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found in environment!"

# Load dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
print(f"✅ Dataset loaded: {len(dataset)} examples")

# Configure with comprehensive Hub settings
config = SFTConfig(
    output_dir="qwen-capybara-sft",

    # Hub configuration
    push_to_hub=True,
    hub_model_id="myusername/qwen-capybara-sft",
    hub_strategy="checkpoint",  # Push checkpoints

    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,

    # Training settings
    num_train_epochs=3,
    per_device_train_batch_size=4,

    # Logging
    logging_steps=10,
    logging_first_step=True,
)

# Train with LoRA
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)

print("🚀 Starting training...")
trainer.train()

print("💾 Pushing final model to Hub...")
trainer.push_to_hub()

print("✅ Training complete!")
print("Model available at: https://huggingface.co/myusername/qwen-capybara-sft")
```

**Submit:**

```python
hf_jobs("uv", {
    "script": "production_train.py",
    "flavor": "a10g-large",
    "timeout": "6h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

## Key Takeaway

**Without `push_to_hub=True` and `secrets={"HF_TOKEN": "$HF_TOKEN"}`, all training results are permanently lost.**

Always verify both are configured before submitting any training job.

package/bundled-skills/hugging-face-model-trainer/references/local_training_macos.md
@@ -0,0 +1,231 @@
# Local Training on macOS (Apple Silicon)

Run small LoRA fine-tuning jobs locally on Mac for smoke tests and quick iteration before submitting to HF Jobs.

## When to Use Local Mac vs HF Jobs

| Local Mac | HF Jobs / Cloud GPU |
|-----------|---------------------|
| Model ≤3B, text-only | Model 7B+ |
| LoRA/PEFT only | QLoRA 4-bit (CUDA/bitsandbytes) |
| Short context (≤1024) | Long context / full fine-tuning |
| Smoke tests, dataset validation | Production runs, VLMs |

**Typical workflow:** local smoke test → HF Jobs with same config → export/quantize ([gguf_conversion.md](gguf_conversion.md))
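
Once the local run looks sane, the same script can be submitted to HF Jobs. A sketch reusing the job options shown in the other references; the `flavor`, `timeout`, and `env` values are illustrative, and the cloud run additionally needs a PEP 723 `# /// script` dependency header plus Hub push (see hub_saving.md) so results are not lost:

```python
hf_jobs("uv", {
    "script": "train_lora_sft.py",
    "flavor": "a10g-large",
    "timeout": "2h",
    # Same env-var overrides the script already reads locally.
    "env": {"MODEL_ID": "Qwen/Qwen2.5-1.5B-Instruct", "MAX_STEPS": "-1"},
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```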

## Recommended Defaults

| Setting | Value | Notes |
|---------|-------|-------|
| Model size | 0.5B–1.5B first run | Scale up after verifying |
| Max seq length | 512–1024 | Lower = less memory |
| Batch size | 1 | Scale via gradient accumulation |
| Gradient accumulation | 8–16 | Effective batch = 8–16 |
| LoRA rank (r) | 8–16 | alpha = 2×r |
| Dtype | float32 | fp16 causes NaN on MPS; bf16 only on M1 Pro+ and M2/M3/M4 |

### Memory by hardware

| Unified RAM | Max Model Size |
|-------------|----------------|
| 16 GB | ~0.5B–1.5B |
| 32 GB | ~1.5B–3B |
| 64 GB | ~3B (short context) |

## Setup

```bash
xcode-select --install
python3 -m venv .venv && source .venv/bin/activate
pip install -U "torch>=2.2" "transformers>=4.40" "trl>=0.12" "peft>=0.10" \
  datasets accelerate safetensors huggingface_hub
```

Verify MPS:
```bash
python -c "import torch; print(torch.__version__, '| MPS:', torch.backends.mps.is_available())"
```

Optional — configure Accelerate for local Mac (no distributed, no mixed precision, MPS device):
```bash
accelerate config
```

## Training Script

<details>
<summary><strong>train_lora_sft.py</strong></summary>

```python
import os
from dataclasses import dataclass
from typing import Optional
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

set_seed(42)

@dataclass
class Cfg:
    model_id: str = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
    dataset_id: str = os.environ.get("DATASET_ID", "HuggingFaceH4/ultrachat_200k")
    dataset_split: str = os.environ.get("DATASET_SPLIT", "train_sft[:500]")
    data_files: Optional[str] = os.environ.get("DATA_FILES", None)
    text_field: str = os.environ.get("TEXT_FIELD", "")
    messages_field: str = os.environ.get("MESSAGES_FIELD", "messages")
    out_dir: str = os.environ.get("OUT_DIR", "outputs/local-lora")
    max_seq_length: int = int(os.environ.get("MAX_SEQ_LENGTH", "512"))
    max_steps: int = int(os.environ.get("MAX_STEPS", "-1"))

cfg = Cfg()
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(cfg.model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(cfg.model_id, torch_dtype=torch.float32)
model.to(device)
model.config.use_cache = False

if cfg.data_files:
    ds = load_dataset("json", data_files=cfg.data_files, split="train")
else:
    ds = load_dataset(cfg.dataset_id, split=cfg.dataset_split)

def format_example(ex):
    if cfg.text_field and isinstance(ex.get(cfg.text_field), str):
        ex["text"] = ex[cfg.text_field]
        return ex
    msgs = ex.get(cfg.messages_field)
    if isinstance(msgs, list):
        if hasattr(tokenizer, "apply_chat_template"):
            try:
                ex["text"] = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
                return ex
            except Exception:
                pass
        ex["text"] = "\n".join([str(m) for m in msgs])
        return ex
    ex["text"] = str(ex)
    return ex

ds = ds.map(format_example)
ds = ds.remove_columns([c for c in ds.column_names if c != "text"])

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
                  task_type="CAUSAL_LM", target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

sft_kwargs = dict(
    output_dir=cfg.out_dir, per_device_train_batch_size=1, gradient_accumulation_steps=8,
    learning_rate=2e-4, logging_steps=10, save_steps=200, save_total_limit=2,
    gradient_checkpointing=True, report_to="none", fp16=False, bf16=False,
    max_seq_length=cfg.max_seq_length, dataset_text_field="text",
)
if cfg.max_steps > 0:
    sft_kwargs["max_steps"] = cfg.max_steps
else:
    sft_kwargs["num_train_epochs"] = 1

trainer = SFTTrainer(model=model, train_dataset=ds, peft_config=lora,
                     args=SFTConfig(**sft_kwargs), processing_class=tokenizer)
trainer.train()
trainer.save_model(cfg.out_dir)
print(f"✅ Saved to: {cfg.out_dir}")
```

</details>

### Run

```bash
python train_lora_sft.py
```

**Env overrides:**

```bash
MODEL_ID="Qwen/Qwen2.5-1.5B-Instruct" python train_lora_sft.py   # different model
MAX_STEPS=50 python train_lora_sft.py                            # quick 50-step test
DATA_FILES="my_data.jsonl" python train_lora_sft.py              # local JSONL file
PYTORCH_ENABLE_MPS_FALLBACK=1 python train_lora_sft.py           # MPS op fallback to CPU
PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 python train_lora_sft.py    # disable MPS memory limit (use with caution)
```

**Local JSONL format** — chat messages or plain text:
```jsonl
{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]}
```
```jsonl
{"text": "User: Hello\nAssistant: Hi!"}
```
For plain text: `DATA_FILES="file.jsonl" TEXT_FIELD="text" MESSAGES_FIELD="" python train_lora_sft.py`

### Verify Success

- Loss decreases over steps
- `outputs/local-lora/` contains `adapter_config.json` + `*.safetensors`
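
A quick scripted version of this check (a sketch, assuming the default `OUT_DIR`):

```python
from pathlib import Path

# Confirm the LoRA adapter was actually written before moving on to HF Jobs.
out = Path("outputs/local-lora")
assert (out / "adapter_config.json").exists(), "adapter_config.json missing"
assert any(out.glob("*.safetensors")), "no adapter weights (*.safetensors) found"
print("✅ Adapter files present:", sorted(p.name for p in out.iterdir()))
```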

## Quick Evaluation

<details>
<summary><strong>eval_generate.py</strong></summary>

```python
import os, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
ADAPTER = os.environ.get("ADAPTER_DIR", "outputs/local-lora")
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(BASE, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float32)
model.to(device)
model = PeftModel.from_pretrained(model, ADAPTER)

prompt = os.environ.get("PROMPT", "Explain gradient accumulation in 3 bullet points.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

</details>

## Troubleshooting (macOS-Specific)

For general training issues, see [troubleshooting.md](troubleshooting.md).

| Problem | Fix |
|---------|-----|
| MPS unsupported op / crash | `PYTORCH_ENABLE_MPS_FALLBACK=1` |
| OOM / system instability | Reduce `MAX_SEQ_LENGTH`, use smaller model, set `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` (caution) |
| fp16 NaN / loss explosion | Keep `fp16=False` (default), lower learning rate |
| LoRA "module not found" | Print `model.named_modules()` to find correct target names |
| TRL TypeError on args | Check TRL version; script uses `SFTConfig` + `processing_class` (TRL ≥0.12) |
| Intel Mac | No MPS — use HF Jobs instead |

**Common LoRA target modules by architecture:**

| Architecture | target_modules |
|--------------|----------------|
| Llama/Qwen/Mistral | `q_proj`, `k_proj`, `v_proj`, `o_proj` |
| GPT-2/GPT-J | `c_attn`, `c_proj` |
| BLOOM | `query_key_value`, `dense` |
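
To find the right module names for an unfamiliar architecture (the `named_modules()` tip from the troubleshooting table above), a short sketch:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the model on CPU just to inspect layer names; pick the attention/MLP
# projection names printed here as LoRA target_modules.
m = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.float32)
leaf_names = {name.split(".")[-1] for name, module in m.named_modules()
              if isinstance(module, torch.nn.Linear)}
print(sorted(leaf_names))  # e.g. q_proj, k_proj, v_proj, o_proj, ...
```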

## MLX Alternative

[MLX](https://github.com/ml-explore/mlx) offers tighter Apple Silicon integration but has a smaller ecosystem and less mature training APIs. For this skill's workflow (local validation → HF Jobs), PyTorch + MPS is recommended for consistency. See [mlx-lm](https://github.com/ml-explore/mlx-lm) for MLX-based fine-tuning.

## See Also

- [troubleshooting.md](troubleshooting.md) — General TRL troubleshooting
- [hardware_guide.md](hardware_guide.md) — GPU selection for HF Jobs
- [gguf_conversion.md](gguf_conversion.md) — Export for on-device inference
- [training_methods.md](training_methods.md) — SFT, DPO, GRPO overview