rapidfireai 0.10.2rc5__py3-none-any.whl → 0.11.1rc1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of rapidfireai might be problematic.

Files changed (36)
  1. rapidfireai/automl/grid_search.py +4 -5
  2. rapidfireai/automl/model_config.py +41 -37
  3. rapidfireai/automl/random_search.py +21 -33
  4. rapidfireai/backend/controller.py +80 -161
  5. rapidfireai/backend/worker.py +26 -8
  6. rapidfireai/cli.py +171 -132
  7. rapidfireai/db/rf_db.py +1 -1
  8. rapidfireai/db/tables.sql +1 -1
  9. rapidfireai/dispatcher/dispatcher.py +3 -1
  10. rapidfireai/dispatcher/gunicorn.conf.py +1 -1
  11. rapidfireai/experiment.py +86 -7
  12. rapidfireai/frontend/build/asset-manifest.json +3 -3
  13. rapidfireai/frontend/build/index.html +1 -1
  14. rapidfireai/frontend/build/static/js/{main.1bf27639.js → main.58393d31.js} +3 -3
  15. rapidfireai/frontend/build/static/js/{main.1bf27639.js.map → main.58393d31.js.map} +1 -1
  16. rapidfireai/frontend/proxy_middleware.py +1 -1
  17. rapidfireai/ml/callbacks.py +85 -59
  18. rapidfireai/ml/trainer.py +42 -86
  19. rapidfireai/start.sh +117 -34
  20. rapidfireai/utils/constants.py +22 -1
  21. rapidfireai/utils/experiment_utils.py +87 -43
  22. rapidfireai/utils/interactive_controller.py +473 -0
  23. rapidfireai/utils/logging.py +1 -2
  24. rapidfireai/utils/metric_logger.py +346 -0
  25. rapidfireai/utils/mlflow_manager.py +0 -1
  26. rapidfireai/utils/ping.py +4 -2
  27. rapidfireai/utils/worker_manager.py +16 -6
  28. rapidfireai/version.py +2 -2
  29. {rapidfireai-0.10.2rc5.dist-info → rapidfireai-0.11.1rc1.dist-info}/METADATA +7 -4
  30. {rapidfireai-0.10.2rc5.dist-info → rapidfireai-0.11.1rc1.dist-info}/RECORD +36 -33
  31. tutorial_notebooks/rf-colab-tensorboard-tutorial.ipynb +314 -0
  32. /rapidfireai/frontend/build/static/js/{main.1bf27639.js.LICENSE.txt → main.58393d31.js.LICENSE.txt} +0 -0
  33. {rapidfireai-0.10.2rc5.dist-info → rapidfireai-0.11.1rc1.dist-info}/WHEEL +0 -0
  34. {rapidfireai-0.10.2rc5.dist-info → rapidfireai-0.11.1rc1.dist-info}/entry_points.txt +0 -0
  35. {rapidfireai-0.10.2rc5.dist-info → rapidfireai-0.11.1rc1.dist-info}/licenses/LICENSE +0 -0
  36. {rapidfireai-0.10.2rc5.dist-info → rapidfireai-0.11.1rc1.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,314 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# RapidFire AI with TensorBoard in Google Colab\n",
+ "\n",
+ "This tutorial demonstrates how to use RapidFire AI with TensorBoard for real-time metrics visualization in Google Colab.\n",
+ "\n",
+ "## Why TensorBoard in Colab?\n",
+ "\n",
+ "- **Real-time visualization**: View training metrics as they happen\n",
+ "- **No frontend loading delay**: TensorBoard loads instantly in Colab\n",
+ "- **Native Colab support**: TensorBoard works natively with `%tensorboard` magic\n",
+ "- **Live updates**: Metrics update every 30 seconds while the training cell is blocked\n",
+ "\n",
+ "## Setup\n",
+ "\n",
+ "First, let's install RapidFire AI and load the TensorBoard extension:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install RapidFire AI\n",
+ "!pip install rapidfireai\n",
+ "\n",
+ "# Load TensorBoard extension\n",
+ "%load_ext tensorboard"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Configure RapidFire to Use TensorBoard\n",
+ "\n",
+ "We'll set environment variables to tell RapidFire to use TensorBoard instead of MLflow:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "# Configure RapidFire to use TensorBoard\n",
+ "os.environ['RF_TRACKING_BACKEND'] = 'tensorboard' # Options: 'mlflow', 'tensorboard', 'both'\n",
+ "# The TensorBoard log directory will be auto-created in the experiment path"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": "## Start RapidFire Services in Colab Mode\n\n**IMPORTANT**: RapidFire requires the dispatcher service to manage experiment state. Open the Colab terminal (Tools > Command palette > Terminal) and run:\n\n```bash\nexport RF_TRACKING_BACKEND=tensorboard\nrapidfireai start --colab\n```\n\nThe `--colab` flag will:\n- ✅ Start the dispatcher service (required for experiment state management)\n- ⊗ Skip the frontend server (using TensorBoard instead)\n- ⊗ Skip MLflow when using TensorBoard-only tracking (conditional)\n\nYou should see output like:\n```\n📦 RapidFire AI Initializing...\n✅ [1/1] Dispatcher server started\n🚀 RapidFire running in Colab mode!\n📊 Use TensorBoard for metrics visualization:\n %tensorboard --logdir ~/experiments/{experiment_name}/tensorboard_logs\n```\n\n**Note**: If you want to use both TensorBoard and MLflow, set `RF_TRACKING_BACKEND=both` and the MLflow service will also start.\n\nLeave this terminal running while you work in your notebook!",
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Import RapidFire Components"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from rapidfireai import Experiment\n",
+ "from rapidfireai.automl import List, RFGridSearch, RFModelConfig, RFLoraConfig, RFSFTConfig"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Load Dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "from datasets import load_dataset\n\ndataset = load_dataset(\"bitext/Bitext-customer-support-llm-chatbot-training-dataset\")\n\n# REDUCED dataset for memory constraints in Colab\ntrain_dataset = dataset[\"train\"].select(range(64)) # Reduced from 128\neval_dataset = dataset[\"train\"].select(range(64, 74)) # 10 held-out examples, disjoint from train\ntrain_dataset = train_dataset.shuffle(seed=42)\neval_dataset = eval_dataset.shuffle(seed=42)"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Define Data Processing Function\n\nWe'll format the data as Q&A pairs for GPT-2:"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "def sample_formatting_function(example):\n \"\"\"Format the dataset for GPT-2 while preserving original fields\"\"\"\n return {\n \"text\": f\"Question: {example['instruction']}\\nAnswer: {example['response']}\",\n \"instruction\": example['instruction'], # Keep original\n \"response\": example['response'] # Keep original\n }\n\n# Apply formatting to datasets\neval_dataset = eval_dataset.map(sample_formatting_function)\ntrain_dataset = train_dataset.map(sample_formatting_function)"
+ },
+ {
+ "cell_type": "markdown",
+ "source": "## Define Metrics Function\n\nWe'll use a lightweight metrics computation with just ROUGE-L to save memory:",
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "source": "def sample_compute_metrics(eval_preds):\n \"\"\"Lightweight metrics computation\"\"\"\n predictions, labels = eval_preds\n\n try:\n import evaluate\n\n # Only compute ROUGE-L (skip BLEU to save memory)\n rouge = evaluate.load(\"rouge\")\n rouge_output = rouge.compute(\n predictions=predictions,\n references=labels,\n use_stemmer=True,\n rouge_types=[\"rougeL\"] # Only compute rougeL\n )\n\n return {\n \"rougeL\": round(rouge_output[\"rougeL\"], 4),\n }\n except Exception as e:\n # Fallback if metrics fail\n print(f\"Metrics computation failed: {e}\")\n return {}",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Initialize Experiment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create experiment with unique name\n",
+ "experiment = Experiment(experiment_name=\"tensorboard-demo\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Get TensorBoard Log Directory\n",
+ "\n",
+ "The TensorBoard logs are stored in the experiment directory. Let's get the path:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Get experiment path\nfrom rapidfireai.db.rf_db import RfDb\n\ndb = RfDb()\nexperiment_name = \"tensorboard-demo\"\nexperiment_path = db.get_experiments_path(experiment_name)\ntensorboard_log_dir = f\"{experiment_path}/{experiment_name}/tensorboard_logs\"\n\nprint(f\"TensorBoard logs will be saved to: {tensorboard_log_dir}\")"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Start TensorBoard\n",
+ "\n",
+ "**IMPORTANT**: Start TensorBoard BEFORE running training, so you can watch metrics update in real-time!"
+ ]
+ },
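+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Start TensorBoard now so metrics stream in while training runs\n# (same invocation as the \"View TensorBoard Logs\" cell at the end of this notebook)\n%tensorboard --logdir {tensorboard_log_dir}"
+ },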
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Define Model Configuration\n\nWe'll use GPT-2 (124M parameters), which fits comfortably within Colab's memory constraints:"
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": "# GPT-2 specific LoRA configs - different module names!\npeft_configs_lite = List([\n RFLoraConfig(\n r=8,\n lora_alpha=16,\n lora_dropout=0.1,\n target_modules=[\"c_attn\"], # GPT-2 combines Q,K,V in c_attn\n bias=\"none\"\n ),\n RFLoraConfig(\n r=32,\n lora_alpha=64,\n lora_dropout=0.1,\n target_modules=[\"c_attn\", \"c_proj\"], # c_attn (QKV) + c_proj (output)\n bias=\"none\"\n )\n])\n\n# 2 configs with GPT-2 (124M params)\nconfig_set_lite = List([\n RFModelConfig(\n model_name=\"gpt2\", # Only 124M params\n peft_config=peft_configs_lite,\n training_args=RFSFTConfig(\n learning_rate=5e-4, # Lower learning rate for GPT-2 stability\n lr_scheduler_type=\"linear\",\n per_device_train_batch_size=2, # Reduced for memory\n per_device_eval_batch_size=2,\n max_steps=128,\n gradient_accumulation_steps=2, # Effective batch size = 4\n logging_steps=2,\n eval_strategy=\"steps\",\n eval_steps=4,\n fp16=True,\n gradient_checkpointing=True, # Save memory\n report_to=\"none\", # Disables wandb\n ),\n model_type=\"causal_lm\",\n model_kwargs={\n \"device_map\": \"auto\",\n \"torch_dtype\": \"float16\", # Explicit fp16\n \"use_cache\": False\n },\n formatting_func=sample_formatting_function,\n compute_metrics=sample_compute_metrics,\n generation_config={\n \"max_new_tokens\": 128, # Reduced from 256\n \"temperature\": 0.7,\n \"top_p\": 0.9,\n \"top_k\": 40,\n \"repetition_penalty\": 1.1,\n \"pad_token_id\": 50256, # GPT-2's EOS token\n }\n ),\n RFModelConfig(\n model_name=\"gpt2\",\n peft_config=peft_configs_lite,\n training_args=RFSFTConfig(\n learning_rate=2e-4, # Even more conservative\n lr_scheduler_type=\"cosine\", # Try cosine schedule\n per_device_train_batch_size=2,\n per_device_eval_batch_size=2,\n max_steps=128,\n gradient_accumulation_steps=2,\n logging_steps=2,\n eval_strategy=\"steps\",\n eval_steps=4,\n fp16=True,\n gradient_checkpointing=True,\n report_to=\"none\", # Disables wandb\n warmup_steps=10, # Add warmup for stability\n ),\n model_type=\"causal_lm\",\n model_kwargs={\n \"device_map\": \"auto\",\n \"torch_dtype\": \"float16\",\n \"use_cache\": False\n },\n formatting_func=sample_formatting_function,\n compute_metrics=sample_compute_metrics,\n generation_config={\n \"max_new_tokens\": 128,\n \"temperature\": 0.7,\n \"top_p\": 0.9,\n \"top_k\": 40,\n \"repetition_penalty\": 1.1,\n \"pad_token_id\": 50256,\n }\n )\n])",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": "def sample_create_model(model_config):\n \"\"\"Function to create model object with GPT-2 adjustments\"\"\"\n from transformers import AutoModelForCausalLM, AutoTokenizer\n\n model_name = model_config[\"model_name\"]\n model_type = model_config[\"model_type\"]\n model_kwargs = model_config[\"model_kwargs\"]\n\n if model_type == \"causal_lm\":\n model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)\n else:\n # Default to causal LM\n model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)\n\n tokenizer = AutoTokenizer.from_pretrained(model_name)\n\n # GPT-2 specific: Set pad token (GPT-2 doesn't have one by default)\n if \"gpt2\" in model_name.lower():\n tokenizer.pad_token = tokenizer.eos_token\n tokenizer.padding_side = \"left\" # GPT-2 works better with left padding\n model.config.pad_token_id = model.config.eos_token_id\n\n return (model, tokenizer)",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": "# Simple grid search across all config combinations = 4 total (2 LoRA configs × 2 training configs)\nconfig_group = RFGridSearch(\n configs=config_set_lite,\n trainer_type=\"SFT\"\n)",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": "## Interactive Run Controller\n\nRapidFire provides an Interactive Controller that lets you monitor and manage training runs in real-time from your notebook:\n\n- ▶️ **Resume**: Resume a stopped run\n- ⏹️ **Stop**: Gracefully stop a running experiment\n- 🗑️ **Delete**: Remove a run from the database\n- 📋 **Clone**: Create a new run with modified hyperparameters (with optional warm start)\n- 🔄 **Refresh**: Update run status and metrics\n\nThe controller uses ipywidgets and is compatible with both Colab (ipywidgets 7.x) and Jupyter (ipywidgets 8.x). A hypothetical launch sketch is shown in the next cell.",
+ "metadata": {}
+ },
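+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Hypothetical launch sketch (commented out). This release ships\n# rapidfireai/utils/interactive_controller.py, but the class and method\n# names below are assumptions, not a confirmed API -- check that module's\n# docstrings for the real entry point before uncommenting.\n# from rapidfireai.utils.interactive_controller import InteractiveController\n# controller = InteractiveController(experiment_name=\"tensorboard-demo\")\n# controller.display()"
+ },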
+ {
+ "cell_type": "markdown",
+ "source": "## Run Training\n\nNow let's start training! The metrics will appear in TensorBoard above in real-time:",
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Launch training\nexperiment.run_fit(\n config_group, \n sample_create_model, \n train_dataset, \n eval_dataset, \n num_chunks=4, # 4 chunks for parallel execution\n seed=42\n)"
+ },
+ {
+ "cell_type": "markdown",
+ "source": "## View TensorBoard Logs\n\nAfter training completes, you can view the full logs in TensorBoard:",
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# View final logs\n%tensorboard --logdir {tensorboard_log_dir}"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## End Experiment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# End the experiment and release resources\nexperiment.end()"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Using Both MLflow and TensorBoard\n",
+ "\n",
+ "You can also log to both backends simultaneously by setting:\n",
+ "\n",
+ "```python\n",
+ "os.environ['RF_TRACKING_BACKEND'] = 'both'\n",
+ "```\n",
+ "\n",
+ "This gives you:\n",
+ "- **TensorBoard**: Real-time visualization during training\n",
+ "- **MLflow**: Experiment comparison and model registry\n",
+ "\n",
+ "## Tips for Colab + TensorBoard\n",
+ "\n",
+ "1. **Start TensorBoard first**: Always start TensorBoard before training\n",
+ "2. **Frequent logging**: Set `logging_steps` to a small value (e.g., 2-5) for responsive updates\n",
+ "3. **Refresh rate**: TensorBoard polls logs every 30 seconds in Colab\n",
+ "4. **Multiple experiments**: Use different experiment names for different runs\n",
+ "5. **Clean logs**: Delete old logs with `!rm -rf {tensorboard_log_dir}` to start fresh\n",
+ "\n",
+ "## Comparison: TensorBoard vs MLflow in Colab\n",
+ "\n",
+ "| Feature | TensorBoard | MLflow |\n",
+ "|---------|-------------|--------|\n",
+ "| Real-time updates | ✅ Yes (30s polling) | ❌ No (frontend load time) |\n",
+ "| Colab native | ✅ %tensorboard magic | ❌ Requires tunneling |\n",
+ "| Load time | ✅ Instant | ❌ 3-5 minutes via tunnel |\n",
+ "| Model registry | ❌ No | ✅ Yes |\n",
+ "| Experiment comparison | ✅ Basic | ✅ Advanced |\n",
+ "\n",
+ "**Recommendation**: Use `'both'` backend to get the best of both worlds!\n",
+ "\n",
+ "## Next Steps\n",
+ "\n",
+ "- Try different model configs and compare in TensorBoard\n",
+ "- Experiment with `'both'` backend for comprehensive tracking\n",
+ "- Check out other RapidFire tutorials for DPO and GRPO training\n",
+ "\n",
+ "Happy training! 🚀"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.12"
+ },
+ "colab": {
+ "provenance": [],
+ "gpuType": "T4"
+ },
+ "accelerator": "GPU"
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+ }