vlora-dev 0.2.1__tar.gz → 0.3.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/.gitignore +1 -3
- vlora_dev-0.3.0/CHANGELOG.md +54 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/PKG-INFO +86 -10
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/README.md +84 -8
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/docs/index.md +2 -2
- vlora_dev-0.3.0/examples/qlora_pipeline.py +154 -0
- vlora_dev-0.3.0/icon.png +0 -0
- vlora_dev-0.3.0/logo.png +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/pyproject.toml +2 -2
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/__init__.py +10 -1
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/integrations/huggingface.py +68 -5
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/io.py +11 -4
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/merge.py +2 -2
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/model.py +87 -6
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/ops.py +181 -14
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/subspace.py +398 -38
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_cli.py +57 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_compression.py +61 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_huggingface.py +79 -6
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_incremental.py +44 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_model.py +43 -0
- vlora_dev-0.3.0/tests/test_ops.py +256 -0
- vlora_dev-0.3.0/tests/test_subspace.py +303 -0
- vlora_dev-0.2.1/logo.png +0 -0
- vlora_dev-0.2.1/tests/test_ops.py +0 -121
- vlora_dev-0.2.1/tests/test_subspace.py +0 -146
- vlora_dev-0.2.1/website/.firebase/hosting.ZGlzdA.cache +0 -5
- vlora_dev-0.2.1/website/.firebaserc +0 -15
- vlora_dev-0.2.1/website/astro.config.mjs +0 -11
- vlora_dev-0.2.1/website/firebase.json +0 -37
- vlora_dev-0.2.1/website/package-lock.json +0 -6124
- vlora_dev-0.2.1/website/package.json +0 -17
- vlora_dev-0.2.1/website/public/favicon.png +0 -0
- vlora_dev-0.2.1/website/public/logo.png +0 -0
- vlora_dev-0.2.1/website/public/og-card.png +0 -0
- vlora_dev-0.2.1/website/src/components/Algorithm.astro +0 -80
- vlora_dev-0.2.1/website/src/components/Benchmarks.astro +0 -120
- vlora_dev-0.2.1/website/src/components/CodeExample.astro +0 -56
- vlora_dev-0.2.1/website/src/components/Features.astro +0 -99
- vlora_dev-0.2.1/website/src/components/Footer.astro +0 -27
- vlora_dev-0.2.1/website/src/components/Header.astro +0 -98
- vlora_dev-0.2.1/website/src/components/Hero.astro +0 -98
- vlora_dev-0.2.1/website/src/layouts/BaseLayout.astro +0 -50
- vlora_dev-0.2.1/website/src/pages/index.astro +0 -22
- vlora_dev-0.2.1/website/src/styles/global.css +0 -45
- vlora_dev-0.2.1/website/tsconfig.json +0 -5
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/.github/workflows/ci.yml +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/.github/workflows/release.yml +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/.pre-commit-config.yaml +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/LICENSE +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/docs/api.md +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/docs/guide_ollama.md +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/docs/guide_tgi.md +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/docs/guide_vllm.md +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/docs/launch_post.md +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/docs/migration_from_peft.md +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/docs/quickstart.md +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/examples/axolotl_config.yml +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/examples/basic_pipeline.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/examples/hf_trainer_subspace.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/examples/quickstart.ipynb +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/examples/real_adapters.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/mkdocs.yml +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/_validate.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/analysis.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/cli.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/integrations/__init__.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/pipeline.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/router.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/training.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/__init__.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_analysis.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_backlog.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_io.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_merge.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_pipeline.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_router.py +0 -0
- {vlora_dev-0.2.1 → vlora_dev-0.3.0}/tests/test_training.py +0 -0
vlora_dev-0.3.0/CHANGELOG.md ADDED

@@ -0,0 +1,54 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+Format follows [Keep a Changelog](https://keepachangelog.com/).
+
+## [0.3.0] - 2026-03-30
+
+### Added
+- **NF4 quantization** — 4-bit NormalFloat quantization from QLoRA (Dettmers et al., 2023). `subspace.quantize(method="nf4")` uses 16 quantile levels optimized for normally-distributed weights, with per-block absmax scaling. Lower error than symmetric int4.
+- **Double quantization** — quantize per-block NF4 scales to FP8 via `double_quant=True`, reducing scale overhead from 0.5 to ~0.127 bits/param.
+- **NF4 packed storage** — `subspace.save_quantized()` packs components as uint8 (two 4-bit indices per byte) for ~7x disk savings. `SharedSubspace.load()` auto-detects the format.
+- **QLoRA-aware VLoRAModel** — `compute_dtype` parameter for mixed-precision LoRA computation with quantized base models; `qlora_info` property for base-model introspection.
+- **`full_stack_compression()`** — reports combined base-model quantization + adapter compression savings.
+- **`quantize_loadings` parameter** — optionally quantize per-task loadings (not just components).
+- **`nf4_pack` / `nf4_unpack`** — low-level ops for 4-bit packing to uint8.
+- **Layer shapes stored in metadata** — `reconstruct()` uses stored shapes instead of deriving them from `numel() // rank`, supporting per-layer rank configs.
+- **`__repr__` on core objects** — `SharedSubspace`, `TaskProjection`, and `LoRAWeights` now print useful info.
+- **`adaptive_k` preserved through `absorb()`** — subspaces built with `adaptive_k=True` retain that setting after absorption.
+- QLoRA + vLoRA pipeline example (`examples/qlora_pipeline.py`).
+
+### Fixed
+- **`absorb_incremental` re-projection bug** — existing tasks had their loadings padded or truncated instead of being properly re-projected when the basis rotated. Now reconstructs from the old basis and projects onto the updated basis.
+- **`VLoRACallback` was a no-op** — the HF Trainer callback created an optimizer but never stepped it. It now registers differentiable forward hooks so the Trainer's backward pass produces gradients on loadings, and steps the optimizer in `on_step_end`.
+- **TIES merge normalization** — `n / contributor_count` over-scaled the output when elements were trimmed. Fixed to `1 / contributor_count`.
+- **`__version__` mismatch** — `__init__.py` said 0.1.0 while `pyproject.toml` said 0.2.1.
+- **`check_tensor_health` never called** — imported but unused; now wired up after SVD in `from_adapters`.
+- **Task ID collision** — `absorb()` and `absorb_incremental()` now warn when overwriting an existing task ID.
+- **Filesystem-unsafe task IDs** — `save()` now sanitizes task IDs for filenames (handles `/`, `:`, spaces) and keeps a mapping in metadata for lossless round-trips.
+- **`from_adapters_streaming` missing validation** — now checks `len(task_ids) == len(adapter_paths)`.
+
+### Changed
+- **`gram_schmidt` uses QR factorization** — replaced the O(k^2 * D) inner loop with `torch.linalg.qr` for better performance and numerical stability.
+- **VLoRAModel caches module handles** — `_apply_hooks` no longer scans all `named_modules()` on every task switch.
+- **VLoRAModel inference hooks wrapped in `torch.no_grad()`** — prevents unnecessary autograd tracking.
+- **NF4 quantization uses `torch.bucketize`** — replaced the O(N*16) distance broadcast with binary search, reducing memory from O(N*16) to O(N).
+- **`_LORA_KEY_RE` handles the multi-adapter PEFT format** — supports `base_model.model.{layer}.lora_A.{adapter_name}.weight`.
+- **`save_adapter` no longer hardcodes `CAUSAL_LM`** — the task type is left for PEFT to infer.
+- Repo URL updated to `github.com/vlora-dev/vlora`.
+
+## [0.2.1] - 2026-02-10
+
+Initial public release on PyPI as `vlora-dev`.
+
+### Added
+- `SharedSubspace` — 3-step algorithm: from_adapters, project, absorb
+- `VLoRAModel` — inference wrapper with forward hooks
+- `SubspaceTrainer` — loadings-only training
+- `TaskRouter` — per-input adapter routing
+- `task_arithmetic`, `ties_merge`, `dare_merge` — adapter merging
+- Analysis tools: similarity matrix, clustering, outlier detection
+- CLI with 9 commands
+- HuggingFace Trainer integration via `VLoRACallback`
+- Streaming and incremental subspace construction
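The NF4 entries above are easiest to see in code. Below is a minimal, self-contained sketch of block-wise NF4 quantization with per-block absmax scaling and a `torch.bucketize` lookup, as the changelog describes; it illustrates the technique only, and the function names, block size, and rounded level values are assumptions here, not vlora's actual implementation. The double-quantization figure also checks out: with 64-element blocks, an FP32 scale costs 32/64 = 0.5 bits per parameter, while FP8 scales grouped 256 at a time under one FP32 meta-scale cost 8/64 + 32/(64·256) ≈ 0.127.

```python
import torch

# The 16 NF4 levels (rounded) from the QLoRA paper: quantiles of a standard
# normal, rescaled to [-1, 1], with an exact zero level.
NF4_LEVELS = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0000,
])
# Nearest-level assignment via the midpoints between adjacent levels.
NF4_BOUNDARIES = (NF4_LEVELS[:-1] + NF4_LEVELS[1:]) / 2


def nf4_quantize(x: torch.Tensor, block_size: int = 64):
    """Quantize a flat float tensor to 4-bit indices plus per-block scales."""
    blocks = x.reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    normed = blocks / scales  # every block now lies in [-1, 1]
    # Binary search against 15 boundaries replaces an O(N*16) distance broadcast.
    idx = torch.bucketize(normed.contiguous(), NF4_BOUNDARIES)
    return idx.to(torch.uint8), scales


def nf4_dequantize(idx: torch.Tensor, scales: torch.Tensor, shape):
    return (NF4_LEVELS[idx.long()] * scales).reshape(shape)


x = torch.randn(8, 64)
idx, scales = nf4_quantize(x.flatten())
err = (nf4_dequantize(idx, scales, x.shape) - x).abs().mean()
print(f"mean abs reconstruction error: {err:.4f}")
```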
{vlora_dev-0.2.1 → vlora_dev-0.3.0}/PKG-INFO

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: vlora-dev
-Version: 0.2.1
-Summary:
+Version: 0.3.0
+Summary: Various LoRA adapters. One shared basis. Up to 122x compression at scale.
 Project-URL: Homepage, https://github.com/tveseli/vlora
 Project-URL: Repository, https://github.com/tveseli/vlora
 Author: Tim Veseli

@@ -39,10 +39,10 @@ Description-Content-Type: text/markdown
 </p>
 
 <p align="center">
-  <strong>
+  <strong>Various LoRA adapters. One shared basis.</strong>
 </p>
 
-
+Your adapters share more structure than you think. vLoRA finds the common basis and stores each adapter as a tiny coefficient vector — up to 122× compression at scale. Based on the [Share paper](https://arxiv.org/abs/2602.06043).
 
 ## Install
 

@@ -52,7 +52,7 @@ pip install vlora-dev
 
 Or from source:
 ```bash
-git clone https://github.com/
+git clone https://github.com/vlora-dev/vlora.git
 cd vlora
 pip install -e ".[dev]"
 ```

@@ -137,6 +137,77 @@ output = model(input_ids)
 print(model.available_tasks) # ["task_0", "task_1", ...]
 ```
 
+## QLoRA Support
+
+vLoRA has first-class support for [QLoRA](https://arxiv.org/abs/2305.14314) workflows. QLoRA compresses the **base model** (FP16 → 4-bit NF4), while vLoRA compresses the **adapter space** — these are orthogonal and stack multiplicatively.
+
+### NF4 Quantization
+
+Quantize subspace components using the same NF4 data type from QLoRA — 16 quantile levels optimized for normally-distributed weights:
+
+```python
+# NF4 quantization (better than symmetric int4 for normal-ish weights)
+subspace.quantize(method="nf4")
+
+# With double quantization (quantize the per-block scales too)
+subspace.quantize(method="nf4", double_quant=True)
+
+# Also quantize loadings (effective when loadings are approximately normal)
+subspace.quantize(method="nf4", quantize_loadings=True)
+```
+
+### Packed NF4 Storage
+
+Save the subspace in packed 4-bit format for ~7× disk savings:
+
+```python
+# Save: packs components as uint8 (two 4-bit values per byte)
+subspace.save_quantized("shared_subspace/")
+
+# Load: auto-detects format, dequantizes on the fly
+subspace = SharedSubspace.load("shared_subspace/")
+```
+
+### QLoRA Base Model
+
+`VLoRAModel` works with quantized base models loaded via bitsandbytes:
+
+```python
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+from vlora import VLoRAModel, SharedSubspace
+
+# Load 4-bit base model
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+)
+base_model = AutoModelForCausalLM.from_pretrained("model-name", quantization_config=bnb_config)
+
+# Wrap with vLoRA — compute_dtype ensures LoRA math runs in BF16
+subspace = SharedSubspace.load("shared_subspace/")
+model = VLoRAModel(base_model, subspace, compute_dtype=torch.bfloat16)
+
+print(model.qlora_info)  # {'quantized': True, 'method': 'nf4', ...}
+model.set_task("task_0")
+output = model(input_ids)
+```
+
+### Full-Stack Compression
+
+Report combined savings across base-model quantization and adapter compression:
+
+```python
+stats = subspace.full_stack_compression(
+    base_model_params=7_000_000_000,  # 7B model
+    base_model_bits=16,               # original FP16
+    quantized_bits=4,                 # QLoRA NF4
+)
+# → {'total_compression_ratio': 4.0, 'total_original_bytes': 14.0 GB, ...}
+```
+
+See [`examples/qlora_pipeline.py`](examples/qlora_pipeline.py) for a complete end-to-end example.
+
 ## Training in the Subspace
 
 Train only the loadings vector (k params per layer) instead of full LoRA matrices — 100×+ parameter reduction:

@@ -219,8 +290,10 @@ merged = dare_merge(adapters, drop_rate=0.5, seed=42)
 # Adaptive k: different components per layer based on explained variance
 subspace = SharedSubspace.from_adapters(adapters, adaptive_k=True, variance_threshold=0.9)
 
-# Quantize components
-subspace.quantize(bits=8)
+# Quantize components — symmetric (int8/int4) or NF4
+subspace.quantize(bits=8)  # symmetric int8
+subspace.quantize(method="nf4")  # NF4 4-bit (better for normal weights)
+subspace.quantize(method="nf4", double_quant=True)  # + quantize the scales
 
 # Check compression stats
 stats = subspace.compression_stats()

@@ -267,14 +340,16 @@ subspace.to(device="cuda", dtype=torch.float16)
 - `.absorb(adapter, task_id)` — Incorporate + recompute (full SVD)
 - `.absorb_incremental(adapter, task_id)` — Fast incremental update
 - `.get_trainable_params(task_id)` — For training integration
-- `.quantize(bits=8)` — Quantize components (int8/int4)
+- `.quantize(bits=8, method="symmetric")` — Quantize components (int8/int4/NF4)
 - `.compression_stats()` — Compression ratio and parameter counts
+- `.full_stack_compression(base_model_params)` — Combined base + adapter stats
 - `.to(device, dtype)` — Move tensors to device/dtype
-- `.save(path)` / `.load(path)` — Serialization
+- `.save(path)` / `.save_quantized(path)` / `.load(path)` — Serialization (NF4-packed auto-detected)
 
 ### Model Integration
 
-- **`VLoRAModel(base_model, subspace, lora_alpha=None)`** — Inference wrapper with forward hooks
+- **`VLoRAModel(base_model, subspace, lora_alpha=None, compute_dtype=None)`** — Inference wrapper with forward hooks
+- `.qlora_info` — Base model quantization metadata
 - `.set_task(task_id)` — Switch adapter (cached)
 - `.clear_task()` — Remove adapter
 - `.available_tasks` — List task IDs

@@ -325,6 +400,7 @@ subspace.to(device="cuda", dtype=torch.float16)
 - `compute_svd`, `project_onto_subspace`, `reconstruct_from_subspace`
 - `gram_schmidt`, `explained_variance_ratio`, `select_num_components`
 - `incremental_svd_update`
+- `nf4_quantize_dequantize`, `nf4_pack`, `nf4_unpack` — NF4 quantization (QLoRA)
 
 ## Benchmarks — Real-World Adapters
 
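An aside on the packed storage listed above: packing two 4-bit indices per uint8 takes 32-bit float components down to 4 bits plus roughly half a bit of scale overhead, which is where the ~7× disk figure comes from. A minimal sketch of the nibble packing follows, assuming flat uint8 index tensors; vlora's actual `nf4_pack`/`nf4_unpack` signatures may differ, so the helper names here are illustrative only.

```python
import torch


def pack_nibbles(idx: torch.Tensor) -> torch.Tensor:
    """Pack a flat uint8 tensor of 4-bit values (0..15), two per byte."""
    if idx.numel() % 2:  # pad odd lengths with a zero nibble
        idx = torch.cat([idx, idx.new_zeros(1)])
    pairs = idx.reshape(-1, 2)
    return ((pairs[:, 0] << 4) | pairs[:, 1]).to(torch.uint8)


def unpack_nibbles(packed: torch.Tensor, numel: int) -> torch.Tensor:
    """Inverse of pack_nibbles; `numel` restores the original (possibly odd) length."""
    high, low = packed >> 4, packed & 0x0F
    return torch.stack([high, low], dim=1).flatten()[:numel]


idx = torch.randint(0, 16, (1001,), dtype=torch.uint8)
assert torch.equal(unpack_nibbles(pack_nibbles(idx), idx.numel()), idx)
```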
{vlora_dev-0.2.1 → vlora_dev-0.3.0}/README.md

@@ -3,10 +3,10 @@
 </p>
 
 <p align="center">
-  <strong>
+  <strong>Various LoRA adapters. One shared basis.</strong>
 </p>
 
-
+Your adapters share more structure than you think. vLoRA finds the common basis and stores each adapter as a tiny coefficient vector — up to 122× compression at scale. Based on the [Share paper](https://arxiv.org/abs/2602.06043).
 
 ## Install
 

@@ -16,7 +16,7 @@ pip install vlora-dev
 
 Or from source:
 ```bash
-git clone https://github.com/
+git clone https://github.com/vlora-dev/vlora.git
 cd vlora
 pip install -e ".[dev]"
 ```

@@ -101,6 +101,77 @@ output = model(input_ids)
 print(model.available_tasks) # ["task_0", "task_1", ...]
 ```
 
+## QLoRA Support
+
+vLoRA has first-class support for [QLoRA](https://arxiv.org/abs/2305.14314) workflows. QLoRA compresses the **base model** (FP16 → 4-bit NF4), while vLoRA compresses the **adapter space** — these are orthogonal and stack multiplicatively.
+
+### NF4 Quantization
+
+Quantize subspace components using the same NF4 data type from QLoRA — 16 quantile levels optimized for normally-distributed weights:
+
+```python
+# NF4 quantization (better than symmetric int4 for normal-ish weights)
+subspace.quantize(method="nf4")
+
+# With double quantization (quantize the per-block scales too)
+subspace.quantize(method="nf4", double_quant=True)
+
+# Also quantize loadings (effective when loadings are approximately normal)
+subspace.quantize(method="nf4", quantize_loadings=True)
+```
+
+### Packed NF4 Storage
+
+Save the subspace in packed 4-bit format for ~7× disk savings:
+
+```python
+# Save: packs components as uint8 (two 4-bit values per byte)
+subspace.save_quantized("shared_subspace/")
+
+# Load: auto-detects format, dequantizes on the fly
+subspace = SharedSubspace.load("shared_subspace/")
+```
+
+### QLoRA Base Model
+
+`VLoRAModel` works with quantized base models loaded via bitsandbytes:
+
+```python
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+from vlora import VLoRAModel, SharedSubspace
+
+# Load 4-bit base model
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+)
+base_model = AutoModelForCausalLM.from_pretrained("model-name", quantization_config=bnb_config)
+
+# Wrap with vLoRA — compute_dtype ensures LoRA math runs in BF16
+subspace = SharedSubspace.load("shared_subspace/")
+model = VLoRAModel(base_model, subspace, compute_dtype=torch.bfloat16)
+
+print(model.qlora_info)  # {'quantized': True, 'method': 'nf4', ...}
+model.set_task("task_0")
+output = model(input_ids)
+```
+
+### Full-Stack Compression
+
+Report combined savings across base-model quantization and adapter compression:
+
+```python
+stats = subspace.full_stack_compression(
+    base_model_params=7_000_000_000,  # 7B model
+    base_model_bits=16,               # original FP16
+    quantized_bits=4,                 # QLoRA NF4
+)
+# → {'total_compression_ratio': 4.0, 'total_original_bytes': 14.0 GB, ...}
+```
+
+See [`examples/qlora_pipeline.py`](examples/qlora_pipeline.py) for a complete end-to-end example.
+
 ## Training in the Subspace
 
 Train only the loadings vector (k params per layer) instead of full LoRA matrices — 100×+ parameter reduction:

@@ -183,8 +254,10 @@ merged = dare_merge(adapters, drop_rate=0.5, seed=42)
 # Adaptive k: different components per layer based on explained variance
 subspace = SharedSubspace.from_adapters(adapters, adaptive_k=True, variance_threshold=0.9)
 
-# Quantize components
-subspace.quantize(bits=8)
+# Quantize components — symmetric (int8/int4) or NF4
+subspace.quantize(bits=8)  # symmetric int8
+subspace.quantize(method="nf4")  # NF4 4-bit (better for normal weights)
+subspace.quantize(method="nf4", double_quant=True)  # + quantize the scales
 
 # Check compression stats
 stats = subspace.compression_stats()

@@ -231,14 +304,16 @@ subspace.to(device="cuda", dtype=torch.float16)
 - `.absorb(adapter, task_id)` — Incorporate + recompute (full SVD)
 - `.absorb_incremental(adapter, task_id)` — Fast incremental update
 - `.get_trainable_params(task_id)` — For training integration
-- `.quantize(bits=8)` — Quantize components (int8/int4)
+- `.quantize(bits=8, method="symmetric")` — Quantize components (int8/int4/NF4)
 - `.compression_stats()` — Compression ratio and parameter counts
+- `.full_stack_compression(base_model_params)` — Combined base + adapter stats
 - `.to(device, dtype)` — Move tensors to device/dtype
-- `.save(path)` / `.load(path)` — Serialization
+- `.save(path)` / `.save_quantized(path)` / `.load(path)` — Serialization (NF4-packed auto-detected)
 
 ### Model Integration
 
-- **`VLoRAModel(base_model, subspace, lora_alpha=None)`** — Inference wrapper with forward hooks
+- **`VLoRAModel(base_model, subspace, lora_alpha=None, compute_dtype=None)`** — Inference wrapper with forward hooks
+- `.qlora_info` — Base model quantization metadata
 - `.set_task(task_id)` — Switch adapter (cached)
 - `.clear_task()` — Remove adapter
 - `.available_tasks` — List task IDs

@@ -289,6 +364,7 @@ subspace.to(device="cuda", dtype=torch.float16)
 - `compute_svd`, `project_onto_subspace`, `reconstruct_from_subspace`
 - `gram_schmidt`, `explained_variance_ratio`, `select_num_components`
 - `incremental_svd_update`
+- `nf4_quantize_dequantize`, `nf4_pack`, `nf4_unpack` — NF4 quantization (QLoRA)
 
 ## Benchmarks — Real-World Adapters
 
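The `full_stack_compression` numbers shown in both README copies above are plain byte accounting, worth spelling out once: a 7B-parameter FP16 base model is 7e9 × 2 bytes = 14.0 GB, and at 4 bits it is 3.5 GB, a 4.0× ratio before the (comparatively tiny) adapter terms are added. A sketch of that arithmetic, with field names mirroring the example output rather than the library's real return schema:

```python
def full_stack_sketch(base_params: int, base_bits: int, quant_bits: int,
                      adapters_orig_bytes: float = 0.0,
                      adapters_comp_bytes: float = 0.0) -> dict:
    """Illustrative byte accounting only, not vlora's implementation."""
    base_orig = base_params * base_bits / 8   # 7e9 * 16 / 8 = 14.0 GB
    base_comp = base_params * quant_bits / 8  # 7e9 *  4 / 8 =  3.5 GB
    total_orig = base_orig + adapters_orig_bytes
    total_comp = base_comp + adapters_comp_bytes
    return {
        "total_original_bytes": total_orig,
        "total_compressed_bytes": total_comp,
        "total_compression_ratio": total_orig / total_comp,
    }


print(full_stack_sketch(7_000_000_000, 16, 4))
# {'total_original_bytes': 14000000000.0, 'total_compressed_bytes': 3500000000.0,
#  'total_compression_ratio': 4.0}
```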
{vlora_dev-0.2.1 → vlora_dev-0.3.0}/docs/index.md

@@ -1,8 +1,8 @@
 # vlora
 
-**
+**Various LoRA adapters. One shared basis.**
 
-
+Your adapters share more structure than you think. vLoRA finds the common basis and stores each adapter as a tiny coefficient vector — up to 122× compression at scale. Based on the [Share paper](https://arxiv.org/abs/2602.06043).
 
 ## Install
 
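The tagline's claim (one shared basis, one tiny coefficient vector per adapter) is ordinary truncated SVD at heart. A minimal numeric sketch in plain PyTorch, not vlora's API, shows why the storage ratio grows with the number of adapters:

```python
import torch

N, D, k = 100, 8192, 16  # adapters, flattened params per adapter, basis size
torch.manual_seed(0)
hidden = torch.randn(k, D)              # structure the adapters secretly share
adapters = torch.randn(N, k) @ hidden   # N adapters, D params each (rank <= k)

# Top-k right singular vectors of the stacked adapters form the shared basis.
U, S, Vh = torch.linalg.svd(adapters, full_matrices=False)
basis = Vh[:k]                          # (k, D) shared basis
loadings = adapters @ basis.T           # (N, k) per-adapter coefficients

recon = loadings @ basis
print(f"max reconstruction error: {(recon - adapters).abs().max():.2e}")
print(f"storage ratio: {(N * D) / (k * D + N * k):.1f}x")  # approaches D/k as N grows
```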
vlora_dev-0.3.0/examples/qlora_pipeline.py ADDED

@@ -0,0 +1,154 @@
+"""QLoRA + vLoRA: End-to-end pipeline for efficient multi-adapter serving.
+
+This example shows the full workflow:
+1. Load a QLoRA-quantized base model (4-bit NF4)
+2. Load multiple LoRA adapters (produced by QLoRA fine-tuning)
+3. Build a shared subspace with NF4 quantization
+4. Serve with instant task switching via VLoRAModel
+
+Requirements:
+    pip install vlora-dev[hub] transformers bitsandbytes accelerate
+
+The pipeline combines two orthogonal compression techniques:
+- QLoRA: compresses the base model (FP16 -> NF4, ~4x savings)
+- vLoRA: compresses the adapter space (N adapters -> shared subspace, ~122x)
+Together they enable serving hundreds of task-specific adapters on a single GPU.
+"""
+
+from __future__ import annotations
+
+import torch
+
+# ── Step 0: Configuration ──────────────────────────────────────────────
+BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Small model for demo
+ADAPTER_REPOS = [
+    # Replace with your QLoRA adapter repos from HuggingFace Hub
+    # "username/adapter-task-a",
+    # "username/adapter-task-b",
+]
+NUM_COMPONENTS = 4  # Subspace dimension
+USE_NF4_STORAGE = True  # Save subspace in packed NF4 format
+
+
+def main():
+    # ── Step 1: Load QLoRA base model ──────────────────────────────────
+    # In production, load with 4-bit quantization:
+    #
+    # from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+    # bnb_config = BitsAndBytesConfig(
+    #     load_in_4bit=True,
+    #     bnb_4bit_quant_type="nf4",
+    #     bnb_4bit_compute_dtype=torch.bfloat16,
+    # )
+    # base_model = AutoModelForCausalLM.from_pretrained(
+    #     BASE_MODEL, quantization_config=bnb_config
+    # )
+    #
+    # For this demo, we simulate with synthetic data:
+    print("=== QLoRA + vLoRA Pipeline Demo ===\n")
+
+    # ── Step 2: Load adapters ──────────────────────────────────────────
+    from vlora import LoRAWeights, SharedSubspace, VLoRAModel
+
+    print("Creating synthetic adapters (replace with load_adapter_from_hub)...")
+    layers = [
+        "model.layers.0.self_attn.q_proj",
+        "model.layers.0.self_attn.v_proj",
+        "model.layers.1.self_attn.q_proj",
+        "model.layers.1.self_attn.v_proj",
+    ]
+    rank = 8
+    dim = 512
+    n_adapters = 10
+
+    # Create correlated adapters (simulates real LoRA adapters sharing structure)
+    torch.manual_seed(42)
+    shared_basis = {l: torch.randn(5, rank * dim) for l in layers}
+    adapters = []
+    task_ids = []
+    for i in range(n_adapters):
+        lora_a = {l: (torch.randn(5) @ shared_basis[l]).reshape(rank, dim) for l in layers}
+        lora_b = {l: torch.randn(dim, rank) * 0.01 for l in layers}
+        adapters.append(LoRAWeights(layer_names=layers, lora_a=lora_a, lora_b=lora_b, rank=rank))
+        task_ids.append(f"task_{i}")
+    print(f" Loaded {n_adapters} adapters, rank={rank}, {len(layers)} layers\n")
+
+    # ── Step 3: Build shared subspace ──────────────────────────────────
+    print("Building shared subspace...")
+    subspace = SharedSubspace.from_adapters(
+        adapters,
+        task_ids=task_ids,
+        num_components=NUM_COMPONENTS,
+    )
+
+    stats = subspace.compression_stats()
+    print(f" Components: {subspace.num_components}")
+    print(f" Compression: {stats['compression_ratio']:.1f}x")
+    print(f" Original params: {stats['total_params_original']:,}")
+    print(f" Compressed params: {stats['total_params_compressed']:,}\n")
+
+    # ── Step 4: Apply NF4 quantization to subspace ─────────────────────
+    print("Quantizing subspace with NF4...")
+    subspace.quantize(method="nf4", quantize_loadings=True)
+    print(" Done (components + loadings quantized)\n")
+
+    # ── Step 5: Save with packed NF4 storage ───────────────────────────
+    import tempfile
+    from pathlib import Path
+
+    save_dir = Path(tempfile.mkdtemp()) / "subspace"
+
+    if USE_NF4_STORAGE:
+        print("Saving with NF4-packed format...")
+        subspace.save_quantized(save_dir)
+    else:
+        print("Saving with float32 format...")
+        subspace.save(save_dir)
+
+    # Compare file sizes
+    total_bytes = sum(f.stat().st_size for f in save_dir.rglob("*") if f.is_file())
+    print(f" Saved to: {save_dir}")
+    print(f" Total size: {total_bytes / 1024:.1f} KB\n")
+
+    # ── Step 6: Load and serve ─────────────────────────────────────────
+    print("Loading subspace (auto-detects format)...")
+    loaded = SharedSubspace.load(save_dir)
+    print(f" {loaded!r}\n")
+
+    # Full-stack compression stats (with hypothetical QLoRA base model)
+    full_stats = loaded.full_stack_compression(
+        base_model_params=1_100_000_000,  # TinyLlama 1.1B
+        base_model_bits=16,
+        quantized_bits=4,
+    )
+    if "total_compression_ratio" in full_stats:
+        print("Full-stack compression (QLoRA base + vLoRA adapters):")
+        print(f" Base model: {full_stats['base_model']['compression_ratio']:.1f}x (FP16->NF4)")
+        print(f" Adapters: {stats['compression_ratio']:.1f}x ({n_adapters} adapters)")
+        print(f" Total: {full_stats['total_original_bytes']/1e9:.1f} GB -> "
+              f"{full_stats['total_compressed_bytes']/1e9:.2f} GB")
+        print(f" Combined: {full_stats['total_compression_ratio']:.1f}x\n")
+
+    # In production with a real base model:
+    #
+    # model = VLoRAModel(base_model, loaded, compute_dtype=torch.bfloat16)
+    # print(f"QLoRA info: {model.qlora_info}")
+    #
+    # # Instant task switching
+    # model.set_task("task_0")
+    # output = model(input_ids)
+    #
+    # model.set_task("task_5")  # microseconds to switch
+    # output = model(input_ids)
+
+    # Demonstrate reconstruction
+    print("Reconstructing adapters from subspace...")
+    for tid in ["task_0", "task_5", "task_9"]:
+        recon = loaded.reconstruct(tid)
+        print(f" {tid}: {recon!r}")
+
+    print("\nDone!")
+
+
+if __name__ == "__main__":
+    main()
vlora_dev-0.3.0/icon.png ADDED (binary file)

vlora_dev-0.3.0/logo.png ADDED (binary file)
{vlora_dev-0.2.1 → vlora_dev-0.3.0}/pyproject.toml

@@ -4,8 +4,8 @@ build-backend = "hatchling.build"
 
 [project]
 name = "vlora-dev"
-version = "0.2.1"
-description = "
+version = "0.3.0"
+description = "Various LoRA adapters. One shared basis. Up to 122x compression at scale."
 readme = "README.md"
 license = "Apache-2.0"
 requires-python = ">=3.9"
{vlora_dev-0.2.1 → vlora_dev-0.3.0}/src/vlora/__init__.py

@@ -5,13 +5,17 @@ share a common low-rank subspace. Instead of storing N separate adapters,
 maintain one shared basis and per-task coefficient vectors.
 """
 
-__version__ = "0.1.0"
+__version__ = "0.3.0"
 
 from vlora.io import LoRAWeights, load_adapter, load_adapter_from_hub, save_adapter
 from vlora.ops import (
+    NF4_QUANT_TABLE,
     compute_svd,
     explained_variance_ratio,
     gram_schmidt,
+    nf4_pack,
+    nf4_quantize_dequantize,
+    nf4_unpack,
     project_onto_subspace,
     reconstruct_from_subspace,
     select_num_components,

@@ -51,6 +55,11 @@ __all__ = [
     "gram_schmidt",
     "explained_variance_ratio",
     "select_num_components",
+    # NF4 quantization (QLoRA-style)
+    "NF4_QUANT_TABLE",
+    "nf4_quantize_dequantize",
+    "nf4_pack",
+    "nf4_unpack",
     # Analysis
     "compute_similarity_matrix",
     "find_clusters",