dmx-compress 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,401 @@
1
+ Metadata-Version: 2.4
2
+ Name: dmx-compress
3
+ Version: 0.3.0
4
+ Summary: DMX — Delta Multiplexed Model Format. Near-lossless neural network weight compression.
5
+ Author-email: "William J. Riley" <bill.riley@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/willjriley/dmx
8
+ Project-URL: Repository, https://github.com/willjriley/dmx
9
+ Project-URL: Issues, https://github.com/willjriley/dmx/issues
10
+ Keywords: compression,neural-network,model-compression,checkpoint,safetensors,deep-learning
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Science/Research
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
19
+ Classifier: Topic :: System :: Archiving :: Compression
20
+ Requires-Python: >=3.10
21
+ Description-Content-Type: text/markdown
22
+ Requires-Dist: torch>=2.0
23
+ Requires-Dist: numpy
24
+ Requires-Dist: zstandard
25
+ Requires-Dist: safetensors
26
+ Provides-Extra: gpu
27
+ Requires-Dist: torch>=2.0; extra == "gpu"
28
+ Provides-Extra: lpc
29
+
30
+ # DMX — Delta Multiplexed Model Format
31
+
32
+ **A new compression format for neural network weights.**
33
+
34
+ ```
35
+ Original: 9.1 GB (SVD-XT, FP32 — the 80% figure includes FP32→FP16 conversion)
36
+ DMX: 1.8 GB
37
+ ```
38
+
39
+ ```
40
+ Original: 7.2 GB (Wan 2.2 14B shard, FP32 — the 79.5% figure includes FP32→FP16)
41
+ DMX: 1.5 GB (142/142 tensors verified)
42
+ ```
43
+
44
+ ```
45
+ Original: 16 GB (Llama 3 8B, FP16 — 55% pure FP16 compression)
46
+ DMX: ~7.2 GB (+0.16% perplexity on wikitext-2)
47
+ ```
48
+
49
+ ### Try it now
50
+
51
+ ```bash
52
+ pip install dmx-compress
53
+ dmx compress your_model.safetensors compressed.dmx
54
+ ```
55
+
56
+ ### Download pre-compressed models now
57
+
58
+ | Model | Original | DMX | Savings | Verified |
59
+ |-------|----------|-----|---------|----------|
60
+ | [Wan 1.3B](https://huggingface.co/Senat1/dmx-wan-1.3b) | 2.7 GB | 1.1 GB | 60% | 825/825 tensors |
61
+ | [Wan 2.2 shard](https://huggingface.co/Senat1/dmx-wan2.2-shard6) | 7.2 GB | 1.5 GB | 79.5% | 142/142 tensors |
62
+ | [SVD-XT](https://huggingface.co/Senat1/dmx-svd-xt) | 9.1 GB | 1.8 GB | 80% | Roundtrip verified |
63
+
64
+ ---
65
+
66
+ ### Why this matters for frontier training
67
+
68
+ **From first principles:** In high-precision training, a full checkpoint is effectively a near-complete copy of the model state — weights (BF16/FP32) plus optimizer states (often another 2-4x the size). Each save is massive. Teams are forced to make checkpoints sparse (every few thousand steps) to keep storage and I/O under control. This is not a bug in the tools — it is a direct consequence of the numbers involved.
69
+
70
+ **The operational reality:** Frontier training runs are now routinely $50M-$200M+ each. While accelerators dominate the budget, checkpoint storage, bandwidth, and recovery time are real, recurring costs. Infra teams track these "quiet" expenses closely.
71
+
72
+ **Where DMX changes the equation:** One high-precision baseline anchor + many tiny, exact integer deltas with zero error accumulation (see [Test 8](#zero-error-accumulation-in-delta-chains-test-8) below). 200 checkpoints of a Llama 70B-class model are projected to drop from ~28 TB raw to ~3 TB while remaining mathematically safe for resumption, branching, and analysis.
73
+
74
+ | Aspect | Current Status Quo | With DMX Delta Chains | Benefit |
75
+ |--------|-------------------|----------------------|---------|
76
+ | Checkpoint frequency | Sparse (forced by cost) | Dense and safe | Better science and debugging |
77
+ | Storage for 200 ckpts (70B) | ~28 TB | ~3 TB (projected) | ~9x reduction |
78
+ | Resumption fidelity | Full copy required | Exact integer chain | Zero accumulation error (measured) |
79
+ | Fine-tune distribution | Full copy per variant | Small delta per variant | 80% savings (measured on TinyLlama 1.1B) |
80
+
81
+ This doesn't reduce the dominant cost (compute), but it meaningfully lowers a real operational friction point that every large lab deals with. It gives researchers and engineers far more usable history than was previously practical.
82
+
83
+ ---
84
+
85
+ ## What is DMX?
86
+
87
+ DMX is a near-lossless post-training compression format for neural network weights, optimized for storage and distribution. It reduces model file sizes by 55-80% while preserving model quality (+0.03-0.16% perplexity change).
88
+
89
+ - **No retraining required** — compress any pretrained safetensors model
90
+ - **Reversible** — decompress back to the original format
91
+ - **Broad compatibility** — tested on LLMs, diffusion models, video models, and encoder-decoder models
92
+
93
+ ### Storage and transfer comparison
94
+
95
+ DMX is focused on reducing model size for storage and network transfer — not runtime inference. Here's how a 140 GB model (Llama 3 70B, FP16) compares across compression approaches:
96
+
97
+ | Method | Compressed Size | Savings | Quality Loss | Purpose |
98
+ |--------|----------------|---------|-------------|---------|
99
+ | **safetensors** | 140 GB | 0% | None | Original format |
100
+ | **gzip** | ~134 GB | ~4% | None | Generic compression (barely helps on floats) |
101
+ | **zstd-19** | ~129 GB | ~8% | None | Better generic compression (still limited) |
102
+ | **DFloat11** | ~98 GB | ~30% | None (lossless) | Lossless NN weight compression |
103
+ | **ZipNN** | ~94 GB | ~33% | None (lossless) | Lossless NN weight compression |
104
+ | **DMX M=7** | ~63 GB | ~55% | +0.03% PPL | **Near-original quality, high compression** |
105
+ | **DMX M=6** | ~56 GB | ~60% | +0.16% PPL | **Aggressive storage compression** |
106
+
107
+ For reference, quantized inference formats like GGUF Q8 (~50%) and Q4 (~75%) achieve similar or greater compression but are designed for a different purpose — running models directly at reduced precision with fused kernels. DMX and GGUF serve different needs and are not interchangeable.
108
+
109
+ If lossless is enough, use DFloat11 or ZipNN. If you need to run inference at lower precision, use GGUF. If you need high compression with near-original quality for storage and distribution, that's where DMX lives.
110
+
111
+ | Without DMX | With DMX |
112
+ |-------------|----------|
113
+ | Llama 3 70B: 140 GB download | ~36 GB download |
114
+ | 4-5 models on 1 TB | 10+ models on 1 TB |
115
+
116
+ ### Training & DevOps use cases
117
+
118
+ Beyond individual model compression, DMX's aligned quantization enables delta encoding between related model files — useful for training infrastructure and model distribution at scale.
119
+
120
+ | Use Case | How DMX Helps |
121
+ |----------|--------------|
122
+ | **Checkpoint storage** | Delta-compress consecutive checkpoints (87.3% measured savings on GPT-2, validated on TinyLlama 1.1B). Both near-lossless (int16) and practically lossless (int32, error below FP32 noise floor) modes available. |
123
+ | **Model distribution** | Distribute fine-tune variants as small deltas from a shared base model |
124
+ | **Crash recovery** | Smaller checkpoints = faster reload from storage after GPU failure |
125
+ | **Model versioning** | Aligned integer space enables meaningful diffs between model versions |
126
+
127
+ ### Why not just use existing versioning tools?
128
+
129
+ Existing ML versioning tools treat model files as opaque blobs:
130
+
131
+ | Tool | Version tracking | Understands weight structure | Delta between versions |
132
+ |------|:-:|:-:|:-:|
133
+ | Git LFS / DVC | ✓ | ✗ | ✗ (full copy each version) |
134
+ | HuggingFace Hub | ✓ | ✗ | ✗ (full copy each version) |
135
+ | W&B / MLflow | ✓ | ✗ | ✗ (full copy each version) |
136
+ | xdelta (binary diff) | ✗ | ✗ | 8.5% savings (noise) |
137
+ | **DMX** | Planned | **✓** | **80-87% savings** |
138
+
139
+ The difference: subtracting two model files in raw float produces noise (IEEE 754 bit layout destroys numerical proximity). DMX's aligned quantization creates a coordinate system where subtraction produces clean, sparse integers — enabling meaningful diffs, efficient deltas, and 80-87% compression between related models.
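+
+ A minimal sketch of that effect, using synthetic tensors and a plain shared-scale int16 quantizer (not DMX's actual codec or container format):
+
+ ```python
+ import numpy as np
+ import zstandard as zstd
+
+ rng = np.random.default_rng(0)
+
+ # One layer's weights before and after a training step that nudges every value slightly.
+ base = (rng.standard_normal(1_000_000) * 0.02).astype(np.float32)
+ ckpt = base + (rng.standard_normal(base.size) * 1e-6).astype(np.float32)
+
+ cctx = zstd.ZstdCompressor(level=19)
+
+ # Raw float delta: every element differs, and the difference bits look like noise.
+ raw_delta = ckpt - base
+ raw_size = len(cctx.compress(raw_delta.tobytes()))
+
+ # Shared-scale quantization: both tensors land on the SAME integer grid, so a
+ # weight that barely moved maps to the same integer and its delta is exactly 0.
+ scale = np.abs(base).max() / 32767.0
+ q_base = np.round(base / scale).astype(np.int16)
+ q_ckpt = np.round(ckpt / scale).astype(np.int16)
+ int_delta = q_ckpt - q_base                      # mostly 0 and +/-1
+ int_size = len(cctx.compress(int_delta.tobytes()))
+
+ print(f"zeros in aligned int16 delta:    {np.mean(int_delta == 0):.1%}")
+ print(f"zstd bytes, raw float delta:     {raw_size}")
+ print(f"zstd bytes, aligned int16 delta: {int_size}")
+ ```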
140
+
141
+ These capabilities are under active development. See [Research Directions](#research-directions) for details and experimental results.
142
+
143
+ ## Key Properties
144
+
145
+ - **Up to 80% compression** on FP32 models (SVD-XT: 9.1 GB -> 1.8 GB, verified roundtrip)
146
+ - **60-74% compression** on FP16 models (Llama 3 8B, Mistral 7B, Wan 1.3B)
147
+ - **55-60% near-lossless compression** on FP32 models (GPT-2, Phi-2 — +0.12-0.22% PPL)
148
+ - **GPU-accelerated decompression**: 13.8x faster than CPU with `--gpu` flag
149
+ - **Tested on**: LLMs (GPT-2, Llama 3, TinyLlama), diffusion (Wan, SVD-XT), encoder-decoder (T5)
150
+ - **No training required**: pure post-training compression, works on any pretrained model
151
+
152
+ ## How It Works
153
+
154
+ ### BFP Mode (for FP16/BF16 models — recommended)
155
+ ```
156
+ Standard FP16: 16 bits per weight (5-bit exponent wasted on unused dynamic range)
157
+ DMX BFP: ~7 bits per weight (shared exponent per group + truncated mantissa + entropy coding)
158
+ ```
159
+
160
+ Trained weights cluster in a narrow magnitude range — 74% of values use only 3 of the 31 possible exponents. DMX shares one exponent per group of 32 values, eliminating wasted dynamic range, then entropy-codes the mantissa stream.
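+
+ A toy numpy version of that shared-exponent scheme (group size 32 and a 7-bit mantissa taken from the description above; the `bfp_encode`/`bfp_decode` helpers are illustrative only, and the real bitstream layout and entropy coder are omitted):
+
+ ```python
+ import numpy as np
+
+ def bfp_encode(x: np.ndarray, group: int = 32, mantissa_bits: int = 7):
+     """Toy block-floating-point encoder: one shared exponent per group of
+     `group` values, each value kept as a small signed mantissa."""
+     x = x.astype(np.float32).reshape(-1, group)
+     exp = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30)).astype(np.int8)
+     qmax = 2 ** (mantissa_bits - 1) - 1
+     mant = np.clip(np.round(x / np.exp2(exp.astype(np.float32)) * qmax), -qmax, qmax).astype(np.int8)
+     return exp, mant, qmax
+
+ def bfp_decode(exp, mant, qmax):
+     return (mant.astype(np.float32) / qmax * np.exp2(exp.astype(np.float32))).reshape(-1)
+
+ w = (np.random.default_rng(0).standard_normal(32 * 4096) * 0.02).astype(np.float32)
+ exp, mant, qmax = bfp_encode(w)
+ w_hat = bfp_decode(exp, mant, qmax)
+ # The int8 mantissa stream (plus one exponent per 32 values) is what would go
+ # on to the entropy coder; here we just check the reconstruction error.
+ print(f"relative L2 error: {np.linalg.norm(w - w_hat) / np.linalg.norm(w):.2e}")
+ ```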
161
+
162
+ ### int16 Mode (for FP32 models — near-lossless)
163
+ ```
164
+ Standard FP32: 32 bits per weight
165
+ DMX int16: ~13 bits per weight (aligned cross-layer quantization + entropy coding)
166
+ ```
167
+
168
+ Integer quantization as a preprocessing step (not a lossy final format) transforms float weights into a representation where entropy coding is effective. Aligned cross-layer quantization enforces a global coordinate system across layers, enabling structured compression.
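+
+ A rough illustration of why the integer domain helps, with a single global scale standing in for the aligned cross-layer quantization (synthetic layers, no container format, and only generic zstd where DMX uses its own entropy coder):
+
+ ```python
+ import numpy as np
+ import zstandard as zstd
+
+ rng = np.random.default_rng(0)
+ # Synthetic stand-ins for a few FP32 weight tensors.
+ layers = {f"layer{i}.weight": (rng.standard_normal(250_000) * 0.02).astype(np.float32)
+           for i in range(4)}
+
+ # One global scale shared by every layer: a crude "aligned" coordinate system.
+ global_scale = max(np.abs(w).max() for w in layers.values()) / 32767.0
+ cctx = zstd.ZstdCompressor(level=19)
+
+ raw = b"".join(w.tobytes() for w in layers.values())
+ quantized = b"".join(np.round(w / global_scale).astype(np.int16).tobytes()
+                      for w in layers.values())
+
+ print("zstd bytes, raw FP32:      ", len(cctx.compress(raw)))
+ print("zstd bytes, aligned int16: ", len(cctx.compress(quantized)))
+ # Reconstruction is q.astype(np.float32) * global_scale (near-lossless, not exact).
+ ```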
169
+
170
+ ## Installation
171
+
172
+ ```bash
173
+ pip install dmx-compress
174
+ ```
175
+
176
+ Or from source:
177
+ ```bash
178
+ git clone https://github.com/willjriley/dmx.git && cd dmx && pip install -e .
179
+ ```
180
+
181
+ **Requirements:** Python 3.10+, PyTorch 2.0+. GPU (CUDA) is optional — automatically used when available for faster compression and decompression.
182
+
183
+ ## Quick Start
184
+
185
+ ```bash
186
+ # Compress any safetensors model (auto-detects FP16 vs FP32)
187
+ dmx compress model.safetensors model.dmx --mode auto
188
+
189
+ # Practically lossless compression (FP32 models — error below FP32 noise floor)
190
+ dmx compress model.safetensors model.dmx --mode int32
191
+
192
+ # Decompress back to safetensors (auto-uses GPU if available)
193
+ dmx decompress model.dmx model.safetensors
194
+
195
+ # Verify roundtrip quality (with JSON report)
196
+ dmx verify model.safetensors model.dmx --report verify.json
197
+
198
+ # View compression info
199
+ dmx info model.dmx
200
+ ```
201
+
202
+ ### Delta compression (checkpoint / model versioning)
203
+
204
+ ```bash
205
+ # Delta-compress a checkpoint against a base (near-lossless, ~87% savings)
206
+ dmx delta-compress base.safetensors checkpoint.safetensors delta.dmxd
207
+
208
+ # Practically lossless delta (error below FP32 noise floor, ~87% savings)
209
+ dmx delta-compress base.safetensors checkpoint.safetensors delta.dmxd --precision int32
210
+
211
+ # Reconstruct checkpoint from base + delta
212
+ dmx delta-reconstruct base.safetensors delta.dmxd restored.safetensors
213
+
214
+ # View delta file info (sparsity, compression, per-component breakdown)
215
+ dmx delta-info delta.dmxd
216
+ ```
217
+
218
+ ### Example: Compress and verify a model from HuggingFace
219
+
220
+ ```bash
221
+ # Download a model
222
+ pip install huggingface_hub
223
+ huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./wan_model
224
+
225
+ # Compress it
226
+ dmx compress ./wan_model/model.safetensors wan_compressed.dmx
227
+
228
+ # Decompress and verify
229
+ dmx verify ./wan_model/model.safetensors wan_compressed.dmx --report report.json
230
+ ```
231
+
232
+ ## Decompression Speed
233
+
234
+ | Model | Mode | CPU | GPU (--gpu) | Speedup |
235
+ |-------|------|-----|-------------|---------|
236
+ | Wan 1.3B | BFP | 185s | 13.4s | 13.8x |
237
+ | SVD-XT | BFP | 281s | 22.3s | 12.5x |
238
+ | SVD-XT | int16 | 10.5s | -- | CPU-bound |
239
+
240
+ *Benchmarked on an RTX 4090 Laptop GPU, Python 3.13. The BFP CPU bottleneck is numpy bit manipulation; the GPU path uses PyTorch CUDA ops. A native C/CUDA decoder would likely be 10-50x faster still.*
241
+
242
+ ## Benchmarks
243
+
244
+ ### BFP Mode (FP16 models)
245
+
246
+ | Model | Type | Original | DMX | Savings | Quality |
247
+ |-------|------|----------|-----|---------|---------|
248
+ | Llama 3 8B | LLM | 16 GB | ~7.2 GB | **55%** | **+0.16% PPL** (wikitext-2) |
249
+ | Wan 2.2 shard | Video | 7.2 GB | 1.5 GB | **79.5%** | 142/142 tensors pass |
250
+ | Wan 1.3B | Diffusion | 2.7 GB | 1.1 GB | **60%** | 825/825 tensors pass |
251
+ | SVD-XT | Video | 9.1 GB | 1.8 GB | **80%** | Verified roundtrip |
252
+
253
+ *Note: SVD-XT 80% includes FP32->FP16 conversion. Wan 2.2 79.5% is on FP32 source with BFP.*
254
+
255
+ ### BFP Quality-per-Bit (Llama 3 8B, wikitext-2, 289K tokens)
256
+
257
+ | Config | Bits/Weight | Perplexity | vs FP16 |
258
+ |--------|-----------|------------|---------|
259
+ | FP16 baseline | 16.0 | 5.4958 | -- |
260
+ | BFP(M=8) | 9.25 | 5.4964 | +0.01% |
261
+ | BFP(M=7) | 8.25 | 5.4973 | **+0.03%** |
262
+ | BFP(M=6) | 7.25 | 5.5045 | +0.16% |
263
+ | *GGUF Q8_0 (ref)* | *8.50* | *~5.55-5.58* | *~1.0-1.5% (different purpose — inference format)* |
264
+
265
+ ### int16 Mode (FP32 models)
266
+
267
+ | Model | Type | Original | DMX | Savings | PPL Change |
268
+ |-------|------|----------|-----|---------|------------|
269
+ | SVD-XT | Video | 8.9 GB | 4.0 GB | **55.5%** | Lossless |
270
+ | GPT-2 | LLM | 475 MB | 201 MB | **57.7%** | +0.22% |
271
+ | Phi-2 | LLM | 10.6 GB | 4.2 GB | **60.1%** | +0.12% |
272
+
273
+ ### Why DMX beats generic compression
274
+
275
+ | Method | Bits/value | Notes |
276
+ |--------|-----------|-------|
277
+ | gzip on safetensors | ~15.5 | Raw floats look like noise |
278
+ | zstd level 19 | 14.06 | Dictionary matching, no prediction |
279
+ | **DMX int16 + entropy** | **11.45** | Aligned quantization enables structured entropy coding |
280
+ | **DMX BFP + zstd** | **~4.2** | Shared exponent eliminates wasted dynamic range |
281
+
282
+ ## Pre-Compressed Models (Try It Now)
283
+
284
+ Download DMX-compressed models and decompress them yourself:
285
+
286
+ | Model | Original | DMX | Savings | Verified | Link |
287
+ |-------|----------|-----|---------|----------|------|
288
+ | Wan 1.3B (Diffusion) | 2.7 GB | 1.1 GB | 60% | 825/825 tensors | [Download](https://huggingface.co/Senat1/dmx-wan-1.3b) |
289
+ | Wan 2.2 14B Shard 6 | 7.2 GB | 1.5 GB | 79.5% | 142/142 tensors | [Download](https://huggingface.co/Senat1/dmx-wan2.2-shard6) |
290
+ | SVD-XT (Video) | 9.1 GB | 1.8 GB | 80% | Roundtrip verified | [Download](https://huggingface.co/Senat1/dmx-svd-xt) |
291
+
292
+ Each includes a JSON verification report with SHA-256 hashes and per-tensor cosine similarity scores.
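+
+ To spot-check a decompressed model yourself rather than relying on the bundled report, a small safetensors/numpy sketch (the file names are placeholders for your own paths):
+
+ ```python
+ import numpy as np
+ from safetensors.numpy import load_file
+
+ # Placeholder paths: the original model and the output of `dmx decompress`.
+ orig = load_file("original.safetensors")
+ recon = load_file("decompressed.safetensors")
+
+ for name, a in orig.items():
+     a = a.astype(np.float64).ravel()
+     b = recon[name].astype(np.float64).ravel()
+     cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-30))
+     print(f"{name}: cosine similarity = {cos:.8f}")
+ ```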
293
+
294
+ ## Format Specification
295
+
296
+ See [spec/dmx_spec_v1.md](spec/dmx_spec_v1.md) for the complete format specification.
297
+
298
+ ## Paper
299
+
300
+ [DMX: Delta Multiplexed Compression for Neural Network Model Weights (PDF)](https://github.com/willjriley/dmx/raw/main/paper/DMX_Paper.pdf) — click to download
301
+
302
+ ## Background
303
+
304
+ DMX is based on the principle that floating-point weights should be transformed into multiple statistically distinct, independently modeled entropy domains prior to compression. Trained neural network weights exhibit extreme exponent clustering — 74% of FP16 values use only 3 of 31 possible exponents, wasting 2.4 bits per value. DMX decomposes the floating-point representation into separate exponent and mantissa streams, each with distinct statistical properties that benefit from independent entropy coding. For FP32 models, aligned cross-layer quantization enforces a global coordinate system across layers, enabling additional integer-domain compression. The format auto-profiles each model to select the optimal compression strategy per component.
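+
+ The exponent-clustering claim is easy to check on any FP16 tensor; here is a quick sketch (a synthetic Gaussian stands in for a real checkpoint tensor):
+
+ ```python
+ import numpy as np
+
+ # Stand-in for a trained FP16 weight tensor; swap in a tensor loaded from a
+ # real checkpoint (e.g. via safetensors) to measure an actual model.
+ w = (np.random.default_rng(0).standard_normal(1_000_000) * 0.02).astype(np.float16)
+
+ bits = w.view(np.uint16)
+ exponents = (bits >> 10) & 0x1F            # the 5-bit FP16 exponent field
+ counts = np.bincount(exponents.astype(np.int64), minlength=32)
+
+ print(f"distinct exponents used: {(counts > 0).sum()} of 32")
+ print(f"share of values in the 3 most common exponents: {np.sort(counts)[-3:].sum() / counts.sum():.1%}")
+ ```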
305
+
306
+ ## Validated Results: Checkpoint Delta Compression
307
+
308
+ All results are measured on real data using an NVIDIA A100-SXM4-80GB. Scripts are in `experiments/checkpoint_delta/`.
309
+
310
+ ### Compression across architectures
311
+
312
+ | Model | Architecture | Params | Consecutive Delta Zeros | Entropy (bits) | Measured Savings |
313
+ |-------|-------------|-------:|:----------------------:|:--------------:|:----------------:|
314
+ | GPT-2 | Decoder-only | 163M | 33-67% | 1.76-3.02 | **87.3%** (measured, 498→63 MB) |
315
+ | T5-small | Encoder-decoder | 110M | **89-94%** | **0.49-0.85** | Not yet measured in bytes |
316
+ | TinyLlama | Decoder-only | 1.1B | — | — | **80%** (measured, fine-tune base→chat) |
317
+
318
+ Delta compression works across model architectures. The T5 encoder-decoder shows higher delta sparsity than the decoder-only models. Real-byte compression measurements for T5 are pending.
319
+
320
+ ### Precision tiers
321
+
322
+ Both tiers achieve comparable compression — the aligned quantization produces similar entropy regardless of bit width:
323
+
324
+ | Tier | Consecutive Entropy | Compression | Error | Use Case |
325
+ |------|-------------------:|:----------:|:-----:|----------|
326
+ | **int16 aligned** | 0.6-1.3 bits | **87%** | +0.06% RelL2 | Maximum compression |
327
+ | **int32 aligned** | 1.0-1.2 bits | **~87%** | 1.87e-7 RelL2 | Practically lossless (error below FP32 noise floor) |
328
+ | Raw bit XOR (no alignment) | 14-16 bits | 8.5% | Bit-exact | Baseline — alignment is essential |
329
+
330
+ ### Full checkpoint including optimizer states
331
+
332
+ Training checkpoints include model weights plus Adam optimizer states (momentum and variance), making the full checkpoint roughly 3x the size of the weights alone. Validated on GPT-2 124M over 1000 training steps:
333
+
334
+ | Component | % of Checkpoint | Delta Sparsity | Entropy | Compression |
335
+ |-----------|:-:|:-:|:-:|:-:|
336
+ | Weights | 33% | 55-66% zeros | 1.8-2.6 bits | ~84% |
337
+ | Momentum (exp_avg) | 33% | 28-30% zeros | 7.5-9.0 bits | ~53% |
338
+ | Variance (exp_avg_sq) | 33% | **91-92% zeros** | **0.6 bits** | **~96%** |
339
+ | **Full checkpoint** | 100% | — | — | **~79%** |
340
+
341
+ ### Safety for training resumption
342
+
343
+ Resuming training from a DMX-reconstructed checkpoint produces a **0.042% loss difference** compared to resuming from the original — negligible for any practical purpose:
344
+
345
+ ```
346
+ Step | Original | DMX Recon | Diff
347
+ 1 | 0.783582 | 0.784023 | 0.00044072
348
+ 51 | 1.098088 | 1.098552 | 0.00046420
349
+ 91 | 0.537082 | 0.537364 | 0.00028241
350
+
351
+ Final avg loss (last 20 steps): 0.042% difference
352
+ ```
353
+
354
+ ### Zero error accumulation in delta chains (Test 8)
355
+
356
+ Chained reconstruction (base → delta1 → delta2 → ... → deltaN) produces **identical** results to direct reconstruction (base + deltaN) — verified to 10 decimal places across both int16 and int32 modes. This is not an approximation: delta application is exact integer arithmetic, so error is mathematically constant regardless of chain length. Re-anchoring is needed only for delta *size* control, never for error control.
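+
+ The underlying reason is plain integer associativity; a tiny sketch with synthetic quantized checkpoints (stand-ins for DMX's aligned integer representation):
+
+ ```python
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+
+ # Ten synthetic "checkpoints" in integer space, each a small step from the last.
+ ckpts = [rng.integers(-20_000, 20_000, size=1_000_000, dtype=np.int32)]
+ for _ in range(10):
+     ckpts.append(ckpts[-1] + rng.integers(-2, 3, size=ckpts[0].size, dtype=np.int32))
+
+ deltas = [b - a for a, b in zip(ckpts, ckpts[1:])]
+
+ # Chained reconstruction: base + d1 + d2 + ... + d10, applied one at a time.
+ chained = ckpts[0].copy()
+ for d in deltas:
+     chained += d
+
+ # Integer addition is exact, so the chain reproduces the final checkpoint bit for bit.
+ print("chained == final checkpoint:", np.array_equal(chained, ckpts[-1]))
+ ```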
357
+
358
+ ### Fine-tune variant compression
359
+
360
+ TinyLlama 1.1B base → chat fine-tune: **80% savings** (876 MB delta vs 4.4 GB full copy). Store the base model once, distribute each fine-tune variant as a small delta.
361
+
362
+ ### Projected savings at frontier scale
363
+
364
+ These are **projections** extrapolated from observed sparsity and scaling behavior across GPT-2 (163M), T5 (110M), and TinyLlama (1.1B). The underlying property — small per-step weight updates due to SGD dynamics — is expected to be scale-invariant, but real-byte validation at 70B+ scale is in progress.
365
+
366
+ | Scenario | Raw Storage | Projected DMX | Projected Savings |
367
+ |----------|------------|--------------|:-----------------:|
368
+ | 200 checkpoints of Llama 70B (weights only) | 28 TB | ~3.6 TB | ~87% |
369
+ | 200 checkpoints of Llama 70B (full w/ optimizer) | 84 TB | ~18 TB | ~79% |
370
+ | 20 fine-tune variants of Llama 70B | 2.8 TB | ~700 GB | ~75% |
371
+
372
+ **Caveats**: Validation on 8B+ models with frontier training schedules is in progress. Momentum compression (53%) was measured on wikitext-2; diverse training data may yield 40-45%, reducing full-checkpoint savings to ~70-73%. The 87% weight compression is measured on GPT-2; larger models may differ.
373
+
374
+ ---
375
+
376
+ ## Research Directions
377
+
378
+ DMX's underlying compression technique applies to structured floating-point data beyond individual model files. These are active research areas, not yet proven at scale. We welcome collaboration.
379
+
380
+ **1. Training checkpoint compression (highest priority).** Frontier training produces hundreds of near-identical high-precision checkpoints. Aligned cross-layer quantization enables efficient delta encoding between them. Early results are in the [Validated Results](#validated-results-checkpoint-delta-compression) section above. Key finding: alignment is critical — without it, deltas show almost no sparsity and compress poorly.
381
+
382
+ **2. Model family distribution.** Storing fine-tuned variants (chat, code, reasoning, etc.) as small deltas from a shared base model. Early result: TinyLlama base → chat = 80% savings (876 MB vs 4.4 GB).
383
+
384
+ **3. Scientific and sensor data.** Early tests on NOAA weather data show similar exponent clustering, suggesting potential applications in climate, seismic, and satellite data.
385
+
386
+ ## License & Patent
387
+
388
+ **Code:** MIT License — free to use, modify, and distribute.
389
+
390
+ **Methods:** Patent Pending (U.S. Provisional Applications filed April 2026). The patented methods cover aligned cross-layer quantization for neural network weight compression and stream-separated block floating point encoding with independent entropy coding. Personal, academic, and open-source use is unrestricted. Commercial use of the patented methods may require a license from the inventor — contact bill.riley@gmail.com.
391
+
392
+ ## Citation
393
+
394
+ ```bibtex
395
+ @software{riley2026dmx,
396
+ author = {Riley, William J.},
397
+ title = {DMX: Delta Multiplexed Model Format},
398
+ year = {2026},
399
+ url = {https://github.com/willjriley/dmx}
400
+ }
401
+ ```