castkit-0.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (47)
  1. castkit-0.1.0/.claude/settings.local.json +14 -0
  2. castkit-0.1.0/CLAUDE.md +72 -0
  3. castkit-0.1.0/LICENSE +21 -0
  4. castkit-0.1.0/PKG-INFO +44 -0
  5. castkit-0.1.0/README.md +47 -0
  6. castkit-0.1.0/castkit.toml.example +33 -0
  7. castkit-0.1.0/docs/cli-architecture.md +151 -0
  8. castkit-0.1.0/docs/conversion-paths.md +178 -0
  9. castkit-0.1.0/docs/dgx-spark-testing.md +96 -0
  10. castkit-0.1.0/docs/implementation-plan.md +311 -0
  11. castkit-0.1.0/docs/project-overview.md +83 -0
  12. castkit-0.1.0/docs/quantization-formats.md +188 -0
  13. castkit-0.1.0/docs/trends-and-benchmarks.md +120 -0
  14. castkit-0.1.0/pyproject.toml +41 -0
  15. castkit-0.1.0/src/castkit/__init__.py +3 -0
  16. castkit-0.1.0/src/castkit/backends/__init__.py +0 -0
  17. castkit-0.1.0/src/castkit/backends/awq.py +472 -0
  18. castkit-0.1.0/src/castkit/backends/base.py +49 -0
  19. castkit-0.1.0/src/castkit/backends/gguf.py +482 -0
  20. castkit-0.1.0/src/castkit/backends/gptq.py +388 -0
  21. castkit-0.1.0/src/castkit/backends/mlx.py +562 -0
  22. castkit-0.1.0/src/castkit/cli/__init__.py +0 -0
  23. castkit-0.1.0/src/castkit/cli/main.py +846 -0
  24. castkit-0.1.0/src/castkit/core/__init__.py +0 -0
  25. castkit-0.1.0/src/castkit/core/config.py +40 -0
  26. castkit-0.1.0/src/castkit/core/dataset.py +32 -0
  27. castkit-0.1.0/src/castkit/core/download.py +36 -0
  28. castkit-0.1.0/src/castkit/core/metadata.py +51 -0
  29. castkit-0.1.0/src/castkit/core/model_info.py +294 -0
  30. castkit-0.1.0/src/castkit/core/utils.py +18 -0
  31. castkit-0.1.0/src/castkit/types.py +114 -0
  32. castkit-0.1.0/tests/__init__.py +0 -0
  33. castkit-0.1.0/tests/backends/__init__.py +0 -0
  34. castkit-0.1.0/tests/backends/test_awq.py +127 -0
  35. castkit-0.1.0/tests/backends/test_gguf.py +125 -0
  36. castkit-0.1.0/tests/backends/test_gptq.py +121 -0
  37. castkit-0.1.0/tests/backends/test_mlx.py +101 -0
  38. castkit-0.1.0/tests/conftest.py +101 -0
  39. castkit-0.1.0/tests/test_cli.py +242 -0
  40. castkit-0.1.0/tests/test_config.py +58 -0
  41. castkit-0.1.0/tests/test_cross_format.py +99 -0
  42. castkit-0.1.0/tests/test_download.py +33 -0
  43. castkit-0.1.0/tests/test_metadata.py +139 -0
  44. castkit-0.1.0/tests/test_model_info.py +67 -0
  45. castkit-0.1.0/tests/test_types.py +113 -0
  46. castkit-0.1.0/tests/test_utils.py +27 -0
  47. castkit-0.1.0/uv.lock +2282 -0
castkit-0.1.0/.claude/settings.local.json ADDED
@@ -0,0 +1,14 @@
+ {
+ "permissions": {
+ "allow": [
+ "Bash(codex exec:*)",
+ "Bash(hf whoami:*)",
+ "Bash(hf auth:*)",
+ "Bash(hf:*)",
+ "Bash(uvx ruff:*)",
+ "Bash(uvx pytest tests/test_metadata.py tests/test_cross_format.py -v)",
+ "Bash(uv pip:*)",
+ "Bash(uv build:*)"
+ ]
+ }
+ }
castkit-0.1.0/CLAUDE.md ADDED
@@ -0,0 +1,72 @@
+ # castkit
+
+ Universal model quantization and format conversion CLI.
+
+ ## Quick Reference
+
+ - Language: Python 3.12+
+ - Package manager: uv
+ - CLI framework: typer
+ - Config format: TOML
+ - Linter/Formatter: ruff
+ - Test: pytest (uv run pytest)
+ - Build: hatch (via pyproject.toml)
+
+ ## Commands
+
+ ```
+ uv sync # install dependencies
+ uv sync --extra gguf # install with GGUF support
+ uv sync --extra mlx # install with MLX support
+ uv sync --extra dev # install dev tools (ruff, pytest)
+ uv sync --all-extras # install everything
+ uv run pytest # run tests
+ uv run ruff format src/ tests/ # format
+ uv run ruff check src/ tests/ # lint
+ uv run castkit --help # run CLI
+ ```
+
+ ## Architecture
+
+ - src/castkit/cli/ - CLI commands (typer)
+ - src/castkit/core/ - shared utilities (config, download, model info, metadata)
+ - src/castkit/backends/ - format-specific backends (gguf.py, mlx.py, awq.py, gptq.py)
+ - tests/ - pytest tests
+
+ Each backend implements the abstract Backend interface (backends/base.py). New format support = new backend file.
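The one-backend-per-format rule above can be sketched as an abstract base class. This is a minimal illustration only; the real interface lives in `backends/base.py` and its exact method names and signatures may differ, and `EchoBackend` is a hypothetical stand-in for a new format backend.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Backend(ABC):
    """Sketch of a format backend; the real interface lives in backends/base.py."""

    name: str  # e.g. "gguf", "mlx", "awq", "gptq"

    @abstractmethod
    def convert(self, model_path: Path, output: Path, **options) -> Path:
        """Quantize/convert a full-precision model into this format."""

    @abstractmethod
    def decast(self, model_path: Path, output: Path) -> Path:
        """Dequantize back to FP16 SafeTensors."""

    @abstractmethod
    def info(self, model_path: Path) -> dict:
        """Return format metadata (quant type, tensor count, ...)."""


class EchoBackend(Backend):
    """Dummy backend used here only to show how a new format plugs in."""

    name = "echo"

    def convert(self, model_path: Path, output: Path, **options) -> Path:
        return output

    def decast(self, model_path: Path, output: Path) -> Path:
        return output

    def info(self, model_path: Path) -> dict:
        return {"format": self.name, "source": str(model_path)}


backend = EchoBackend()
print(backend.info(Path("model.gguf"))["format"])  # -> echo
```

Adding a new format then means adding one file with one `Backend` subclass, without touching the CLI layer.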
+
+ ## Implementation Plan
+
+ See docs/implementation-plan.md for historical reference. All 4 phases are complete:
+
+ 1. Phase 1: Core + GGUF backend
+ 2. Phase 2: MLX backend
+ 3. Phase 3: AWQ + GPTQ backends
+ 4. Phase 4: Config recipes, perplexity measurement, batch conversion, cross-format conversion
+
+ ## Project-specific Rules
+
+ - Dependencies are split via extras: castkit[gguf], castkit[mlx], castkit[all]
+ - Backend imports must be lazy (import inside function) to avoid ImportError when extras are not installed
+ - MLX backend: Apple Silicon only. Check platform at import time.
+ - AWQ/GPTQ backends: use GPTQModel library. convert requires NVIDIA GPU, decast/info work on CPU/Apple Silicon.
+ - GGUF backend: shells out to llama-quantize/convert_hf_to_gguf.py. Detect presence at runtime.
+ - Use ruff for formatting and linting (not oxfmt/oxlint - those are for TypeScript)
+ - Type hints on all public functions
+ - Error messages must be actionable (tell the user what to do next)
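The lazy-import, platform-check, and actionable-error rules above combine naturally into one loader pattern. A minimal sketch, assuming a hypothetical `load_mlx_backend` helper (the actual castkit code may structure this differently):

```python
import importlib.util
import platform
import sys


def load_mlx_backend():
    """Hypothetical loader showing the lazy-import rule: the heavy import
    happens inside the function, so `castkit` without the mlx extra still starts."""
    if sys.platform != "darwin" or platform.machine() != "arm64":
        raise RuntimeError(
            "MLX backend requires Apple Silicon. "
            "Use the gguf or gptq backend on this platform."
        )
    if importlib.util.find_spec("mlx_lm") is None:
        raise RuntimeError("mlx-lm is not installed. Run: uv sync --extra mlx")
    import mlx_lm  # deferred import: only reached when the extra is present

    return mlx_lm
```

Note how each error message tells the user what to do next, per the rules above.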
+
+ ## Gotchas
+
+ - `uv sync --extra mlx` drops the dev deps (ruff, pytest). If you need lint/test, run `uv sync --extra mlx --extra dev`, or use `uvx ruff` / `uvx pytest`
+ - mlx-lm's upload feature is an internal function (`utils.upload_to_hub`), not a public API. castkit uploads directly via `huggingface_hub.HfApi` instead
+ - HF uploads use `api.upload_folder`. `api.upload_large_folder` also exists, but `upload_folder` is sufficient for small-to-medium models
+ - Homebrew's llama.cpp (convert_hf_to_gguf.py) may require a newer gguf than the PyPI package provides. If GGUF convert fails with an ImportError, update it with `uv pip install 'gguf @ git+https://github.com/ggml-org/llama.cpp#subdirectory=gguf-py'`
+
+ ## Reference Docs
+
+ - docs/project-overview.md - project vision, CLI design, competitive landscape
+ - docs/implementation-plan.md - decisions, package structure, implementation details
+ - docs/quantization-formats.md - all quantization format specs
+ - docs/conversion-paths.md - format conversion matrix and paths
+ - docs/trends-and-benchmarks.md - industry trends, benchmark data
+ - docs/cli-architecture.md - existing tool architectures, dependency info
castkit-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 schroneko
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
castkit-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,44 @@
+ Metadata-Version: 2.4
+ Name: castkit
+ Version: 0.1.0
+ Summary: Universal model quantization and format conversion CLI
+ Project-URL: Repository, https://github.com/schroneko/castkit
+ Author: schroneko
+ License-Expression: MIT
+ License-File: LICENSE
+ Requires-Python: >=3.12
+ Requires-Dist: huggingface-hub>=0.25
+ Requires-Dist: rich>=13
+ Requires-Dist: typer>=0.15
+ Provides-Extra: all
+ Requires-Dist: accelerate>=1.0; extra == 'all'
+ Requires-Dist: datasets>=3.0; extra == 'all'
+ Requires-Dist: gguf>=0.10; extra == 'all'
+ Requires-Dist: gptqmodel>=2.0; extra == 'all'
+ Requires-Dist: mlx-lm>=0.20; extra == 'all'
+ Requires-Dist: mlx>=0.22; extra == 'all'
+ Requires-Dist: numpy>=1.26; extra == 'all'
+ Requires-Dist: safetensors>=0.4; extra == 'all'
+ Requires-Dist: sentencepiece>=0.2; extra == 'all'
+ Requires-Dist: torch>=2.4; extra == 'all'
+ Requires-Dist: transformers>=4.45; extra == 'all'
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8; extra == 'dev'
+ Requires-Dist: ruff>=0.9; extra == 'dev'
+ Provides-Extra: gguf
+ Requires-Dist: gguf>=0.10; extra == 'gguf'
+ Requires-Dist: numpy>=1.26; extra == 'gguf'
+ Requires-Dist: safetensors>=0.4; extra == 'gguf'
+ Requires-Dist: sentencepiece>=0.2; extra == 'gguf'
+ Requires-Dist: torch>=2.4; extra == 'gguf'
+ Requires-Dist: transformers>=4.45; extra == 'gguf'
+ Provides-Extra: gptq
+ Requires-Dist: accelerate>=1.0; extra == 'gptq'
+ Requires-Dist: datasets>=3.0; extra == 'gptq'
+ Requires-Dist: gptqmodel>=2.0; extra == 'gptq'
+ Requires-Dist: safetensors>=0.4; extra == 'gptq'
+ Requires-Dist: torch>=2.4; extra == 'gptq'
+ Requires-Dist: transformers>=4.45; extra == 'gptq'
+ Provides-Extra: mlx
+ Requires-Dist: mlx-lm>=0.20; extra == 'mlx'
+ Requires-Dist: mlx>=0.22; extra == 'mlx'
castkit-0.1.0/README.md ADDED
@@ -0,0 +1,47 @@
+ # castkit
+
+ castkit is a CLI tool for model quantization and format conversion across GGUF, MLX, GPTQ, and AWQ workflows, including cross-format conversion via automatic FP16 decast.
+
+ ## Installation
+
+ ### Homebrew (macOS)
+
+ ```bash
+ brew install schroneko/castkit/castkit
+ ```
+
+ ### pip / uv
+
+ ```bash
+ uv tool install castkit # core only
+ uv tool install "castkit[mlx]" # MLX backend (Apple Silicon)
+ uv tool install "castkit[gguf]" # GGUF backend (requires torch)
+ uv tool install "castkit[all]" # all backends
+ ```
+
+ ## Quick Start
+
+ ```bash
+ # convert
+ castkit convert Qwen/Qwen3-0.6B -f gguf -q q4_k_m -o ./output/Qwen3-0.6B.gguf
+
+ # decast (dequantize back to FP16 SafeTensors)
+ castkit decast ./output/Qwen3-0.6B.gguf -o ./output/Qwen3-0.6B-fp16
+
+ # model info
+ castkit info ./output/Qwen3-0.6B.gguf
+ ```
+
+ ## Supported Formats
+
+ | Format | Convert | Decast | Info | Measure |
+ | -------------- | ------------------ | ------ | ---- | ----------------- |
+ | GGUF | Yes | Yes | Yes | Yes |
+ | MLX | Yes | Yes | Yes | Yes |
+ | GPTQ | Yes | Yes | Yes | Yes |
+ | AWQ | Yes | Yes | Yes | Yes |
+ | FP16/BF16/FP32 | Input/Intermediate | N/A | Yes | Backend-dependent |
+
+ ## License
+
+ MIT
castkit-0.1.0/castkit.toml.example ADDED
@@ -0,0 +1,33 @@
+ [default]
+ output_dir = "./output"
+
+ [recipes.gguf-standard]
+ format = "gguf"
+ quant = "q4_k_m"
+ imatrix = true
+ imatrix_data = "calibration.txt"
+
+ [recipes.gguf-all]
+ format = "gguf"
+ quant = ["q8_0", "q6_k", "q5_k_m", "q4_k_m", "q3_k_m", "q2_k"]
+ imatrix = true
+
+ [recipes.mlx-4bit]
+ format = "mlx"
+ bits = 4
+ group_size = 64
+
+ [recipes.mlx-8bit]
+ format = "mlx"
+ bits = 8
+ group_size = 64
+
+ [recipes.awq-4bit]
+ format = "awq"
+ bits = 4
+ group_size = 128
+
+ [recipes.gptq-4bit]
+ format = "gptq"
+ bits = 4
+ group_size = 128
castkit-0.1.0/docs/cli-architecture.md ADDED
@@ -0,0 +1,151 @@
+ ## Existing Tool Architectures
+
+ ### llama.cpp Pipeline
+
+ Two-step process:
+
+ 1. convert_hf_to_gguf.py (Python): HF model -> F16 GGUF
+ - Architecture detection via config.json
+ - Model class instantiation via @register decorator (80+ architectures)
+ - Lazy tensor loading (no memory consumption during indexing)
+ - TensorNameMap translates HF names -> GGUF names
+ - Supports SafeTensors (mmap), PyTorch, remote HF streaming
+ - Dependencies: numpy, torch, safetensors, sentencepiece, gguf
+
+ 2. llama-quantize (C++): F16 GGUF -> quantized GGUF
+ - Sequential tensor processing with mmap
+ - imatrix support via --imatrix flag
+ - Per-tensor type control via --tensor-type REGEX:TYPE
+ - Can process any model size on CPU (no GPU needed for basic quant)
+
+ ### mlx-lm convert
+
+ Single-step Python:
+
+ - Downloads HF model via transformers/huggingface_hub
+ - Maps weights to MLX arrays
+ - Optional quantization: to_quantized(group_size, bits) on Linear/Embedding layers
+ - Outputs SafeTensors + config.json
+ - --upload-repo for direct HF Hub upload
+ - Dependencies: mlx, mlx-lm, transformers, huggingface_hub, safetensors
+ - Apple Silicon only
+
+ ### GPTQModel
+
+ Python library + CLI:
+
+ - Layer-by-layer Hessian-based optimization
+ - Multi-GPU data-parallel quantization
+ - Supports GPTQ, AWQ, QQQ, GPTAQ, EoRA, GAR methods
+ - Dynamic per-module mixed quantization
+ - Inference kernels: Marlin (fastest), ExLlama V2/V1, Triton, Torch, BitBLAS
+
+ ### AutoAWQ (archived May 2025)
+
+ Python library:
+
+ - Activation analysis -> salient channel identification -> per-channel scaling
+ - 10-30 min for 7B (much faster than GPTQ)
+ - Replaced by llm-compressor (AWQModifier) and GPTQModel
+
+ ### ExLlamaV2 (EXL2)
+
+ Two-phase Python:
+
+ 1. Measurement: model quantized ~12 times with different params, error measured per layer. Saved to measurement.json.
+ 2. Quantization: optimizer selects per-layer bit allocations to minimize total error at target bpw.
+
+ - Memory: only largest transformer layer must fit in VRAM
+ - 7B: ~16 GB RAM, ~8 GB VRAM. 70B: ~64 GB RAM, ~24 GB VRAM.
+
+ ### quantkit (llm-quantkit)
+
+ Thin Python CLI wrapper:
+
+ ```
+ quantkit/
+ cli.py # CLI entry point
+ quantkit.py # Core orchestration
+ convert.py # General conversion logic
+ convert_exl2.py # EXL2-specific
+ convert_hf.py # HuggingFace download/conversion
+ safetensor.py # SafeTensor utilities
+ ```
+
+ - Delegates to respective libraries for each format
+ - Limitations: no MLX, no ONNX, limited config exposure, Python 3.12+ issues, no imatrix generation, no batch multi-format quantization
+
+ ## Memory Requirements for Quantization
+
+ Rule of thumb: FP16 model requires ~2 GB per billion parameters.
+
+ | Model Size | FP16 Size | GPTQ/AWQ VRAM | GGUF (CPU) | EXL2 VRAM |
+ | ---------- | --------- | -------------- | ---------- | ----------------------- |
+ | 7B | ~14 GB | ~16-24 GB | CPU only | ~8 GB VRAM + 16 GB RAM |
+ | 13B | ~26 GB | ~32-48 GB | CPU only | ~16 GB VRAM + 32 GB RAM |
+ | 34B | ~68 GB | Fails on 24 GB | CPU only | ~24 GB VRAM + 64 GB RAM |
+ | 70B | ~140 GB | Fails on 24 GB | CPU only | ~24 GB VRAM + 64 GB RAM |
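The ~2 GB per billion parameters rule and the GPTQ/AWQ VRAM column above can be expressed as a quick check. The helper names and the 1.2x overhead factor are illustrative assumptions, not part of any castkit API:

```python
def fp16_size_gb(params_billions: float) -> float:
    """Rule of thumb: FP16 weights take ~2 bytes/param, i.e. ~2 GB per billion params."""
    return params_billions * 2.0


def fits_on_gpu(params_billions: float, vram_gb: float, overhead: float = 1.2) -> bool:
    """Rough check for GPTQ/AWQ-style quantization, which must hold the FP16
    model in VRAM plus some working overhead (factor is an assumption)."""
    return fp16_size_gb(params_billions) * overhead <= vram_gb


print(fp16_size_gb(7))      # -> 14.0
print(fits_on_gpu(7, 24))   # -> True
print(fits_on_gpu(34, 24))  # -> False (matches the table: fails on 24 GB)
```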
+
+ ## Common Dependencies
+
+ Core (required by almost all):
+
+ - torch/pytorch: tensor operations, model loading
+ - transformers: model architectures, tokenizers, from_pretrained
+ - safetensors: efficient weight storage (standard on HF)
+ - huggingface_hub: model downloading (snapshot_download, hf_hub_download)
+ - numpy: numerical operations
+ - sentencepiece: tokenizer (LLaMA, Gemma, etc.)
+ - accelerate: multi-GPU model loading and sharding
+ - datasets: calibration dataset loading
+
+ Format-specific:
+
+ - gguf (from llama.cpp): GGUF read/write
+ - gptqmodel: GPTQ quantization
+ - autoawq / llm-compressor: AWQ quantization
+ - exllamav2: EXL2 quantization
+ - bitsandbytes: NF4/INT8 runtime
+ - optimum / optimum-onnx: ONNX export
+ - onnxruntime: ONNX inference/quantization
+ - hqq: Half-Quadratic Quantization
+ - mlx, mlx-lm: MLX format (Apple Silicon only)
+
+ ## Model Download Patterns
+
+ Two approaches:
+
+ 1. transformers.AutoModelForCausalLM.from_pretrained(): loads into memory. Used by GPTQ/AWQ that need model for calibration.
+ 2. huggingface_hub.snapshot_download(): downloads to cache without loading. Used by llama.cpp and file-based tools.
+
+ HF cache: ~/.cache/huggingface/hub/ with content-addressable storage + symlinks.
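Pattern 2 and the cache layout above can be sketched as follows. The `hf_hub_cache` helper is a simplified approximation (the real resolution in huggingface_hub also honors `HF_HUB_CACHE` and related variables), and the `snapshot_download` call is shown as a comment because it performs a network request:

```python
import os
from pathlib import Path


def hf_hub_cache() -> Path:
    """Approximate the default HF hub cache location (simplified: real
    resolution in huggingface_hub also honors HF_HUB_CACHE)."""
    hf_home = os.environ.get("HF_HOME")
    base = Path(hf_home) if hf_home else Path.home() / ".cache" / "huggingface"
    return base / "hub"


# Pattern 2: download files into the cache without loading weights into memory.
# from huggingface_hub import snapshot_download
# local_dir = snapshot_download("Qwen/Qwen2.5-0.5B-Instruct")

print(hf_hub_cache())
```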
+
+ ## HuggingFace Transformers Quantization Integration
+
+ | Format | Loading | Backend | Auto-detection |
+ | ------------------ | --------------------------------------------------------- | ------------------ | -------------------- |
+ | GPTQ | from_pretrained() | GPTQModel | quantize_config.json |
+ | AWQ | from_pretrained() | AutoAWQ | config |
+ | BnB | from_pretrained(quantization_config=BitsAndBytesConfig()) | bitsandbytes | explicit config |
+ | HQQ | from_pretrained(quantization_config=HqqConfig()) | HQQ | explicit config |
+ | GGUF | from_pretrained(gguf_file="model.gguf") | gguf-py | explicit param |
+ | compressed-tensors | from_pretrained() | compressed-tensors | auto |
+ | EXL2/EXL3 | NOT supported | ExLlama only | - |
+
+ Note: transformers loads GGUF by dequantizing to FP32, so it uses as much memory as the full-precision model. For fine-tuning/conversion only, not inference.
+
+ ## Design Considerations for castkit
+
+ 1. Format-specific backends are fundamentally different: GGUF is CPU-based C++ (fast, no GPU), GPTQ/AWQ require GPU + calibration, EXL2 has two-phase flow, ONNX has its own pipeline. Must orchestrate as separate backends.
+
+ 2. Calibration divides UX: GGUF quantization takes minutes with no calibration. GPTQ/AWQ/EXL2 take hours with calibration data. Surface this difference clearly.
+
+ 3. Memory constraints vary: GGUF can quantize any model on CPU. GPTQ/AWQ fail for models larger than VRAM. EXL2 only needs largest layer to fit. Surface constraints upfront.
+
+ 4. quantkit's thin-wrapper approach works but is limited. Room for improvement: MLX support, ONNX support, imatrix generation, batch multi-format output, better config exposure.
+
+ 5. Ecosystem consolidation: AutoAWQ archived, GPTQModel expanding, llm-compressor for vLLM. Track these shifts.
+
+ 6. Common infrastructure: HF download, model metadata parsing, output upload. Natural abstraction layer.
+
+ 7. Intel AutoRound can export to multiple formats from single quantization run - consider similar multi-output capability.
castkit-0.1.0/docs/conversion-paths.md ADDED
@@ -0,0 +1,178 @@
+ # Conversion Paths Reference
+
+ All conversion paths between quantization formats, with tooling and quality implications.
+
+ ## Terminology
+
+ - GGUF: self-contained binary file format (.gguf) by llama.cpp
+ - SafeTensors: secure tensor serialization container (.safetensors) by HuggingFace. Standard container for HF-ecosystem models.
+ - GPTQ, AWQ, EXL2/EXL3, HQQ, bitsandbytes: quantization algorithms/methods. Output stored in SafeTensors with metadata JSON.
+ - MLX: Apple Silicon native format using SafeTensors-style storage.
+
+ ---
+
+ ## Direct Paths (FP16/BF16 source)
+
+ All quantization methods are designed to take full-precision as input. This is the golden path.
+
+ | Target | Tool | Time (7B) | Calibration |
+ | ------------------ | -------------------------------------- | ------------- | ------------------------- |
+ | GGUF | convert_hf_to_gguf.py + llama-quantize | Minutes | Optional (imatrix) |
+ | GPTQ | GPTQModel | 2-4 hours | 128 samples from C4 |
+ | AWQ | AutoAWQ / llm-compressor | 10-30 min | Small dataset |
+ | EXL2 | ExLlamaV2 | Hours | Yes (measurement phase) |
+ | EXL3 | ExLlamaV3 | Hours | Yes (trellis calibration) |
+ | BnB NF4/INT8 | bitsandbytes | Runtime | None |
+ | HQQ | HQQ | <5 min (70B) | None |
+ | MLX | mlx_lm.convert | Minutes | None (affine) |
+ | ONNX | HF Optimum / ONNX Runtime | Minutes-Hours | Static: Yes |
+ | compressed-tensors | llm-compressor | Varies | Yes |
+ | FP8 | llm-compressor / TensorRT | Fast | Optional |
+
+ ---
+
+ ## Cross-Format Direct Conversions
+
+ | Source -> Target | Tool | Notes |
+ | ------------------------ | ------------------------------- | ------------------------------------------------------------------------------ |
+ | GPTQ -> GGUF | llama.cpp convert_hf_to_gguf.py | Dequantizes GPTQ internally, then converts. Detects GPTQ config automatically. |
+ | AWQ -> GGUF | llama.cpp convert_hf_to_gguf.py | Requires export_compatible=True when creating AWQ model. Preserves AWQ scales. |
+ | GPTQ -> GGUF (optimized) | gptq-gguf-toolkit (IST-DASLab) | Better quality: applies GPTQ error correction during K-quant quantization. |
+ | GGUF -> MLX | gguf2mlx (community) | Dequantizes then re-encodes. Limited quant type support. |
+ | FP16 SafeTensors -> MLX | mlx_lm.convert | Official. Can quantize during conversion. |
+
+ ---
+
+ ## Indirect Paths (via FP16 intermediate)
+
+ Most cross-format conversions require dequantizing to FP16 first, then re-quantizing. This causes double precision loss.
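The double precision loss can be demonstrated with a toy per-tensor affine quantizer. This is a didactic sketch only: the weights and grids are made up, and no real format quantizes this naively, but the effect (a second rounding pass adds error on top of the first) is the same one the "Indir" paths suffer.

```python
def quantize(xs, bits=4):
    """Toy per-tensor affine quantization: map floats onto 2**bits grid points."""
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (2**bits - 1) or 1.0
    q = [round((x - lo) / scale) for x in xs]
    return q, scale, lo


def dequantize(q, scale, lo):
    return [v * scale + lo for v in q]


weights = [0.013, -0.442, 0.251, 0.904, -0.718, 0.077]

# First quantization (full precision -> 4-bit): one rounding step.
once = dequantize(*quantize(weights))
# Cross-format "Indir" path: dequantize, then re-quantize onto a second grid.
twice = dequantize(*quantize(once, bits=3))

err_once = max(abs(a - b) for a, b in zip(weights, once))
err_twice = max(abs(a - b) for a, b in zip(weights, twice))
print(err_once < err_twice)  # the second pass compounds the error
```

This is why the recommendations below insist on keeping the original FP16/BF16 weights as the source for every conversion.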
+
+ AWQ -> GGUF (recommended):
+
+ 1. Quantize with AutoAWQ using export_compatible=True
+ 2. Convert with convert_hf_to_gguf.py (preserves AWQ scales)
+ 3. Quantize with llama-quantize
+
+ GPTQ <-> AWQ:
+
+ 1. Dequantize to FP16
+ 2. Re-quantize with target tool
+
+ - Much better to start from original FP16 if available
+
+ GGUF -> GPTQ/AWQ/EXL2:
+
+ 1. Dequantize GGUF to FP16 SafeTensors (gguf-py dequantize() or gguf_to_safetensors)
+ 2. Re-quantize with target tool
+
+ - Quality depends heavily on source GGUF bit width (Q8_0 OK, Q4 noticeable loss)
+
+ BnB NF4 -> any:
+
+ 1. model.dequantize() in transformers
+ 2. Save as SafeTensors
+ 3. Convert to target
+
+ Any -> MLX:
+
+ 1. Get to FP16 SafeTensors
+ 2. mlx_lm.convert with optional quantization flags
+
+ ---
+
+ ## Re-quantization Within Same Format
+
+ | Format | Tool | Notes |
+ | -------------------------- | --------------------------------- | ---------------------------------------------------------------- |
+ | GGUF (e.g. Q8_0 -> Q4_K_M) | llama-quantize --allow-requantize | Quality degrades vs quantizing from FP16. Tool warns about this. |
+ | GPTQ/AWQ/EXL2/BnB | None | Must dequantize to FP16 first, then re-quantize. |
+
+ ---
+
+ ## Dequantization (Quantized -> FP16)
+
+ All dequantization is lossy - the original precision is permanently lost.
+
+ | Format | Tool | Quality (vs original) |
+ | ------------- | --------------------------------- | -------------------------------------------------------- |
+ | GGUF Q8_0 | gguf-py dequantize() | Very close (~0.1% PPL diff) |
+ | GGUF Q6_K | Same | Minor degradation |
+ | GGUF Q4_K_M | Same | Noticeable (~1-3% PPL increase) |
+ | GGUF Q2_K/IQ2 | Same | Severe - not useful for re-quantization |
+ | GPTQ 4-bit | GPTQModel / transformers | Moderate - rounding errors visible |
+ | AWQ 4-bit | Load + extract weights | Similar to GPTQ |
+ | EXL2/EXL3 | ExLlama loader | Depends on bpw (6+ bpw very close) |
+ | BnB NF4 | model.dequantize() | Good for 4-bit (NF4 is optimal for normal distributions) |
+ | BnB INT8 | Same | Very close to original |
+ | HQQ | Native dequant (linear operation) | Depends on bit width |
+
+ ---
+
+ ## Impossible / Impractical Conversions
+
+ - GPTQ <-> AWQ (direct): fundamentally different algorithms. No mathematical mapping.
+ - EXL2 <-> EXL3 (direct): different quantization schemes (GPTQ-based vs QTIP-based).
+ - Any quantized -> original FP16: information permanently lost.
+ - GGUF Q2_K -> any high-quality format: precision loss too severe.
+ - BnB NF4 -> GGUF/GPTQ (direct): BnB is runtime library, must dequantize first.
+ - ONNX quantized -> GGUF/GPTQ (direct): completely separate ecosystems.
+ - 4-bit -> 8-bit for quality improvement: higher bit width does not recover lost information.
+
+ ---
+
+ ## Conversion Matrix
+
+ ```
+ Source -> | FP16  | GGUF  | GPTQ  | AWQ   | EXL2  | EXL3  | BnB   | HQQ   | MLX   | ONNX
+ -----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------
+ FP16       | -     | Yes   | Yes   | Yes   | Yes   | Yes   | Yes   | Yes   | Yes   | Yes
+ GGUF       | Deq   | Re*   | Indir | Indir | Indir | Indir | Indir | Indir | Deq   | Indir
+ GPTQ       | Deq   | Yes** | -     | Indir | Indir | Indir | Indir | Indir | Indir | Indir
+ AWQ        | Deq   | Yes***| Indir | -     | Indir | Indir | Indir | Indir | Indir | Indir
+ EXL2       | Deq   | Indir | Indir | Indir | -     | No    | Indir | Indir | Indir | Indir
+ EXL3       | Deq   | Indir | Indir | Indir | No    | -     | Indir | Indir | Indir | Indir
+ BnB        | Deq   | Indir | Indir | Indir | Indir | Indir | -     | Indir | Indir | Indir
+ HQQ        | Deq   | Indir | Indir | Indir | Indir | Indir | Indir | -     | Indir | Indir
+ MLX        | Deq   | Indir | Indir | Indir | Indir | Indir | Indir | Indir | -     | Indir
+ ONNX       | Deq   | Indir | Indir | Indir | Indir | Indir | Indir | Indir | Indir | -
+ ```
+
+ Legend:
+
+ - Yes = direct, well-supported path
+ - Deq = dequantize to approximate FP16 (lossy)
+ - Re\* = GGUF-to-GGUF requantize with --allow-requantize
+ - Yes\*\* = llama.cpp handles GPTQ dequant internally
+ - Yes\*\*\* = via AWQ export_compatible=True path
+ - Indir = indirect, requires dequantize-to-FP16 then re-quantize (double loss)
+ - No = not feasible (incompatible algorithms)
+
+ ---
+
+ ## Key Tools Reference
+
+ | Tool | Repository | Purpose |
+ | ----------------- | ----------------------------------------- | ------------------------------------------------ |
+ | llama.cpp | ggml-org/llama.cpp | GGUF conversion and quantization |
+ | GPTQModel | ModelCloud/GPTQModel | GPTQ/AWQ/GPTAQ quantization |
+ | AutoAWQ | casper-hansen/AutoAWQ (archived May 2025) | AWQ quantization |
+ | ExLlamaV2 | turboderp-org/exllamav2 | EXL2 quantization and inference |
+ | ExLlamaV3 | turboderp-org/exllamav3 | EXL3 quantization and inference |
+ | HQQ | mobiusml/hqq | Half-Quadratic Quantization |
+ | quantkit | xhedit/quantkit | Multi-format CLI (GGUF, GPTQ, AWQ, HQQ, EXL2) |
+ | Intel AutoRound | intel/auto-round | Multi-format export (GPTQ, AWQ, GGUF, AutoRound) |
+ | llm-compressor | vllm-project/llm-compressor | compressed-tensors for vLLM |
+ | gptq-gguf-toolkit | IST-DASLab/gptq-gguf-toolkit | GPTQ-optimized GGUF quantization |
+ | mlx-lm | ml-explore/mlx-lm | MLX format conversion and quantization |
+ | bitsandbytes | bitsandbytes-foundation/bitsandbytes | Runtime INT8/NF4 quantization |
+ | HF Optimum | huggingface/optimum-onnx | ONNX export and quantization |
+
+ ---
+
+ ## Practical Recommendations
+
+ 1. Always keep the original FP16/BF16 model. It is the best source for any quantization.
+ 2. Intel AutoRound can export to GPTQ, AWQ, GGUF, and AutoRound from a single run.
+ 3. For GGUF, best quality: FP16 -> F16 GGUF -> llama-quantize with imatrix.
+ 4. For GPTQ-to-GGUF, gptq-gguf-toolkit produces better results than naive conversion.
+ 5. 8-bit formats (FP8, INT8, Q8_0) are near-lossless and safest for intermediate conversion.
castkit-0.1.0/docs/dgx-spark-testing.md ADDED
@@ -0,0 +1,96 @@
+ # DGX Spark Testing Guide
+
+ ## 1. Prerequisites
+
+ - Python 3.12+
+ - CUDA 12.8+
+ - `uv`
+ - NVIDIA DGX Spark environment (ARM64)
+
+ ## 2. Setup
+
+ ```bash
+ git clone https://github.com/schroneko/castkit.git
+ cd castkit
+ uv sync --extra gptq --extra gguf --extra dev
+ ```
+
+ Optional (for upload-related tests):
+
+ ```bash
+ uv add huggingface-hub
+ ```
+
+ ## 3. MLX Support Note
+
+ MLX is Apple Silicon focused and is not supported on DGX Spark. Skip MLX runtime quantization/decast tests on this platform.
+
+ ## 4. Test Phases
+
+ ### Phase 1: Basic Validation
+
+ ```bash
+ uv run pytest
+ uv run ruff check src/ tests/
+ uv run castkit --help
+ uv run castkit --version
+ ```
+
+ ### Phase 2: Decast (CPU)
+
+ No GPU is required.
+
+ ```bash
+ uv run castkit decast ./path/to/model-quantized -o ./output/decast-fp16
+ uv run castkit info ./output/decast-fp16
+ ```
+
+ ### Phase 3: GPU Quantization (GPTQ/AWQ)
+
+ ```bash
+ uv run castkit convert ./path/to/fp16-model -f gptq -b 4 -g 128 -o ./output/model-gptq
+ uv run castkit convert ./path/to/fp16-model -f awq -b 4 -g 128 -o ./output/model-awq
+ ```
+
+ ### Phase 4: GGUF Conversion
+
+ Requires `llama.cpp` build with `llama-quantize`, `llama-perplexity`, and `convert_hf_to_gguf.py`.
+
+ ```bash
+ uv run castkit convert ./path/to/fp16-model -f gguf -q q4_k_m -o ./output/model-q4_k_m.gguf
+ uv run castkit info ./output/model-q4_k_m.gguf
+ ```
+
+ ### Phase 5: Cross-Format Conversion
+
+ ```bash
+ uv run castkit convert ./output/model-q4_k_m.gguf -f gptq -b 4 -o ./output/model-gptq-from-gguf
+ ```
+
+ ### Phase 6: Perplexity Measurement
+
+ ```bash
+ uv run castkit measure ./output/model-q4_k_m.gguf --dataset wikitext-2 --max-samples 128
+ ```
+
+ ## 5. Recommended Test Models
+
+ - TinyLlama/TinyLlama-1.1B-Chat-v1.0
+ - Qwen/Qwen2.5-0.5B-Instruct
+
+ Use smaller models first to validate flow, then scale up.
+
+ ## 6. DGX Spark Specific Notes
+
+ - Platform is ARM64; validate wheel/ABI availability for all dependencies.
+ - `gptqmodel` may require local compilation steps depending on CUDA/toolchain state.
+ - Verify compatibility with Blackwell SM targets in CUDA extension builds.
+ - Prefer explicit CUDA environment variables when troubleshooting build/runtime mismatch.
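The environment facts the notes above refer to can be collected in one place before debugging a build/runtime mismatch. A minimal sketch; the helper name and the variable list are illustrative, not exhaustive:

```python
import os
import shutil


def cuda_env_report() -> dict:
    """Gather basic CUDA-related facts for debugging build/runtime mismatches.
    Values are None when the variable is unset."""
    return {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "TORCH_CUDA_ARCH_LIST": os.environ.get("TORCH_CUDA_ARCH_LIST"),
    }


for key, value in cuda_env_report().items():
    print(f"{key}: {value}")
```

Attaching this report to a bug report makes extension-build failures (e.g. for `gptqmodel`) much easier to diagnose.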
+
+ ## 7. Troubleshooting
+
+ - `Backend '...' is not available`: missing extras or runtime dependencies. Re-run `uv sync --extra ...`.
+ - `Command not found: llama-quantize`: install/build `llama.cpp` and add binaries to `PATH`.
+ - `CUDA was not detected`: confirm driver/toolkit setup and `torch.cuda.is_available()`.
+ - `Could not parse perplexity`: capture full `llama-perplexity` stdout and verify output format.
+ - Slow/unstable tests: reduce model size, reduce `--max-samples`, and run phase-by-phase.