castkit 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- castkit-0.1.0/.claude/settings.local.json +14 -0
- castkit-0.1.0/CLAUDE.md +72 -0
- castkit-0.1.0/LICENSE +21 -0
- castkit-0.1.0/PKG-INFO +44 -0
- castkit-0.1.0/README.md +47 -0
- castkit-0.1.0/castkit.toml.example +33 -0
- castkit-0.1.0/docs/cli-architecture.md +151 -0
- castkit-0.1.0/docs/conversion-paths.md +178 -0
- castkit-0.1.0/docs/dgx-spark-testing.md +96 -0
- castkit-0.1.0/docs/implementation-plan.md +311 -0
- castkit-0.1.0/docs/project-overview.md +83 -0
- castkit-0.1.0/docs/quantization-formats.md +188 -0
- castkit-0.1.0/docs/trends-and-benchmarks.md +120 -0
- castkit-0.1.0/pyproject.toml +41 -0
- castkit-0.1.0/src/castkit/__init__.py +3 -0
- castkit-0.1.0/src/castkit/backends/__init__.py +0 -0
- castkit-0.1.0/src/castkit/backends/awq.py +472 -0
- castkit-0.1.0/src/castkit/backends/base.py +49 -0
- castkit-0.1.0/src/castkit/backends/gguf.py +482 -0
- castkit-0.1.0/src/castkit/backends/gptq.py +388 -0
- castkit-0.1.0/src/castkit/backends/mlx.py +562 -0
- castkit-0.1.0/src/castkit/cli/__init__.py +0 -0
- castkit-0.1.0/src/castkit/cli/main.py +846 -0
- castkit-0.1.0/src/castkit/core/__init__.py +0 -0
- castkit-0.1.0/src/castkit/core/config.py +40 -0
- castkit-0.1.0/src/castkit/core/dataset.py +32 -0
- castkit-0.1.0/src/castkit/core/download.py +36 -0
- castkit-0.1.0/src/castkit/core/metadata.py +51 -0
- castkit-0.1.0/src/castkit/core/model_info.py +294 -0
- castkit-0.1.0/src/castkit/core/utils.py +18 -0
- castkit-0.1.0/src/castkit/types.py +114 -0
- castkit-0.1.0/tests/__init__.py +0 -0
- castkit-0.1.0/tests/backends/__init__.py +0 -0
- castkit-0.1.0/tests/backends/test_awq.py +127 -0
- castkit-0.1.0/tests/backends/test_gguf.py +125 -0
- castkit-0.1.0/tests/backends/test_gptq.py +121 -0
- castkit-0.1.0/tests/backends/test_mlx.py +101 -0
- castkit-0.1.0/tests/conftest.py +101 -0
- castkit-0.1.0/tests/test_cli.py +242 -0
- castkit-0.1.0/tests/test_config.py +58 -0
- castkit-0.1.0/tests/test_cross_format.py +99 -0
- castkit-0.1.0/tests/test_download.py +33 -0
- castkit-0.1.0/tests/test_metadata.py +139 -0
- castkit-0.1.0/tests/test_model_info.py +67 -0
- castkit-0.1.0/tests/test_types.py +113 -0
- castkit-0.1.0/tests/test_utils.py +27 -0
- castkit-0.1.0/uv.lock +2282 -0

castkit-0.1.0/.claude/settings.local.json
ADDED
@@ -0,0 +1,14 @@
{
  "permissions": {
    "allow": [
      "Bash(codex exec:*)",
      "Bash(hf whoami:*)",
      "Bash(hf auth:*)",
      "Bash(hf:*)",
      "Bash(uvx ruff:*)",
      "Bash(uvx pytest tests/test_metadata.py tests/test_cross_format.py -v)",
      "Bash(uv pip:*)",
      "Bash(uv build:*)"
    ]
  }
}

castkit-0.1.0/CLAUDE.md
ADDED
@@ -0,0 +1,72 @@
# castkit

Universal model quantization and format conversion CLI.

## Quick Reference

- Language: Python 3.12+
- Package manager: uv
- CLI framework: typer
- Config format: TOML
- Linter/Formatter: ruff
- Test: pytest (uv run pytest)
- Build: hatch (via pyproject.toml)

## Commands

```
uv sync                          # install dependencies
uv sync --extra gguf             # install with GGUF support
uv sync --extra mlx              # install with MLX support
uv sync --extra dev              # install dev tools (ruff, pytest)
uv sync --all-extras             # install everything
uv run pytest                    # run tests
uv run ruff format src/ tests/   # format
uv run ruff check src/ tests/    # lint
uv run castkit --help            # run CLI
```

## Architecture

- src/castkit/cli/ - CLI commands (typer)
- src/castkit/core/ - shared utilities (config, download, model info, metadata)
- src/castkit/backends/ - format-specific backends (gguf.py, mlx.py, awq.py, gptq.py)
- tests/ - pytest tests

Each backend implements the abstract Backend interface (backends/base.py). New format support = new backend file.
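
The contents of backends/base.py are not shown in this diff, but the contract described above can be sketched as follows. This is a minimal illustration, not castkit's actual code: the method names `convert`/`decast`/`info` mirror the CLI commands, and `EchoBackend` is a hypothetical stand-in used only to show how a new format plugs in.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Backend(ABC):
    """Sketch of an abstract backend interface (illustrative, not castkit's actual base.py)."""

    name: str  # e.g. "gguf", "mlx"

    @abstractmethod
    def convert(self, model_path: Path, output: Path, **options) -> Path:
        """Quantize/convert a full-precision model into this format."""

    @abstractmethod
    def decast(self, model_path: Path, output: Path) -> Path:
        """Dequantize back to FP16 SafeTensors."""

    @abstractmethod
    def info(self, model_path: Path) -> dict:
        """Return format metadata for an existing artifact."""


class EchoBackend(Backend):
    """Trivial concrete backend, only to demonstrate the contract."""

    name = "echo"

    def convert(self, model_path, output, **options):
        return output

    def decast(self, model_path, output):
        return output

    def info(self, model_path):
        return {"format": self.name, "path": str(model_path)}
```

Adding a format then means one new file that subclasses `Backend`; code that forgets an abstract method fails at instantiation time rather than deep inside a conversion.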

## Implementation Plan

See docs/implementation-plan.md for historical reference. All 4 phases are complete:

1. Phase 1: Core + GGUF backend
2. Phase 2: MLX backend
3. Phase 3: AWQ + GPTQ backends
4. Phase 4: Config recipes, perplexity measurement, batch conversion, cross-format conversion

## Project-specific Rules

- Dependencies are split via extras: castkit[gguf], castkit[mlx], castkit[all]
- Backend imports must be lazy (import inside function) to avoid ImportError when extras are not installed
- MLX backend: Apple Silicon only. Check platform at import time.
- AWQ/GPTQ backends: use the GPTQModel library. convert requires an NVIDIA GPU; decast/info work on CPU/Apple Silicon.
- GGUF backend: shells out to llama-quantize/convert_hf_to_gguf.py. Detect presence at runtime.
- Use ruff for formatting and linting (not oxfmt/oxlint - those are for TypeScript)
- Type hints on all public functions
- Error messages must be actionable (tell the user what to do next)
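
The lazy-import rule combines with the actionable-error rule; a minimal sketch of the pattern (illustrative only, not castkit's actual code — the `require` helper and its error wording are hypothetical):

```python
import importlib.util


def require(module: str, extra: str) -> None:
    """Fail with an actionable message when an optional extra is missing."""
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"Backend dependency '{module}' is not installed. "
            f"Install it with: uv sync --extra {extra}"
        )


def convert_mlx(model_path: str) -> str:
    # Lazy import: only touch mlx when this backend is actually invoked,
    # so `castkit --help` works without the mlx extra installed.
    require("mlx_lm", "mlx")
    from mlx_lm import convert  # noqa: F401  (imported lazily on purpose)

    return f"would convert {model_path} with mlx_lm"
```

Because the check happens inside the function, importing the backend module itself never raises, and the error a user does see tells them the exact `uv sync` command to run.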

## Gotchas

- `uv sync --extra mlx` drops the dev deps (ruff, pytest). If you need lint/test, run `uv sync --extra mlx --extra dev`, or use `uvx ruff` / `uvx pytest` instead
- mlx-lm's upload feature is an internal function (`utils.upload_to_hub`), not a public API. castkit uploads directly via `huggingface_hub.HfApi` instead
- HF upload uses `api.upload_folder`. `api.upload_large_folder` also exists, but `upload_folder` is sufficient for small-to-medium models
- Homebrew's llama.cpp (convert_hf_to_gguf.py) may require a newer gguf version than the PyPI gguf package provides. If GGUF convert fails with an ImportError, update with `uv pip install 'gguf @ git+https://github.com/ggml-org/llama.cpp#subdirectory=gguf-py'`

## Reference Docs

- docs/project-overview.md - project vision, CLI design, competitive landscape
- docs/implementation-plan.md - decisions, package structure, implementation details
- docs/quantization-formats.md - all quantization format specs
- docs/conversion-paths.md - format conversion matrix and paths
- docs/trends-and-benchmarks.md - industry trends, benchmark data
- docs/cli-architecture.md - existing tool architectures, dependency info

castkit-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 schroneko

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

castkit-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,44 @@
Metadata-Version: 2.4
Name: castkit
Version: 0.1.0
Summary: Universal model quantization and format conversion CLI
Project-URL: Repository, https://github.com/schroneko/castkit
Author: schroneko
License-Expression: MIT
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: huggingface-hub>=0.25
Requires-Dist: rich>=13
Requires-Dist: typer>=0.15
Provides-Extra: all
Requires-Dist: accelerate>=1.0; extra == 'all'
Requires-Dist: datasets>=3.0; extra == 'all'
Requires-Dist: gguf>=0.10; extra == 'all'
Requires-Dist: gptqmodel>=2.0; extra == 'all'
Requires-Dist: mlx-lm>=0.20; extra == 'all'
Requires-Dist: mlx>=0.22; extra == 'all'
Requires-Dist: numpy>=1.26; extra == 'all'
Requires-Dist: safetensors>=0.4; extra == 'all'
Requires-Dist: sentencepiece>=0.2; extra == 'all'
Requires-Dist: torch>=2.4; extra == 'all'
Requires-Dist: transformers>=4.45; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.9; extra == 'dev'
Provides-Extra: gguf
Requires-Dist: gguf>=0.10; extra == 'gguf'
Requires-Dist: numpy>=1.26; extra == 'gguf'
Requires-Dist: safetensors>=0.4; extra == 'gguf'
Requires-Dist: sentencepiece>=0.2; extra == 'gguf'
Requires-Dist: torch>=2.4; extra == 'gguf'
Requires-Dist: transformers>=4.45; extra == 'gguf'
Provides-Extra: gptq
Requires-Dist: accelerate>=1.0; extra == 'gptq'
Requires-Dist: datasets>=3.0; extra == 'gptq'
Requires-Dist: gptqmodel>=2.0; extra == 'gptq'
Requires-Dist: safetensors>=0.4; extra == 'gptq'
Requires-Dist: torch>=2.4; extra == 'gptq'
Requires-Dist: transformers>=4.45; extra == 'gptq'
Provides-Extra: mlx
Requires-Dist: mlx-lm>=0.20; extra == 'mlx'
Requires-Dist: mlx>=0.22; extra == 'mlx'

castkit-0.1.0/README.md
ADDED
@@ -0,0 +1,47 @@
# castkit

castkit is a CLI tool for model quantization and format conversion across GGUF, MLX, GPTQ, and AWQ workflows, including cross-format conversion via automatic FP16 decast.

## Installation

### Homebrew (macOS)

```bash
brew install schroneko/castkit/castkit
```

### pip / uv

```bash
uv tool install castkit        # core only
uv tool install castkit[mlx]   # MLX backend (Apple Silicon)
uv tool install castkit[gguf]  # GGUF backend (requires torch)
uv tool install castkit[all]   # all backends
```

## Quick Start

```bash
# convert
castkit convert Qwen/Qwen3-0.6B -f gguf -q q4_k_m -o ./output/Qwen3-0.6B.gguf

# decast (dequantize back to FP16 SafeTensors)
castkit decast ./output/Qwen3-0.6B.gguf -o ./output/Qwen3-0.6B-fp16

# model info
castkit info ./output/Qwen3-0.6B.gguf
```

## Supported Formats

| Format         | Convert            | Decast | Info | Measure           |
| -------------- | ------------------ | ------ | ---- | ----------------- |
| GGUF           | Yes                | Yes    | Yes  | Yes               |
| MLX            | Yes                | Yes    | Yes  | Yes               |
| GPTQ           | Yes                | Yes    | Yes  | Yes               |
| AWQ            | Yes                | Yes    | Yes  | Yes               |
| FP16/BF16/FP32 | Input/Intermediate | N/A    | Yes  | Backend-dependent |

## License

MIT

castkit-0.1.0/castkit.toml.example
ADDED
@@ -0,0 +1,33 @@
[default]
output_dir = "./output"

[recipes.gguf-standard]
format = "gguf"
quant = "q4_k_m"
imatrix = true
imatrix_data = "calibration.txt"

[recipes.gguf-all]
format = "gguf"
quant = ["q8_0", "q6_k", "q5_k_m", "q4_k_m", "q3_k_m", "q2_k"]
imatrix = true

[recipes.mlx-4bit]
format = "mlx"
bits = 4
group_size = 64

[recipes.mlx-8bit]
format = "mlx"
bits = 8
group_size = 64

[recipes.awq-4bit]
format = "awq"
bits = 4
group_size = 128

[recipes.gptq-4bit]
format = "gptq"
bits = 4
group_size = 128

castkit-0.1.0/docs/cli-architecture.md
ADDED
@@ -0,0 +1,151 @@
## Existing Tool Architectures

### llama.cpp Pipeline

Two-step process:

1. convert_hf_to_gguf.py (Python): HF model -> F16 GGUF
   - Architecture detection via config.json
   - Model class instantiation via @register decorator (80+ architectures)
   - Lazy tensor loading (no memory consumption during indexing)
   - TensorNameMap translates HF names -> GGUF names
   - Supports SafeTensors (mmap), PyTorch, remote HF streaming
   - Dependencies: numpy, torch, safetensors, sentencepiece, gguf

2. llama-quantize (C++): F16 GGUF -> quantized GGUF
   - Sequential tensor processing with mmap
   - imatrix support via --imatrix flag
   - Per-tensor type control via --tensor-type REGEX:TYPE
   - Can process any model size on CPU (no GPU needed for basic quant)

### mlx-lm convert

Single-step Python:

- Downloads HF model via transformers/huggingface_hub
- Maps weights to MLX arrays
- Optional quantization: to_quantized(group_size, bits) on Linear/Embedding layers
- Outputs SafeTensors + config.json
- --upload-repo for direct HF Hub upload
- Dependencies: mlx, mlx-lm, transformers, huggingface_hub, safetensors
- Apple Silicon only

### GPTQModel

Python library + CLI:

- Layer-by-layer Hessian-based optimization
- Multi-GPU data-parallel quantization
- Supports GPTQ, AWQ, QQQ, GPTAQ, EoRA, GAR methods
- Dynamic per-module mixed quantization
- Inference kernels: Marlin (fastest), ExLlama V2/V1, Triton, Torch, BitBLAS

### AutoAWQ (archived May 2025)

Python library:

- Activation analysis -> salient channel identification -> per-channel scaling
- 10-30 min for 7B (much faster than GPTQ)
- Replaced by llm-compressor (AWQModifier) and GPTQModel

### ExLlamaV2 (EXL2)

Two-phase Python:

1. Measurement: model quantized ~12 times with different params, error measured per layer. Saved to measurement.json.
2. Quantization: optimizer selects per-layer bit allocations to minimize total error at target bpw.

- Memory: only the largest transformer layer must fit in VRAM
- 7B: ~16 GB RAM, ~8 GB VRAM. 70B: ~64 GB RAM, ~24 GB VRAM.

### quantkit (llm-quantkit)

Thin Python CLI wrapper:

```
quantkit/
  cli.py           # CLI entry point
  quantkit.py      # Core orchestration
  convert.py       # General conversion logic
  convert_exl2.py  # EXL2-specific
  convert_hf.py    # HuggingFace download/conversion
  safetensor.py    # SafeTensor utilities
```

- Delegates to the respective library for each format
- Limitations: no MLX, no ONNX, limited config exposure, Python 3.12+ issues, no imatrix generation, no batch multi-format quantization

## Memory Requirements for Quantization

Rule of thumb: an FP16 model requires ~2 GB per billion parameters.

| Model Size | FP16 Size | GPTQ/AWQ VRAM  | GGUF (CPU) | EXL2 VRAM               |
| ---------- | --------- | -------------- | ---------- | ----------------------- |
| 7B         | ~14 GB    | ~16-24 GB      | CPU only   | ~8 GB VRAM + 16 GB RAM  |
| 13B        | ~26 GB    | ~32-48 GB      | CPU only   | ~16 GB VRAM + 32 GB RAM |
| 34B        | ~68 GB    | Fails on 24 GB | CPU only   | ~24 GB VRAM + 64 GB RAM |
| 70B        | ~140 GB   | Fails on 24 GB | CPU only   | ~24 GB VRAM + 64 GB RAM |
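
The rule of thumb translates directly into arithmetic; a small sketch (the 4.5 bpw figure below is an illustrative quantized bit width, not tied to a specific quant type, and scale/metadata overhead is ignored):

```python
def fp16_size_gb(params_billion: float) -> float:
    """FP16 weights: 2 bytes per parameter, i.e. ~2 GB per billion parameters."""
    return params_billion * 2.0


def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough quantized weight size, ignoring scales/metadata overhead."""
    return params_billion * bits_per_weight / 8.0


# 7B at FP16 -> 14.0 GB, matching the table above
print(fp16_size_gb(7))                       # 14.0
# 7B at an assumed ~4.5 bpw -> ~3.9 GB
print(round(quantized_size_gb(7, 4.5), 1))   # 3.9
```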

## Common Dependencies

Core (required by almost all):

- torch/pytorch: tensor operations, model loading
- transformers: model architectures, tokenizers, from_pretrained
- safetensors: efficient weight storage (standard on HF)
- huggingface_hub: model downloading (snapshot_download, hf_hub_download)
- numpy: numerical operations
- sentencepiece: tokenizer (LLaMA, Gemma, etc.)
- accelerate: multi-GPU model loading and sharding
- datasets: calibration dataset loading

Format-specific:

- gguf (from llama.cpp): GGUF read/write
- gptqmodel: GPTQ quantization
- autoawq / llm-compressor: AWQ quantization
- exllamav2: EXL2 quantization
- bitsandbytes: NF4/INT8 runtime
- optimum / optimum-onnx: ONNX export
- onnxruntime: ONNX inference/quantization
- hqq: Half-Quadratic Quantization
- mlx, mlx-lm: MLX format (Apple Silicon only)

## Model Download Patterns

Two approaches:

1. transformers.AutoModelForCausalLM.from_pretrained(): loads the model into memory. Used by GPTQ/AWQ, which need the model for calibration.
2. huggingface_hub.snapshot_download(): downloads to cache without loading. Used by llama.cpp and other file-based tools.

HF cache: ~/.cache/huggingface/hub/ with content-addressable storage + symlinks.

## HuggingFace Transformers Quantization Integration

| Format             | Loading                                                   | Backend            | Auto-detection       |
| ------------------ | --------------------------------------------------------- | ------------------ | -------------------- |
| GPTQ               | from_pretrained()                                         | GPTQModel          | quantize_config.json |
| AWQ                | from_pretrained()                                         | AutoAWQ            | config               |
| BnB                | from_pretrained(quantization_config=BitsAndBytesConfig()) | bitsandbytes       | explicit config      |
| HQQ                | from_pretrained(quantization_config=HqqConfig())          | HQQ                | explicit config      |
| GGUF               | from_pretrained(gguf_file="model.gguf")                   | gguf-py            | explicit param       |
| compressed-tensors | from_pretrained()                                         | compressed-tensors | auto                 |
| EXL2/EXL3          | NOT supported                                             | ExLlama only       | -                    |

Note: transformers loads GGUF by dequantizing to FP32, so it uses at least as much memory as the FP16 model. For fine-tuning/conversion only, not inference.

## Design Considerations for castkit

1. Format-specific backends are fundamentally different: GGUF is CPU-based C++ (fast, no GPU), GPTQ/AWQ require GPU + calibration, EXL2 has a two-phase flow, ONNX has its own pipeline. Must orchestrate as separate backends.

2. Calibration divides UX: GGUF quantization takes minutes with no calibration. GPTQ/AWQ/EXL2 take hours with calibration data. Surface this difference clearly.

3. Memory constraints vary: GGUF can quantize any model on CPU. GPTQ/AWQ fail for models larger than VRAM. EXL2 only needs the largest layer to fit. Surface constraints upfront.

4. quantkit's thin-wrapper approach works but is limited. Room for improvement: MLX support, ONNX support, imatrix generation, batch multi-format output, better config exposure.

5. Ecosystem consolidation: AutoAWQ archived, GPTQModel expanding, llm-compressor for vLLM. Track these shifts.

6. Common infrastructure: HF download, model metadata parsing, output upload. Natural abstraction layer.

7. Intel AutoRound can export to multiple formats from a single quantization run - consider a similar multi-output capability.

castkit-0.1.0/docs/conversion-paths.md
ADDED
@@ -0,0 +1,178 @@
# Conversion Paths Reference

All conversion paths between quantization formats, with tooling and quality implications.

## Terminology

- GGUF: self-contained binary file format (.gguf) by llama.cpp
- SafeTensors: secure tensor serialization container (.safetensors) by HuggingFace. Standard container for HF-ecosystem models.
- GPTQ, AWQ, EXL2/EXL3, HQQ, bitsandbytes: quantization algorithms/methods. Output stored in SafeTensors with metadata JSON.
- MLX: Apple Silicon native format using SafeTensors-style storage.

---

## Direct Paths (FP16/BF16 source)

All quantization methods are designed to take full-precision weights as input. This is the golden path.

| Target             | Tool                                   | Time (7B)     | Calibration               |
| ------------------ | -------------------------------------- | ------------- | ------------------------- |
| GGUF               | convert_hf_to_gguf.py + llama-quantize | Minutes       | Optional (imatrix)        |
| GPTQ               | GPTQModel                              | 2-4 hours     | 128 samples from C4       |
| AWQ                | AutoAWQ / llm-compressor               | 10-30 min     | Small dataset             |
| EXL2               | ExLlamaV2                              | Hours         | Yes (measurement phase)   |
| EXL3               | ExLlamaV3                              | Hours         | Yes (trellis calibration) |
| BnB NF4/INT8       | bitsandbytes                           | Runtime       | None                      |
| HQQ                | HQQ                                    | <5 min (70B)  | None                      |
| MLX                | mlx_lm.convert                         | Minutes       | None (affine)             |
| ONNX               | HF Optimum / ONNX Runtime              | Minutes-Hours | Static: Yes               |
| compressed-tensors | llm-compressor                         | Varies        | Yes                       |
| FP8                | llm-compressor / TensorRT              | Fast          | Optional                  |

---

## Cross-Format Direct Conversions

| Source -> Target         | Tool                            | Notes                                                                          |
| ------------------------ | ------------------------------- | ------------------------------------------------------------------------------ |
| GPTQ -> GGUF             | llama.cpp convert_hf_to_gguf.py | Dequantizes GPTQ internally, then converts. Detects GPTQ config automatically. |
| AWQ -> GGUF              | llama.cpp convert_hf_to_gguf.py | Requires export_compatible=True when creating AWQ model. Preserves AWQ scales. |
| GPTQ -> GGUF (optimized) | gptq-gguf-toolkit (IST-DASLab)  | Better quality: applies GPTQ error correction during K-quant quantization.     |
| GGUF -> MLX              | gguf2mlx (community)            | Dequantizes then re-encodes. Limited quant type support.                       |
| FP16 SafeTensors -> MLX  | mlx_lm.convert                  | Official. Can quantize during conversion.                                      |

---

## Indirect Paths (via FP16 intermediate)

Most cross-format conversions require dequantizing to FP16 first, then re-quantizing. This causes double precision loss.

AWQ -> GGUF (recommended):

1. Quantize with AutoAWQ using export_compatible=True
2. Convert with convert_hf_to_gguf.py (preserves AWQ scales)
3. Quantize with llama-quantize

GPTQ <-> AWQ:

1. Dequantize to FP16
2. Re-quantize with target tool

- Much better to start from the original FP16 if available

GGUF -> GPTQ/AWQ/EXL2:

1. Dequantize GGUF to FP16 SafeTensors (gguf-py dequantize() or gguf_to_safetensors)
2. Re-quantize with target tool

- Quality depends heavily on the source GGUF bit width (Q8_0 OK, Q4 noticeable loss)

BnB NF4 -> any:

1. model.dequantize() in transformers
2. Save as SafeTensors
3. Convert to target

Any -> MLX:

1. Get to FP16 SafeTensors
2. mlx_lm.convert with optional quantization flags

---

## Re-quantization Within Same Format

| Format                     | Tool                              | Notes                                                            |
| -------------------------- | --------------------------------- | ---------------------------------------------------------------- |
| GGUF (e.g. Q8_0 -> Q4_K_M) | llama-quantize --allow-requantize | Quality degrades vs quantizing from FP16. Tool warns about this. |
| GPTQ/AWQ/EXL2/BnB          | None                              | Must dequantize to FP16 first, then re-quantize.                 |

---

## Dequantization (Quantized -> FP16)

All dequantization is lossy - the original precision is permanently lost.

| Format        | Tool                              | Quality (vs original)                                    |
| ------------- | --------------------------------- | -------------------------------------------------------- |
| GGUF Q8_0     | gguf-py dequantize()              | Very close (~0.1% PPL diff)                              |
| GGUF Q6_K     | Same                              | Minor degradation                                        |
| GGUF Q4_K_M   | Same                              | Noticeable (~1-3% PPL increase)                          |
| GGUF Q2_K/IQ2 | Same                              | Severe - not useful for re-quantization                  |
| GPTQ 4-bit    | GPTQModel / transformers          | Moderate - rounding errors visible                       |
| AWQ 4-bit     | Load + extract weights            | Similar to GPTQ                                          |
| EXL2/EXL3     | ExLlama loader                    | Depends on bpw (6+ bpw very close)                       |
| BnB NF4       | model.dequantize()                | Good for 4-bit (NF4 is optimal for normal distributions) |
| BnB INT8      | Same                              | Very close to original                                   |
| HQQ           | Native dequant (linear operation) | Depends on bit width                                     |

---

## Impossible / Impractical Conversions

- GPTQ <-> AWQ (direct): fundamentally different algorithms. No mathematical mapping.
- EXL2 <-> EXL3 (direct): different quantization schemes (GPTQ-based vs QTIP-based).
- Any quantized -> original FP16: information permanently lost.
- GGUF Q2_K -> any high-quality format: precision loss too severe.
- BnB NF4 -> GGUF/GPTQ (direct): BnB is a runtime library, must dequantize first.
- ONNX quantized -> GGUF/GPTQ (direct): completely separate ecosystems.
- 4-bit -> 8-bit for quality improvement: higher bit width does not recover lost information.

---

## Conversion Matrix

```
Source ->  | FP16  | GGUF  | GPTQ  | AWQ   | EXL2  | EXL3  | BnB   | HQQ   | MLX   | ONNX
-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------
FP16       | -     | Yes   | Yes   | Yes   | Yes   | Yes   | Yes   | Yes   | Yes   | Yes
GGUF       | Deq   | Re*   | Indir | Indir | Indir | Indir | Indir | Indir | Deq   | Indir
GPTQ       | Deq   | Yes** | -     | Indir | Indir | Indir | Indir | Indir | Indir | Indir
AWQ        | Deq   | Yes***| Indir | -     | Indir | Indir | Indir | Indir | Indir | Indir
EXL2       | Deq   | Indir | Indir | Indir | -     | No    | Indir | Indir | Indir | Indir
EXL3       | Deq   | Indir | Indir | Indir | No    | -     | Indir | Indir | Indir | Indir
BnB        | Deq   | Indir | Indir | Indir | Indir | Indir | -     | Indir | Indir | Indir
HQQ        | Deq   | Indir | Indir | Indir | Indir | Indir | Indir | -     | Indir | Indir
MLX        | Deq   | Indir | Indir | Indir | Indir | Indir | Indir | Indir | -     | Indir
ONNX       | Deq   | Indir | Indir | Indir | Indir | Indir | Indir | Indir | Indir | -
```

Legend:

- Yes = direct, well-supported path
- Deq = dequantize to approximate FP16 (lossy)
- Re\* = GGUF-to-GGUF requantize with --allow-requantize
- Yes\*\* = llama.cpp handles GPTQ dequant internally
- Yes\*\*\* = via AWQ export_compatible=True path
- Indir = indirect, requires dequantize-to-FP16 then re-quantize (double loss)
- No = not feasible (incompatible algorithms)
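
The matrix lends itself to a lookup table; a sketch of how a tool might classify a requested path (the data is transcribed from the matrix above, while `conversion_path`, the format keys, and the path labels are illustrative, not castkit's actual code):

```python
# Path kinds: "direct", "deq" (dequantize to FP16), "requant", "indirect", "no", "none"
SPECIAL = {
    ("fp16", "fp16"): "none",
    ("gguf", "gguf"): "requant",   # llama-quantize --allow-requantize
    ("gptq", "gguf"): "direct",    # llama.cpp dequantizes GPTQ internally
    ("awq", "gguf"): "direct",     # via export_compatible=True path
    ("gguf", "mlx"): "deq",        # gguf2mlx: dequantize then re-encode
    ("exl2", "exl3"): "no",        # incompatible algorithms
    ("exl3", "exl2"): "no",
}

FORMATS = {"fp16", "gguf", "gptq", "awq", "exl2", "exl3", "bnb", "hqq", "mlx", "onnx"}


def conversion_path(src: str, dst: str) -> str:
    if src not in FORMATS or dst not in FORMATS:
        raise ValueError(f"unknown format: {src} -> {dst}")
    if (src, dst) in SPECIAL:
        return SPECIAL[(src, dst)]
    if src == "fp16":
        return "direct"    # golden path: quantize from full precision
    if dst == "fp16":
        return "deq"       # lossy dequantization
    if src == dst:
        return "none"
    return "indirect"      # dequantize to FP16, then re-quantize (double loss)
```

Encoding only the special cases and deriving the rest from three rules (FP16 source is direct, FP16 target is dequantization, everything else is indirect) keeps the table small and matches every cell of the matrix.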

---

## Key Tools Reference

| Tool              | Repository                                | Purpose                                          |
| ----------------- | ----------------------------------------- | ------------------------------------------------ |
| llama.cpp         | ggml-org/llama.cpp                        | GGUF conversion and quantization                 |
| GPTQModel         | ModelCloud/GPTQModel                      | GPTQ/AWQ/GPTAQ quantization                      |
| AutoAWQ           | casper-hansen/AutoAWQ (archived May 2025) | AWQ quantization                                 |
| ExLlamaV2         | turboderp-org/exllamav2                   | EXL2 quantization and inference                  |
| ExLlamaV3         | turboderp-org/exllamav3                   | EXL3 quantization and inference                  |
| HQQ               | dropbox/hqq                               | Half-Quadratic Quantization                      |
| quantkit          | xhedit/quantkit                           | Multi-format CLI (GGUF, GPTQ, AWQ, HQQ, EXL2)    |
| Intel AutoRound   | intel/auto-round                          | Multi-format export (GPTQ, AWQ, GGUF, AutoRound) |
| llm-compressor    | vllm-project/llm-compressor               | compressed-tensors for vLLM                      |
| gptq-gguf-toolkit | IST-DASLab/gptq-gguf-toolkit              | GPTQ-optimized GGUF quantization                 |
| mlx-lm            | ml-explore/mlx-lm                         | MLX format conversion and quantization           |
| bitsandbytes      | bitsandbytes-foundation/bitsandbytes      | Runtime INT8/NF4 quantization                    |
| HF Optimum        | huggingface/optimum-onnx                  | ONNX export and quantization                     |

---

## Practical Recommendations

1. Always keep the original FP16/BF16 model. It is the best source for any quantization.
2. Intel AutoRound can export to GPTQ, AWQ, GGUF, and AutoRound from a single run.
3. For GGUF, best quality: FP16 -> F16 GGUF -> llama-quantize with imatrix.
4. For GPTQ-to-GGUF, gptq-gguf-toolkit produces better results than naive conversion.
5. 8-bit formats (FP8, INT8, Q8_0) are near-lossless and the safest intermediates for conversion.

castkit-0.1.0/docs/dgx-spark-testing.md
ADDED
@@ -0,0 +1,96 @@
# DGX Spark Testing Guide

## 1. Prerequisites

- Python 3.12+
- CUDA 12.8+
- `uv`
- NVIDIA DGX Spark environment (ARM64)

## 2. Setup

```bash
git clone https://github.com/schroneko/castkit.git
cd castkit
uv sync --extra gptq --extra gguf --extra dev
```

Optional (for upload-related tests):

```bash
uv add huggingface-hub
```

## 3. MLX Support Note

MLX targets Apple Silicon and is not supported on DGX Spark. Skip MLX runtime quantization/decast tests on this platform.

## 4. Test Phases

### Phase 1: Basic Validation

```bash
uv run pytest
uv run ruff check src/ tests/
uv run castkit --help
uv run castkit --version
```

### Phase 2: Decast (CPU)

No GPU is required.

```bash
uv run castkit decast ./path/to/model-quantized -o ./output/decast-fp16
uv run castkit info ./output/decast-fp16
```

### Phase 3: GPU Quantization (GPTQ/AWQ)

```bash
uv run castkit convert ./path/to/fp16-model -f gptq -b 4 -g 128 -o ./output/model-gptq
uv run castkit convert ./path/to/fp16-model -f awq -b 4 -g 128 -o ./output/model-awq
```

### Phase 4: GGUF Conversion

Requires a `llama.cpp` build with `llama-quantize`, `llama-perplexity`, and `convert_hf_to_gguf.py`.

```bash
uv run castkit convert ./path/to/fp16-model -f gguf -q q4_k_m -o ./output/model-q4_k_m.gguf
uv run castkit info ./output/model-q4_k_m.gguf
```

### Phase 5: Cross-Format Conversion

```bash
uv run castkit convert ./output/model-q4_k_m.gguf -f gptq -b 4 -o ./output/model-gptq-from-gguf
```

### Phase 6: Perplexity Measurement

```bash
uv run castkit measure ./output/model-q4_k_m.gguf --dataset wikitext-2 --max-samples 128
```

## 5. Recommended Test Models

- TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Qwen/Qwen2.5-0.5B-Instruct

Use smaller models first to validate the flow, then scale up.

## 6. DGX Spark Specific Notes

- The platform is ARM64; validate wheel/ABI availability for all dependencies.
- `gptqmodel` may require local compilation steps depending on CUDA/toolchain state.
- Verify compatibility with Blackwell SM targets in CUDA extension builds.
- Prefer explicit CUDA environment variables when troubleshooting build/runtime mismatches.

## 7. Troubleshooting

- `Backend '...' is not available`: missing extras or runtime dependencies. Re-run `uv sync --extra ...`.
- `Command not found: llama-quantize`: install/build `llama.cpp` and add its binaries to `PATH`.
- `CUDA was not detected`: confirm driver/toolkit setup and `torch.cuda.is_available()`.
- `Could not parse perplexity`: capture the full `llama-perplexity` stdout and verify the output format.
- Slow/unstable tests: reduce model size, reduce `--max-samples`, and run phase-by-phase.