PyPI - modelinfo-cli - Versions diffs - 1.4.0__tar.gz → 1.4.2__tar.gz - Mend

modelinfo-cli 1.4.0 tar.gz → 1.4.2 tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (27)

{modelinfo_cli-1.4.0/src/modelinfo_cli.egg-info → modelinfo_cli-1.4.2}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: modelinfo-cli
-Version: 1.4.0
-Summary: A sub-100ms, zero-dependency CLI to inspect ML models (.safetensors, .gguf) locally or via Hugging Face, calculate exact VRAM footprints, and determine hardware fit.
+Version: 1.4.2
+Summary: A CLI tool to inspect ML checkpoints (.safetensors, .gguf, .pt) and calculate inference VRAM, multi-GPU memory splits, and vLLM serving capacity.
 Author: ModelInfo Contributors
 License: MIT
 Requires-Python: >=3.10
@@ -26,14 +26,14 @@ Dynamic: license-file
 ModelInfo is a CLI tool that inspects machine learning model checkpoints (`.safetensors`, `.gguf`, `.pt`) and calculates hardware requirements completely offline.
-It reads binary headers directly using the Python standard library. By bypassing full tensor payload loading and strictly excluding heavy ecosystems like PyTorch or HuggingFace, the tool executes in under 100 milliseconds.
+It reads binary headers directly using the Python standard library. It skips the full tensor payload entirely (no PyTorch, no HuggingFace) and parses in under 100ms.
 ## Features
 - **Zero-Dependency Parsing**: Reads `.safetensors` 8-byte JSON prefixes and `.gguf` binary key-value metadata directly via `struct` and `json` (falling back to `config.json` if needed).
 - **Remote Hugging Face Hub Inspection**: Pass a repo ID (e.g., `meta-llama/Llama-2-7b-hf`) and it uses concurrent byte-range requests to read the headers off the CDN in under 2 seconds. No need to download the checkpoint.
 - Parses `model.safetensors.index.json` to support sharded models without crashing on partial downloads.
-- **Dynamic VRAM & Subtractive vLLM Math**: Calculates exact VRAM limits based on the model's architecture and your target context length. If you use the `--vllm` flag, it switches to a subtractive "Serving Capacity" engine that calculates exactly how many tokens fit in the PagedAttention pool based on your `--gpu-util` ratio.
+- **Dynamic VRAM & vLLM Capacity Planning**: Calculates exact VRAM limits based on the model's architecture and your target context length. If you use the `--vllm` flag, it switches to a "Serving Capacity" simulation that calculates exactly how many tokens fit in the PagedAttention pool based on your `--gpu-util` ratio.
 - **Hardware Fit Diagnostics**: Check if a model fits your cluster with `--gpu` (e.g. `--gpu RTX4090` or `--gpu auto`). It enforces Apple Silicon's 75% unified memory wire limit, and you can explicitly model multi-GPU NCCL communication penalties with `--topology` and `--strategy`.
 - **Side-by-Side Comparison**: Pass multiple models to trigger a comparison table (parameters, data types, context lengths, VRAM footprints).
 - Uses exact `ggml_type` mappings for GGUF formats to calculate byte-scaling coefficients, preventing VRAM under-reporting.
@@ -66,7 +66,7 @@ pip install -e ".[dev]"
 ## Testing
-The testing suite enforces cross-platform structural integrity and guards the zero-dependency latency constraint. Tests are isolated against custom binary mocks in `tests/fixtures/`.
+Tests cover the binary parsers and verify the sub-100ms local parse constraint using binary mocks in `tests/fixtures/`.
 Run the test suite using pytest:
@@ -165,7 +165,7 @@ Qwen2.5-0.5B       494.0M    BF16     8K         1.6 GB      ✓
 | `--gpu` | `--gpu rtx4090` | Check if the model fits. Accepts GPU names (`rtx4090`, `b200`, `rx7900xtx`), explicit VRAM limits in GB (`--gpu 24`), or local hardware auto-discovery (`--gpu auto`). |
 | `--context` | `--context 32768` | Adjust the target KV cache length. Essential for calculating the dynamic memory footprint of long-context models. Defaults to `8192`. |
 | `--max-vram` | `--max-vram 80` | Adjusts the color-coded heat mapping thresholds (Green/Yellow/Red) in the terminal output to match a specific hardware ceiling. |
-| `--vllm` | `--vllm --gpu auto` | Switches from additive memory checking to a subtractive serving capacity estimation. Shows exactly how many tokens fit in the PagedAttention pool. |
+| `--vllm` | `--vllm --gpu auto` | Switches from additive memory checking to a serving capacity simulation. Shows exactly how many tokens fit in the PagedAttention pool. |
 | `--gpu-util` | `--gpu-util 0.9` | Sets the vLLM `gpu_memory_utilization` ratio. Defaults to `0.9` (reserves 10% for PyTorch context). |
 | `--topology` | `--topology nvlink` | Set interconnect topology to calculate exact communication overhead penalties (`nvlink`, `pcie4`, `pcie3`). Defaults to `pcie4`. |
 | `--strategy` | `--strategy tp` | Selects the parallelization strategy for multi-GPU setups (`tp` for Tensor Parallelism, `pp` for Pipeline Parallelism). Defaults to `tp`. |
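
The "8-byte JSON prefix" mentioned in the feature list refers to the public safetensors layout: an unsigned 64-bit little-endian length, followed by that many bytes of JSON. A minimal sketch of reading it with only `struct` and `json` might look like the following. The function names and the synthetic demo file are illustrative assumptions, not code from the package.

```python
import json
import struct
import tempfile
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError("file too short to hold the 8-byte length prefix")
        # The prefix is an unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", prefix)
        raw = f.read(header_len)
        if len(raw) != header_len:
            raise ValueError("truncated header")
    return json.loads(raw)


def write_demo_file(path):
    """Write a tiny synthetic file in the same layout (hypothetical demo data)."""
    header = {
        "__metadata__": {"format": "pt"},
        "w": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]},
    }
    raw = json.dumps(header).encode("utf-8")
    # 2x2 F16 tensor = 8 bytes of payload, which the reader never touches.
    path.write_bytes(struct.pack("<Q", len(raw)) + raw + b"\x00" * 8)


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.safetensors"
    write_demo_file(demo)
    header = read_safetensors_header(demo)

print(header["w"]["shape"], header["w"]["dtype"])
```

Because only `8 + header_len` bytes are read, the cost is independent of checkpoint size, which is consistent with the latency claim in the README text.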

{modelinfo_cli-1.4.0 → modelinfo_cli-1.4.2}/README.md RENAMED

@@ -8,14 +8,14 @@
 ModelInfo is a CLI tool that inspects machine learning model checkpoints (`.safetensors`, `.gguf`, `.pt`) and calculates hardware requirements completely offline.
-It reads binary headers directly using the Python standard library. By bypassing full tensor payload loading and strictly excluding heavy ecosystems like PyTorch or HuggingFace, the tool executes in under 100 milliseconds.
+It reads binary headers directly using the Python standard library. It skips the full tensor payload entirely (no PyTorch, no HuggingFace) and parses in under 100ms.
 ## Features
 - **Zero-Dependency Parsing**: Reads `.safetensors` 8-byte JSON prefixes and `.gguf` binary key-value metadata directly via `struct` and `json` (falling back to `config.json` if needed).
 - **Remote Hugging Face Hub Inspection**: Pass a repo ID (e.g., `meta-llama/Llama-2-7b-hf`) and it uses concurrent byte-range requests to read the headers off the CDN in under 2 seconds. No need to download the checkpoint.
 - Parses `model.safetensors.index.json` to support sharded models without crashing on partial downloads.
-- **Dynamic VRAM & Subtractive vLLM Math**: Calculates exact VRAM limits based on the model's architecture and your target context length. If you use the `--vllm` flag, it switches to a subtractive "Serving Capacity" engine that calculates exactly how many tokens fit in the PagedAttention pool based on your `--gpu-util` ratio.
+- **Dynamic VRAM & vLLM Capacity Planning**: Calculates exact VRAM limits based on the model's architecture and your target context length. If you use the `--vllm` flag, it switches to a "Serving Capacity" simulation that calculates exactly how many tokens fit in the PagedAttention pool based on your `--gpu-util` ratio.
 - **Hardware Fit Diagnostics**: Check if a model fits your cluster with `--gpu` (e.g. `--gpu RTX4090` or `--gpu auto`). It enforces Apple Silicon's 75% unified memory wire limit, and you can explicitly model multi-GPU NCCL communication penalties with `--topology` and `--strategy`.
 - **Side-by-Side Comparison**: Pass multiple models to trigger a comparison table (parameters, data types, context lengths, VRAM footprints).
 - Uses exact `ggml_type` mappings for GGUF formats to calculate byte-scaling coefficients, preventing VRAM under-reporting.
@@ -48,7 +48,7 @@ pip install -e ".[dev]"
 ## Testing
-The testing suite enforces cross-platform structural integrity and guards the zero-dependency latency constraint. Tests are isolated against custom binary mocks in `tests/fixtures/`.
+Tests cover the binary parsers and verify the sub-100ms local parse constraint using binary mocks in `tests/fixtures/`.
 Run the test suite using pytest:
@@ -147,7 +147,7 @@ Qwen2.5-0.5B       494.0M    BF16     8K         1.6 GB      ✓
 | `--gpu` | `--gpu rtx4090` | Check if the model fits. Accepts GPU names (`rtx4090`, `b200`, `rx7900xtx`), explicit VRAM limits in GB (`--gpu 24`), or local hardware auto-discovery (`--gpu auto`). |
 | `--context` | `--context 32768` | Adjust the target KV cache length. Essential for calculating the dynamic memory footprint of long-context models. Defaults to `8192`. |
 | `--max-vram` | `--max-vram 80` | Adjusts the color-coded heat mapping thresholds (Green/Yellow/Red) in the terminal output to match a specific hardware ceiling. |
-| `--vllm` | `--vllm --gpu auto` | Switches from additive memory checking to a subtractive serving capacity estimation. Shows exactly how many tokens fit in the PagedAttention pool. |
+| `--vllm` | `--vllm --gpu auto` | Switches from additive memory checking to a serving capacity simulation. Shows exactly how many tokens fit in the PagedAttention pool. |
 | `--gpu-util` | `--gpu-util 0.9` | Sets the vLLM `gpu_memory_utilization` ratio. Defaults to `0.9` (reserves 10% for PyTorch context). |
 | `--topology` | `--topology nvlink` | Set interconnect topology to calculate exact communication overhead penalties (`nvlink`, `pcie4`, `pcie3`). Defaults to `pcie4`. |
 | `--strategy` | `--strategy tp` | Selects the parallelization strategy for multi-GPU setups (`tp` for Tensor Parallelism, `pp` for Pipeline Parallelism). Defaults to `tp`. |
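
The `--vllm` row contrasts "additive memory checking" with a "serving capacity simulation". The diff does not show the package's formulas, so the sketch below is only one plausible reading: the additive mode sums weights and KV cache for a chosen context, while the capacity mode starts from the `gpu_memory_utilization` budget and asks how many tokens the remaining pool can hold. All function names, the per-token KV formula, and the block size of 16 are assumptions made for illustration.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, kv_dim, dtype_bytes=2):
    # One key vector and one value vector per layer, hence the factor of 2.
    return 2 * num_layers * kv_dim * dtype_bytes


def additive_footprint(weight_bytes, num_layers, kv_dim, context_len, dtype_bytes=2):
    """Additive view: weights plus KV cache for a fixed context length."""
    return weight_bytes + context_len * kv_bytes_per_token(num_layers, kv_dim, dtype_bytes)


def serving_capacity_tokens(total_vram_bytes, weight_bytes, num_layers, kv_dim,
                            gpu_util=0.9, dtype_bytes=2, block_size=16):
    """Capacity view: tokens that fit in what is left of the utilization budget."""
    budget = int(total_vram_bytes * gpu_util)
    pool = budget - weight_bytes
    if pool <= 0:
        return 0  # the weights alone exceed the budget
    tokens = pool // kv_bytes_per_token(num_layers, kv_dim, dtype_bytes)
    # Paged allocators hand out whole blocks, so round down to a block multiple.
    return (tokens // block_size) * block_size


# Hypothetical model: 32 layers, kv_dim 1024, 1 GiB of weights on a 4 GiB device.
print(additive_footprint(1 * GIB, 32, 1024, context_len=8192) / GIB)
print(serving_capacity_tokens(4 * GIB, 1 * GIB, 32, 1024, gpu_util=0.5))
```

A real estimate would also need to account for activation memory and runtime overhead, which this sketch deliberately leaves out.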

{modelinfo_cli-1.4.0 → modelinfo_cli-1.4.2}/pyproject.toml RENAMED

@@ -4,8 +4,8 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "modelinfo-cli"
-version = "1.4.0"
-description = "A sub-100ms, zero-dependency CLI to inspect ML models (.safetensors, .gguf) locally or via Hugging Face, calculate exact VRAM footprints, and determine hardware fit."
+version = "1.4.2"
+description = "A CLI tool to inspect ML checkpoints (.safetensors, .gguf, .pt) and calculate inference VRAM, multi-GPU memory splits, and vLLM serving capacity."
 readme = "README.md"
 requires-python = ">=3.10"
 license = { text = "MIT" }

{modelinfo_cli-1.4.0 → modelinfo_cli-1.4.2}/src/modelinfo/__init__.py RENAMED

@@ -2,4 +2,4 @@
 modelinfo - A high-performance CLI utility for inspecting ML model checkpoints.
 """
-__version__ = "1.4.0"
+__version__ = "1.4.2"

{modelinfo_cli-1.4.0 → modelinfo_cli-1.4.2}/src/modelinfo/architecture.py RENAMED

@@ -1,5 +1,3 @@
-import os
-import json
 from typing import Any, Dict, Tuple
 def extract_architecture(tensors: Dict[str, Any], config: Dict[str, Any] = None) -> Tuple[int, int, bool]:
@@ -10,16 +8,16 @@ def extract_architecture(tensors: Dict[str, Any], config: Dict[str, Any] = None)
     num_layers = 0
     kv_dim = 0
     is_estimate = False
     metadata = tensors.get("__metadata__", {})
     gen_arch = metadata.get("general.architecture")
     # 1. Attempt explicit GGUF metadata
     if gen_arch:
         arch_str = str(gen_arch)
         num_layers = metadata.get(f"{arch_str}.block_count", 0)
         kv_heads = metadata.get(f"{arch_str}.attention.head_count_kv", 0)
         key_length = metadata.get(f"{arch_str}.attention.key_length")
         if not key_length:
             embed_len = metadata.get(f"{arch_str}.embedding_length", 0)
@@ -28,7 +26,7 @@ def extract_architecture(tensors: Dict[str, Any], config: Dict[str, Any] = None)
                 key_length = embed_len // q_heads
             else:
                 key_length = 0
         if kv_heads > 0 and key_length > 0:
             kv_dim = kv_heads * key_length
             if num_layers > 0:
@@ -40,7 +38,7 @@ def extract_architecture(tensors: Dict[str, Any], config: Dict[str, Any] = None)
         num_attention_heads = config.get("num_attention_heads", 1)
         num_key_value_heads = config.get("num_key_value_heads", num_attention_heads)
         hidden_size = config.get("hidden_size", 0)
         if num_attention_heads > 0:
             head_dim = hidden_size // num_attention_heads
             kv_dim = num_key_value_heads * head_dim
@@ -51,11 +49,11 @@ def extract_architecture(tensors: Dict[str, Any], config: Dict[str, Any] = None)
     layers_set = set()
     found_fused = False
     found_k_proj = False
     for name, meta in tensors.items():
         if name == "__metadata__":
             continue
         parts = name.split(".")
         if "layers" in parts:
             idx = parts.index("layers")
@@ -71,7 +69,7 @@ def extract_architecture(tensors: Dict[str, Any], config: Dict[str, Any] = None)
             shape = meta.get("shape", [])
             if len(shape) >= 2:
                 kv_dim = shape[0]
         if "qkv_proj.weight" in name or "c_attn.weight" in name:
             found_fused = True
             if not found_k_proj:
@@ -82,7 +80,7 @@ def extract_architecture(tensors: Dict[str, Any], config: Dict[str, Any] = None)
     num_layers = len(layers_set)
     if found_fused and not found_k_proj and kv_dim > 0:
         is_estimate = True
     return num_layers, kv_dim, is_estimate
 def identify_architecture_name(tensors: Dict[str, Any], num_layers: int, config: Dict[str, Any] = None) -> str:
@@ -90,18 +88,18 @@ def identify_architecture_name(tensors: Dict[str, Any], num_layers: int, config:
     if config and "architectures" in config and config["architectures"]:
         arch_title = config["architectures"][0]
         return f"{arch_title} ({num_layers} layers)" if num_layers else arch_title
     metadata = tensors.get("__metadata__", {})
     gen_arch = metadata.get("general.architecture")
     if gen_arch:
         arch_title = str(gen_arch).title()
         return f"{arch_title} ({num_layers} transformer layers)" if num_layers else arch_title
     for name in tensors.keys():
         if name == "__metadata__":
             continue
         name_lower = name.lower()
         if "llama" in name_lower:
             return f"Llama ({num_layers} transformer layers)" if num_layers else "Llama"
@@ -109,5 +107,5 @@ def identify_architecture_name(tensors: Dict[str, Any], num_layers: int, config:
             return f"Mistral ({num_layers} transformer layers)" if num_layers else "Mistral"
         if "qwen" in name_lower:
             return f"Qwen ({num_layers} transformer layers)" if num_layers else "Qwen"
-    return f"Generic Transformer ({num_layers} layers)" if num_layers > 0 else "Unknown Architecture"
+    return f"Generic Transformer ({num_layers} layers)" if num_layers > 0 else "Unknown Architecture"
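
The `config.json` branch visible in the context lines derives `kv_dim` as `num_key_value_heads * (hidden_size // num_attention_heads)`. The standalone sketch below restates that arithmetic to show how grouped-query attention shrinks the value. `kv_dim_from_config` is a hypothetical helper name, and both config dictionaries are made-up examples rather than fixtures from the package.

```python
def kv_dim_from_config(config):
    """Mirror the config.json arithmetic shown in the hunk context lines."""
    num_attention_heads = config.get("num_attention_heads", 1)
    # Without an explicit KV head count, assume one KV head per query head.
    num_key_value_heads = config.get("num_key_value_heads", num_attention_heads)
    hidden_size = config.get("hidden_size", 0)
    if num_attention_heads <= 0:
        return 0
    head_dim = hidden_size // num_attention_heads
    return num_key_value_heads * head_dim


# Illustrative configs: identical except for the number of KV heads.
multi_head = {"hidden_size": 4096, "num_attention_heads": 32}
grouped_query = {"hidden_size": 4096, "num_attention_heads": 32, "num_key_value_heads": 8}

print(kv_dim_from_config(multi_head))     # 4096
print(kv_dim_from_config(grouped_query))  # 1024
```

With 8 KV heads instead of 32, the per-token KV cache is a quarter of the size, which is why the KV head count matters for any context-length estimate.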

{modelinfo_cli-1.4.0 → modelinfo_cli-1.4.2}/src/modelinfo/cli.py RENAMED

@@ -64,7 +64,7 @@ def parse_args(argv: Sequence[str] | None = None) -> argparse.Namespace:
     parser.add_argument(
         "--vllm",
         action="store_true",
-        help="Enable Subtractive Math Engine: Calculate max context tokens using vLLM PagedAttention allocation.",
+        help="Enable vLLM Capacity Simulation: Calculate max context tokens using PagedAttention allocation.",
     )
     parser.add_argument(
         "--gpu-util",
@@ -185,7 +185,7 @@ def main(argv: Sequence[str] | None = None) -> int:
     if len(args.file) > 1:
         if args.vllm:
-            console.print("[red]Error: Side-by-side comparison does not currently support the subtractive --vllm engine. Compare models sequentially or remove --vllm.[/red]")
+            console.print("[red]Error: Side-by-side comparison does not currently support the --vllm capacity simulation. Compare models sequentially or remove --vllm.[/red]")
             return 1
         models = []
@@ -207,7 +207,7 @@ def main(argv: Sequence[str] | None = None) -> int:
                 console.print(f"[red]Error analyzing model '{model_path}': {e}[/red]")
                 return 1
-        print_compare_info(models, args.max_vram, gpu_name=gpu_name_display)
+        print_compare_info(models, gpu_vram_gb if gpu_vram_gb else args.max_vram, gpu_name=gpu_name_display)
         return 0
     file_path = args.file[0]
@@ -228,7 +228,7 @@ def main(argv: Sequence[str] | None = None) -> int:
         console.print(f"[red]Error: {e}[/red]")
         return 1
-    print_model_info(**info, max_vram_gb=gpu_vram_gb if gpu_vram_gb else 8.0, gpu_name=gpu_name_display)
+    print_model_info(**info, max_vram_gb=gpu_vram_gb if gpu_vram_gb else args.max_vram, gpu_name=gpu_name_display)
     return 0
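
Both `cli.py` hunks converge on the same expression, `gpu_vram_gb if gpu_vram_gb else args.max_vram`. In 1.4.0 the single-model path fell back to a hard-coded `8.0` and the comparison path always used `args.max_vram`. The sketch below isolates that expression to show its behavior. `resolve_vram_ceiling` is a hypothetical name, and the numeric values are arbitrary.

```python
def resolve_vram_ceiling(gpu_vram_gb, max_vram):
    """Prefer the resolved GPU size; otherwise use the --max-vram value."""
    # Truthiness check, as in the diff: None and 0 both fall through to max_vram.
    return gpu_vram_gb if gpu_vram_gb else max_vram


print(resolve_vram_ceiling(24.0, 80.0))  # GPU size wins: 24.0
print(resolve_vram_ceiling(None, 80.0))  # no GPU resolved: 80.0
print(resolve_vram_ceiling(0, 80.0))     # 0 is falsy, so also 80.0
```

The practical effect is that the heat-map thresholds follow the same ceiling in both single-model and comparison output.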

{modelinfo_cli-1.4.0 → modelinfo_cli-1.4.2}/src/modelinfo/parsers/huggingface.py RENAMED

@@ -126,7 +126,7 @@ def fetch_huggingface_repo(repo_id: str, fetch_tensors: bool = False) -> Tuple[D
             def fetch_shard(shard: str):
                 return shard, _fetch_safetensors_header(repo_id, shard)
-            with concurrent.futures.ThreadPoolExecutor(max_workers=min(8, len(unique_shards))) as executor:
+            with concurrent.futures.ThreadPoolExecutor(max_workers=max(1, min(8, len(unique_shards)))) as executor:
                 future_to_shard = {executor.submit(fetch_shard, shard): shard for shard in unique_shards}
                 for future in concurrent.futures.as_completed(future_to_shard):
                     shard, shard_header = future.result()
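
The `max(1, ...)` wrapper matters because `ThreadPoolExecutor` rejects a non-positive `max_workers` with a `ValueError`, which is what `min(8, len(unique_shards))` would produce for an empty shard list. The sketch below reproduces that in isolation. `worker_count` and `try_pool` are hypothetical helpers written for this note.

```python
import concurrent.futures


def worker_count(num_shards, cap=8):
    """Clamp the pool size to the range [1, cap], as the 1.4.2 line does."""
    return max(1, min(cap, num_shards))


def try_pool(max_workers):
    """Report whether an executor can be constructed with this worker count."""
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers):
            return "ok"
    except ValueError as exc:
        return f"ValueError: {exc}"


print(try_pool(min(8, 0)))       # 1.4.0 expression with zero shards
print(try_pool(worker_count(0)))   # 1.4.2 expression with zero shards
print(worker_count(3), worker_count(50))
```

With the clamp, an empty shard list yields an idle single-worker pool instead of an exception, and the upper bound of 8 concurrent requests is unchanged.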

{modelinfo_cli-1.4.0 → modelinfo_cli-1.4.2/src/modelinfo_cli.egg-info}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: modelinfo-cli
-Version: 1.4.0
-Summary: A sub-100ms, zero-dependency CLI to inspect ML models (.safetensors, .gguf) locally or via Hugging Face, calculate exact VRAM footprints, and determine hardware fit.
+Version: 1.4.2
+Summary: A CLI tool to inspect ML checkpoints (.safetensors, .gguf, .pt) and calculate inference VRAM, multi-GPU memory splits, and vLLM serving capacity.
 Author: ModelInfo Contributors
 License: MIT
 Requires-Python: >=3.10
@@ -26,14 +26,14 @@ Dynamic: license-file
 ModelInfo is a CLI tool that inspects machine learning model checkpoints (`.safetensors`, `.gguf`, `.pt`) and calculates hardware requirements completely offline.
-It reads binary headers directly using the Python standard library. By bypassing full tensor payload loading and strictly excluding heavy ecosystems like PyTorch or HuggingFace, the tool executes in under 100 milliseconds.
+It reads binary headers directly using the Python standard library. It skips the full tensor payload entirely (no PyTorch, no HuggingFace) and parses in under 100ms.
 ## Features
 - **Zero-Dependency Parsing**: Reads `.safetensors` 8-byte JSON prefixes and `.gguf` binary key-value metadata directly via `struct` and `json` (falling back to `config.json` if needed).
 - **Remote Hugging Face Hub Inspection**: Pass a repo ID (e.g., `meta-llama/Llama-2-7b-hf`) and it uses concurrent byte-range requests to read the headers off the CDN in under 2 seconds. No need to download the checkpoint.
 - Parses `model.safetensors.index.json` to support sharded models without crashing on partial downloads.
-- **Dynamic VRAM & Subtractive vLLM Math**: Calculates exact VRAM limits based on the model's architecture and your target context length. If you use the `--vllm` flag, it switches to a subtractive "Serving Capacity" engine that calculates exactly how many tokens fit in the PagedAttention pool based on your `--gpu-util` ratio.
+- **Dynamic VRAM & vLLM Capacity Planning**: Calculates exact VRAM limits based on the model's architecture and your target context length. If you use the `--vllm` flag, it switches to a "Serving Capacity" simulation that calculates exactly how many tokens fit in the PagedAttention pool based on your `--gpu-util` ratio.
 - **Hardware Fit Diagnostics**: Check if a model fits your cluster with `--gpu` (e.g. `--gpu RTX4090` or `--gpu auto`). It enforces Apple Silicon's 75% unified memory wire limit, and you can explicitly model multi-GPU NCCL communication penalties with `--topology` and `--strategy`.
 - **Side-by-Side Comparison**: Pass multiple models to trigger a comparison table (parameters, data types, context lengths, VRAM footprints).
 - Uses exact `ggml_type` mappings for GGUF formats to calculate byte-scaling coefficients, preventing VRAM under-reporting.
@@ -66,7 +66,7 @@ pip install -e ".[dev]"
 ## Testing
-The testing suite enforces cross-platform structural integrity and guards the zero-dependency latency constraint. Tests are isolated against custom binary mocks in `tests/fixtures/`.
+Tests cover the binary parsers and verify the sub-100ms local parse constraint using binary mocks in `tests/fixtures/`.
 Run the test suite using pytest:
@@ -165,7 +165,7 @@ Qwen2.5-0.5B       494.0M    BF16     8K         1.6 GB      ✓
 | `--gpu` | `--gpu rtx4090` | Check if the model fits. Accepts GPU names (`rtx4090`, `b200`, `rx7900xtx`), explicit VRAM limits in GB (`--gpu 24`), or local hardware auto-discovery (`--gpu auto`). |
 | `--context` | `--context 32768` | Adjust the target KV cache length. Essential for calculating the dynamic memory footprint of long-context models. Defaults to `8192`. |
 | `--max-vram` | `--max-vram 80` | Adjusts the color-coded heat mapping thresholds (Green/Yellow/Red) in the terminal output to match a specific hardware ceiling. |
-| `--vllm` | `--vllm --gpu auto` | Switches from additive memory checking to a subtractive serving capacity estimation. Shows exactly how many tokens fit in the PagedAttention pool. |
+| `--vllm` | `--vllm --gpu auto` | Switches from additive memory checking to a serving capacity simulation. Shows exactly how many tokens fit in the PagedAttention pool. |
 | `--gpu-util` | `--gpu-util 0.9` | Sets the vLLM `gpu_memory_utilization` ratio. Defaults to `0.9` (reserves 10% for PyTorch context). |
 | `--topology` | `--topology nvlink` | Set interconnect topology to calculate exact communication overhead penalties (`nvlink`, `pcie4`, `pcie3`). Defaults to `pcie4`. |
 | `--strategy` | `--strategy tp` | Selects the parallelization strategy for multi-GPU setups (`tp` for Tensor Parallelism, `pp` for Pipeline Parallelism). Defaults to `tp`. |

{modelinfo_cli-1.4.0 → modelinfo_cli-1.4.2}/tests/test_calculator.py RENAMED

@@ -143,8 +143,8 @@ def test_strategy_pp():
     assert fp_pp["penalty_percentage"] == 0.0
     assert fp_pp["overhead_bytes"] == (4 * 600 * 1024 * 1024)
-def test_vllm_subtractive_math():
-    """Verify the subtractive vLLM serving capacity engine calculates exact tokens."""
+def test_vllm_capacity_simulation():
+    """Verify the vLLM serving capacity engine calculates exact tokens."""
     tensors = {
         "model.layers.0.attn.weight": {"shape": [1024, 1024], "dtype": "F16"} # Base: 2MB
     }