modelinfo-cli 1.1.0__tar.gz → 1.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {modelinfo_cli-1.1.0/src/modelinfo_cli.egg-info → modelinfo_cli-1.2.0}/PKG-INFO +11 -4
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/README.md +10 -3
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/pyproject.toml +1 -1
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/__init__.py +1 -1
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/architecture.py +6 -2
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/calculator.py +5 -1
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/cli.py +27 -10
- modelinfo_cli-1.2.0/src/modelinfo/parsers/huggingface.py +151 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/ui.py +25 -7
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0/src/modelinfo_cli.egg-info}/PKG-INFO +11 -4
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo_cli.egg-info/SOURCES.txt +1 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/tests/test_calculator.py +11 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/LICENSE +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/setup.cfg +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/__main__.py +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/parsers/__init__.py +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/parsers/base.py +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/parsers/gguf.py +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/parsers/pytorch.py +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo/parsers/safetensors.py +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo_cli.egg-info/dependency_links.txt +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo_cli.egg-info/entry_points.txt +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo_cli.egg-info/requires.txt +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/src/modelinfo_cli.egg-info/top_level.txt +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/tests/test_constraints.py +0 -0
- {modelinfo_cli-1.1.0 → modelinfo_cli-1.2.0}/tests/test_parsers.py +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: modelinfo-cli
|
|
3
|
-
Version: 1.1.0
|
|
3
|
+
Version: 1.2.0
|
|
4
4
|
Summary: A sub-100ms, zero-dependency CLI tool to inspect ML model checkpoints and dynamically calculate VRAM requirements.
|
|
5
5
|
Author: ModelInfo Contributors
|
|
6
6
|
License: MIT
|
|
@@ -29,11 +29,12 @@ It reads binary headers directly using the Python standard library. By bypassing
|
|
|
29
29
|
## Features
|
|
30
30
|
|
|
31
31
|
- **Zero-Dependency Parsing**: Reads the 8-byte JSON prefix of `.safetensors` files and the binary key-value metadata of `.gguf` directly via `struct` and `json`. Seamlessly reads adjacent `config.json` for robust fallback logic.
|
|
32
|
+
- **Remote Hugging Face Hub Inspection**: Inspect any public or gated model directly via its repo ID (e.g., `modelinfo meta-llama/Llama-2-7b-hf`) without downloading the 15GB checkpoint. Uses concurrent byte-range requests to pluck the binary headers directly off the CDN in under 2 seconds.
|
|
32
33
|
- **Sharded Model Support**: Transparently parses `model.safetensors.index.json` to detect multi-file checkpoint distributions, gracefully guarding against partial downloads without crashing.
|
|
33
|
-
- **Dynamic VRAM Estimation**: Extracts underlying model architecture (layers, heads, dimensions) to calculate exact VRAM limits, including dynamic KV cache footprints based on user-specified context lengths.
|
|
34
|
+
- **Dynamic VRAM Estimation**: Extracts the underlying model architecture (layers, heads, dimensions) to estimate VRAM requirements, including dynamic KV cache footprints based on user-specified context lengths. When no context length is given, defaults to 8192 tokens (capped at the model's native limit when that is smaller) to prevent unrealistic VRAM calculations, while still warning users if the requested context exceeds the model's native limit. Estimates include a standard 600MB CUDA context overhead.
|
|
34
35
|
- **Precise Block Quantization**: Factors in exact byte-scaling coefficients for GGUF formats (e.g., Q8, Q6, Q4) rather than naive averages, eliminating VRAM under-reporting.
|
|
35
36
|
- **Secure Pickling**: Inspects legacy `.pt` files without executing arbitrary code by using a highly restricted `pickle.Unpickler`.
|
|
36
|
-
- **Terminal UI**: Groups repetitive structural layers and color-codes VRAM heatmaps using `rich`.
|
|
37
|
+
- **Terminal UI**: Groups repetitive structural layers and color-codes VRAM heatmaps using `rich`. Breaks down memory footprints into Weights, KV Cache, and Overhead.
|
|
37
38
|
|
|
38
39
|
## Installation
|
|
39
40
|
|
|
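The "concurrent byte-range requests" in the feature list above can be illustrated with a small offline sketch. It is not the package's implementation: `merge_shard_headers` and `fake_fetch` are hypothetical names, and the network call is replaced by a stub so the example runs without Hub access. It also assumes, as the diff further down suggests, that each shard's `__metadata__` entry is dropped during the merge.

```python
import concurrent.futures
from typing import Any, Callable, Dict, Iterable


def merge_shard_headers(
    shards: Iterable[str],
    fetch_header: Callable[[str], Dict[str, Any]],
    max_workers: int = 8,
) -> Dict[str, Any]:
    """Fetch every unique shard header in parallel and merge the tensor entries."""
    unique = sorted(set(shards))
    if not unique:
        # ThreadPoolExecutor rejects max_workers=0, so bail out early.
        return {}
    merged: Dict[str, Any] = {}
    workers = min(max_workers, len(unique))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for header in pool.map(fetch_header, unique):
            for name, entry in header.items():
                if name != "__metadata__":  # per-shard metadata is not a tensor
                    merged[name] = entry
    return merged


def fake_fetch(shard: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a ranged HTTP request to one shard."""
    return {
        "__metadata__": {"format": "pt"},
        f"{shard}.weight": {"dtype": "F16", "shape": [2, 2]},
    }


# Several tensors can live in the same shard, so shard names repeat in the map.
weight_map = {
    "a": "model-00001-of-00002.safetensors",
    "b": "model-00002-of-00002.safetensors",
    "c": "model-00001-of-00002.safetensors",
}
merged = merge_shard_headers(weight_map.values(), fake_fetch)
print(sorted(merged))
```

Deduplicating the shard names first keeps the number of requests equal to the number of files rather than the number of tensors.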
@@ -67,12 +68,18 @@ pytest tests/ -v
|
|
|
67
68
|
|
|
68
69
|
## Usage
|
|
69
70
|
|
|
70
|
-
Inspect a model checkpoint:
|
|
71
|
+
Inspect a local model checkpoint:
|
|
71
72
|
|
|
72
73
|
```bash
|
|
73
74
|
modelinfo mistral-7b.safetensors
|
|
74
75
|
```
|
|
75
76
|
|
|
77
|
+
Inspect a remote model directly from the Hugging Face Hub:
|
|
78
|
+
|
|
79
|
+
```bash
|
|
80
|
+
modelinfo meta-llama/Llama-2-7b-hf
|
|
81
|
+
```
|
|
82
|
+
|
|
76
83
|
Calculate the memory footprint with a specific KV cache context window:
|
|
77
84
|
|
|
78
85
|
```bash
|
|
@@ -11,11 +11,12 @@ It reads binary headers directly using the Python standard library. By bypassing
|
|
|
11
11
|
## Features
|
|
12
12
|
|
|
13
13
|
- **Zero-Dependency Parsing**: Reads the 8-byte JSON prefix of `.safetensors` files and the binary key-value metadata of `.gguf` directly via `struct` and `json`. Seamlessly reads adjacent `config.json` for robust fallback logic.
|
|
14
|
+
- **Remote Hugging Face Hub Inspection**: Inspect any public or gated model directly via its repo ID (e.g., `modelinfo meta-llama/Llama-2-7b-hf`) without downloading the 15GB checkpoint. Uses concurrent byte-range requests to pluck the binary headers directly off the CDN in under 2 seconds.
|
|
14
15
|
- **Sharded Model Support**: Transparently parses `model.safetensors.index.json` to detect multi-file checkpoint distributions, gracefully guarding against partial downloads without crashing.
|
|
15
|
-
- **Dynamic VRAM Estimation**: Extracts underlying model architecture (layers, heads, dimensions) to calculate exact VRAM limits, including dynamic KV cache footprints based on user-specified context lengths.
|
|
16
|
+
- **Dynamic VRAM Estimation**: Extracts the underlying model architecture (layers, heads, dimensions) to estimate VRAM requirements, including dynamic KV cache footprints based on user-specified context lengths. When no context length is given, defaults to 8192 tokens (capped at the model's native limit when that is smaller) to prevent unrealistic VRAM calculations, while still warning users if the requested context exceeds the model's native limit. Estimates include a standard 600MB CUDA context overhead.
|
|
16
17
|
- **Precise Block Quantization**: Factors in exact byte-scaling coefficients for GGUF formats (e.g., Q8, Q6, Q4) rather than naive averages, eliminating VRAM under-reporting.
|
|
17
18
|
- **Secure Pickling**: Inspects legacy `.pt` files without executing arbitrary code by using a highly restricted `pickle.Unpickler`.
|
|
18
|
-
- **Terminal UI**: Groups repetitive structural layers and color-codes VRAM heatmaps using `rich`.
|
|
19
|
+
- **Terminal UI**: Groups repetitive structural layers and color-codes VRAM heatmaps using `rich`. Breaks down memory footprints into Weights, KV Cache, and Overhead.
|
|
19
20
|
|
|
20
21
|
## Installation
|
|
21
22
|
|
|
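To make the KV cache part of the VRAM estimate concrete, here is a minimal sketch of the conventional formula. It is an assumption that the package computes the cache this way; the function name `kv_cache_bytes` and its parameters are illustrative, and the 2-byte element size reflects an FP16 cache, which the `test_kv_cache_is_fp16` test name in this diff suggests.

```python
def kv_cache_bytes(
    num_layers: int,
    kv_dim: int,
    context_length: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # FP16
) -> int:
    """Conventional KV cache size: one K and one V vector per layer per token."""
    return 2 * num_layers * kv_dim * context_length * batch_size * bytes_per_element


# Llama-2-7B-style dimensions: 32 layers, 32 KV heads x 128 head_dim = 4096.
print(kv_cache_bytes(32, 4096, 8192) / 1024**3)  # 4.0 (GiB at the 8192-token default)
```

Because the cache grows linearly with context length, halving the context to 4096 tokens halves this figure to 2 GiB.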
@@ -49,12 +50,18 @@ pytest tests/ -v
|
|
|
49
50
|
|
|
50
51
|
## Usage
|
|
51
52
|
|
|
52
|
-
Inspect a model checkpoint:
|
|
53
|
+
Inspect a local model checkpoint:
|
|
53
54
|
|
|
54
55
|
```bash
|
|
55
56
|
modelinfo mistral-7b.safetensors
|
|
56
57
|
```
|
|
57
58
|
|
|
59
|
+
Inspect a remote model directly from the Hugging Face Hub:
|
|
60
|
+
|
|
61
|
+
```bash
|
|
62
|
+
modelinfo meta-llama/Llama-2-7b-hf
|
|
63
|
+
```
|
|
64
|
+
|
|
58
65
|
Calculate the memory footprint with a specific KV cache context window:
|
|
59
66
|
|
|
60
67
|
```bash
|
|
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|
|
4
4
|
|
|
5
5
|
[project]
|
|
6
6
|
name = "modelinfo-cli"
|
|
7
|
-
version = "1.1.0"
|
|
7
|
+
version = "1.2.0"
|
|
8
8
|
description = "A sub-100ms, zero-dependency CLI tool to inspect ML model checkpoints and dynamically calculate VRAM requirements."
|
|
9
9
|
readme = "README.md"
|
|
10
10
|
requires-python = ">=3.10"
|
|
@@ -85,8 +85,12 @@ def extract_architecture(tensors: Dict[str, Any], config: Dict[str, Any] = None)
|
|
|
85
85
|
|
|
86
86
|
return num_layers, kv_dim, is_estimate
|
|
87
87
|
|
|
88
|
-
def identify_architecture_name(tensors: Dict[str, Any], num_layers: int) -> str:
|
|
89
|
-
"""Attempt to identify the architecture family based on tensor names or
|
|
88
|
+
def identify_architecture_name(tensors: Dict[str, Any], num_layers: int, config: Dict[str, Any] | None = None) -> str:
|
|
89
|
+
"""Attempt to identify the architecture family based on tensor names, metadata, or config.json."""
|
|
90
|
+
if config and config.get("architectures"):
|
|
91
|
+
arch_title = config["architectures"][0]
|
|
92
|
+
return f"{arch_title} ({num_layers} layers)" if num_layers else arch_title
|
|
93
|
+
|
|
90
94
|
metadata = tensors.get("__metadata__", {})
|
|
91
95
|
gen_arch = metadata.get("general.architecture")
|
|
92
96
|
|
|
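The config-first lookup added in this hunk can be exercised in isolation. The sketch below is a standalone approximation, not the package's function: `arch_label` is a hypothetical helper that returns `None` where the real code would fall through to the tensor-name and GGUF metadata heuristics.

```python
from typing import Any, Dict, Optional


def arch_label(config: Optional[Dict[str, Any]], num_layers: int) -> Optional[str]:
    """Return a label from config.json's `architectures` list, if present."""
    architectures = (config or {}).get("architectures") or []
    if not architectures:
        return None  # caller would fall back to other heuristics
    title = architectures[0]
    return f"{title} ({num_layers} layers)" if num_layers else title


print(arch_label({"architectures": ["LlamaForCausalLM"]}, 32))
print(arch_label({"architectures": []}, 32))
```

Treating a missing key, an empty list, and a missing config the same way keeps the fallback path uniform.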
@@ -62,11 +62,15 @@ def calculate_footprint(tensors: Dict[str, Any], context_length: int = 0, batch_
|
|
|
62
62
|
|
|
63
63
|
primary_dtype = max(dtype_counts.items(), key=lambda x: x[1])[0] if dtype_counts else "Unknown"
|
|
64
64
|
|
|
65
|
+
CUDA_CONTEXT_MB = 600
|
|
66
|
+
overhead_bytes = CUDA_CONTEXT_MB * 1024 * 1024
|
|
67
|
+
|
|
65
68
|
return {
|
|
66
69
|
"total_params": total_params,
|
|
67
70
|
"base_memory_bytes": base_memory_bytes,
|
|
68
71
|
"kv_cache_bytes": kv_cache_bytes,
|
|
69
|
-
"
|
|
72
|
+
"overhead_bytes": overhead_bytes,
|
|
73
|
+
"total_memory_bytes": base_memory_bytes + kv_cache_bytes + overhead_bytes,
|
|
70
74
|
"num_layers": num_layers,
|
|
71
75
|
"kv_dim": kv_dim,
|
|
72
76
|
"primary_dtype": primary_dtype,
|
|
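As a worked example of the new total, the sketch below adds the fixed overhead to a single 1024 x 1024 FP16 tensor with no KV cache. The helper name `total_memory` is hypothetical; only the constant and the sum mirror the hunk above. Note that `600 * 1024 * 1024` is 600 MiB (629,145,600 bytes), slightly more than 600 decimal megabytes.

```python
CUDA_CONTEXT_MB = 600  # same constant as in the hunk above


def total_memory(base_memory_bytes: int, kv_cache_bytes: int) -> dict:
    """Combine weights, KV cache, and the fixed overhead into one footprint."""
    overhead_bytes = CUDA_CONTEXT_MB * 1024 * 1024
    return {
        "base_memory_bytes": base_memory_bytes,
        "kv_cache_bytes": kv_cache_bytes,
        "overhead_bytes": overhead_bytes,
        "total_memory_bytes": base_memory_bytes + kv_cache_bytes + overhead_bytes,
    }


# One 1024 x 1024 tensor at 2 bytes per element, no KV cache.
weights = 1024 * 1024 * 2
footprint = total_memory(weights, 0)
print(footprint["overhead_bytes"])      # 629145600
print(footprint["total_memory_bytes"])  # 631242752
```

One consequence worth keeping in mind: with a constant overhead added unconditionally, the total can never be zero, so callers that test for an empty footprint need to look at the weights figure instead.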
@@ -26,7 +26,7 @@ def parse_args(argv: Sequence[str] | None = None) -> argparse.Namespace:
|
|
|
26
26
|
parser.add_argument(
|
|
27
27
|
"--context",
|
|
28
28
|
type=int,
|
|
29
|
-
default=
|
|
29
|
+
default=None,
|
|
30
30
|
help="Context length for dynamic KV cache footprint calculation.",
|
|
31
31
|
)
|
|
32
32
|
|
|
@@ -40,7 +40,14 @@ def main(argv: Sequence[str] | None = None) -> int:
|
|
|
40
40
|
tensors = {}
|
|
41
41
|
config = None
|
|
42
42
|
|
|
43
|
-
if file_path.endswith(".safetensors"
|
|
43
|
+
if not os.path.exists(file_path) and not file_path.endswith((".safetensors", ".gguf", ".pt", ".bin", ".index.json")):
|
|
44
|
+
from modelinfo.parsers.huggingface import fetch_huggingface_repo
|
|
45
|
+
try:
|
|
46
|
+
tensors, config, format_name, disk_size = fetch_huggingface_repo(args.file)
|
|
47
|
+
except Exception as e:
|
|
48
|
+
console.print(f"[red]Error fetching from Hugging Face: {e}[/red]")
|
|
49
|
+
return 1
|
|
50
|
+
elif file_path.endswith(".safetensors") or file_path.endswith(".index.json"):
|
|
44
51
|
tensors = parse_safetensors_header(args.file)
|
|
45
52
|
format_name = "SafeTensors"
|
|
46
53
|
|
|
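The routing rule in this hunk treats an argument as a Hub repo ID only when it is neither an existing path nor a name with a known checkpoint suffix. A minimal standalone sketch of that test follows; `is_remote_candidate` and `LOCAL_SUFFIXES` are hypothetical names, and the example assumes no local path called `meta-llama/Llama-2-7b-hf` exists in the working directory.

```python
import os

# Suffixes the CLI treats as local checkpoint files (copied from the hunk above).
LOCAL_SUFFIXES = (".safetensors", ".gguf", ".pt", ".bin", ".index.json")


def is_remote_candidate(target: str) -> bool:
    """True when `target` should be tried as a Hugging Face repo ID."""
    return not os.path.exists(target) and not target.endswith(LOCAL_SUFFIXES)


print(is_remote_candidate("meta-llama/Llama-2-7b-hf"))   # True if no such local path
print(is_remote_candidate("missing-model.safetensors"))  # False: looks like a local file
```

Checking the suffix as well as existence means a mistyped local filename produces a "not found" error rather than a confusing network request.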
@@ -61,24 +68,33 @@ def main(argv: Sequence[str] | None = None) -> int:
|
|
|
61
68
|
format_name = "PyTorch"
|
|
62
69
|
else:
|
|
63
70
|
console.print(
|
|
64
|
-
f"[red]Error:
|
|
71
|
+
f"[red]Error: File '{args.file}' not found locally and does not appear to be a Hugging Face repository ID.[/red]"
|
|
65
72
|
)
|
|
66
73
|
return 1
|
|
67
74
|
|
|
68
|
-
footprint = calculate_footprint(tensors, context_length=args.context, config=config)
|
|
69
|
-
num_layers = footprint["num_layers"]
|
|
70
|
-
arch_name = identify_architecture_name(tensors, num_layers)
|
|
71
|
-
|
|
72
75
|
max_context = None
|
|
73
76
|
if config:
|
|
74
77
|
max_context = config.get("max_position_embeddings")
|
|
75
|
-
|
|
78
|
+
elif format_name == "GGUF":
|
|
76
79
|
metadata = tensors.get("__metadata__", {})
|
|
77
80
|
gen_arch = metadata.get("general.architecture")
|
|
78
81
|
if gen_arch:
|
|
79
82
|
max_context = metadata.get(f"{gen_arch}.context_length")
|
|
83
|
+
|
|
84
|
+
# Determine the actual context length to use for calculation
|
|
85
|
+
is_default_context = False
|
|
86
|
+
context_length = args.context
|
|
87
|
+
if context_length is None:
|
|
88
|
+
context_length = min(8192, max_context) if max_context else 8192
|
|
89
|
+
is_default_context = True
|
|
80
90
|
|
|
81
|
-
|
|
91
|
+
footprint = calculate_footprint(tensors, context_length=context_length, config=config)
|
|
92
|
+
num_layers = footprint["num_layers"]
|
|
93
|
+
arch_name = identify_architecture_name(tensors, num_layers, config)
|
|
94
|
+
|
|
95
|
+
if format_name != "SafeTensors" or os.path.exists(args.file):
|
|
96
|
+
disk_size = os.path.getsize(args.file) if os.path.exists(args.file) else 0.0
|
|
97
|
+
|
|
82
98
|
tensor_count = len([k for k in tensors.keys() if k != "__metadata__"])
|
|
83
99
|
|
|
84
100
|
print_model_info(
|
|
@@ -87,7 +103,8 @@ def main(argv: Sequence[str] | None = None) -> int:
|
|
|
87
103
|
tensor_count=tensor_count,
|
|
88
104
|
footprint=footprint,
|
|
89
105
|
disk_size=disk_size,
|
|
90
|
-
context_length=
|
|
106
|
+
context_length=context_length,
|
|
107
|
+
is_default_context=is_default_context,
|
|
91
108
|
tensors=tensors,
|
|
92
109
|
max_context=max_context
|
|
93
110
|
)
|
|
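The default-context rule introduced in `cli.py` can be isolated into a small pure function for testing. This is a sketch under that reading of the diff; `resolve_context` and `DEFAULT_CONTEXT` are hypothetical names that do not appear in the package.

```python
from typing import Optional, Tuple

DEFAULT_CONTEXT = 8192  # default used when --context is omitted


def resolve_context(requested: Optional[int], max_context: Optional[int]) -> Tuple[int, bool]:
    """Return (context_length, is_default_context)."""
    if requested is not None:
        # An explicit --context always wins, even above the native limit.
        return requested, False
    if max_context:
        return min(DEFAULT_CONTEXT, max_context), True
    return DEFAULT_CONTEXT, True


print(resolve_context(None, 4096))    # (4096, True)
print(resolve_context(None, 32768))   # (8192, True)
print(resolve_context(32768, 4096))   # (32768, False)
```

Returning the flag alongside the length lets the UI distinguish "the user asked for 8192" from "8192 was assumed".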
@@ -0,0 +1,151 @@
|
|
|
1
|
+
import concurrent.futures
|
|
2
|
+
import json
|
|
3
|
+
import os
|
|
4
|
+
import struct
|
|
5
|
+
import urllib.error
|
|
6
|
+
import urllib.request
|
|
7
|
+
from typing import Any, Dict, Tuple
|
|
8
|
+
|
|
9
|
+
def _get_hf_token() -> str | None:
|
|
10
|
+
token = os.environ.get("HF_TOKEN")
|
|
11
|
+
if token:
|
|
12
|
+
return token
|
|
13
|
+
|
|
14
|
+
cache_path = os.path.expanduser("~/.cache/huggingface/token")
|
|
15
|
+
if os.path.exists(cache_path):
|
|
16
|
+
try:
|
|
17
|
+
with open(cache_path, "r", encoding="utf-8") as f:
|
|
18
|
+
return f.read().strip()
|
|
19
|
+
except OSError:
|
|
20
|
+
pass
|
|
21
|
+
|
|
22
|
+
legacy_path = os.path.expanduser("~/.huggingface/token")
|
|
23
|
+
if os.path.exists(legacy_path):
|
|
24
|
+
try:
|
|
25
|
+
with open(legacy_path, "r", encoding="utf-8") as f:
|
|
26
|
+
return f.read().strip()
|
|
27
|
+
except OSError:
|
|
28
|
+
pass
|
|
29
|
+
|
|
30
|
+
return None
|
|
31
|
+
|
|
32
|
+
def _make_request(url: str, headers: Dict[str, str] | None = None) -> bytes:
|
|
33
|
+
if headers is None:
|
|
34
|
+
headers = {}
|
|
35
|
+
|
|
36
|
+
token = _get_hf_token()
|
|
37
|
+
if token:
|
|
38
|
+
headers["Authorization"] = f"Bearer {token}"
|
|
39
|
+
|
|
40
|
+
req = urllib.request.Request(url, headers=headers)
|
|
41
|
+
try:
|
|
42
|
+
with urllib.request.urlopen(req, timeout=10) as response:
|
|
43
|
+
return response.read()
|
|
44
|
+
except urllib.error.HTTPError as e:
|
|
45
|
+
if e.code == 401:
|
|
46
|
+
raise PermissionError(f"🔒 Gated Model or Invalid Token: Please set HF_TOKEN environment variable to access {url}")
|
|
47
|
+
if e.code == 404:
|
|
48
|
+
raise FileNotFoundError(f"File not found on Hugging Face Hub: {url}")
|
|
49
|
+
raise
|
|
50
|
+
|
|
51
|
+
def _fetch_safetensors_header(repo_id: str, filename: str) -> Dict[str, Any]:
|
|
52
|
+
url = f"https://huggingface.co/{repo_id}/resolve/main/{filename}"
|
|
53
|
+
|
|
54
|
+
# 1. Fetch the first 500KB in a single roundtrip
|
|
55
|
+
headers = {"Range": "bytes=0-500000"}
|
|
56
|
+
try:
|
|
57
|
+
chunk = _make_request(url, headers=headers)
|
|
58
|
+
except urllib.error.HTTPError as e:
|
|
59
|
+
if e.code == 416: # Range Not Satisfiable (typically an empty file; servers normally truncate an over-long range); retry without Range
|
|
60
|
+
chunk = _make_request(url)
|
|
61
|
+
else:
|
|
62
|
+
raise
|
|
63
|
+
|
|
64
|
+
if len(chunk) < 8:
|
|
65
|
+
raise ValueError(f"File {filename} is too small to contain a SafeTensors header.")
|
|
66
|
+
|
|
67
|
+
header_size = struct.unpack("<Q", chunk[:8])[0]
|
|
68
|
+
|
|
69
|
+
# 2. Slice locally if it fits
|
|
70
|
+
if 8 + header_size <= len(chunk):
|
|
71
|
+
json_bytes = chunk[8:8+header_size]
|
|
72
|
+
else:
|
|
73
|
+
# 3. Double-roundtrip only if the header is massive (>500KB)
|
|
74
|
+
headers = {"Range": f"bytes=8-{8+header_size-1}"}
|
|
75
|
+
json_bytes = _make_request(url, headers=headers)
|
|
76
|
+
|
|
77
|
+
return json.loads(json_bytes)
|
|
78
|
+
|
|
79
|
+
def fetch_huggingface_repo(repo_id: str) -> Tuple[Dict[str, Any], Dict[str, Any] | None, str, float]:
|
|
80
|
+
"""
|
|
81
|
+
Fetches the metadata directly from the Hugging Face Hub over the network.
|
|
82
|
+
Returns: (tensors, config, format_name, disk_size)
|
|
83
|
+
"""
|
|
84
|
+
api_url = f"https://huggingface.co/api/models/{repo_id}"
|
|
85
|
+
try:
|
|
86
|
+
api_data = json.loads(_make_request(api_url).decode("utf-8"))
|
|
87
|
+
except urllib.error.HTTPError as e:
|
|
88
|
+
if e.code == 401:
|
|
89
|
+
raise PermissionError(f"🔒 Gated Model: Please set HF_TOKEN environment variable to access {repo_id}")
|
|
90
|
+
raise
|
|
91
|
+
|
|
92
|
+
siblings = api_data.get("siblings", [])
|
|
93
|
+
filenames = {s["rfilename"] for s in siblings}
|
|
94
|
+
|
|
95
|
+
config = None
|
|
96
|
+
if "config.json" in filenames:
|
|
97
|
+
config_url = f"https://huggingface.co/{repo_id}/resolve/main/config.json"
|
|
98
|
+
config = json.loads(_make_request(config_url).decode("utf-8"))
|
|
99
|
+
|
|
100
|
+
tensors = {}
|
|
101
|
+
total_size = 0.0
|
|
102
|
+
|
|
103
|
+
if "model.safetensors.index.json" in filenames:
|
|
104
|
+
# Sharded SafeTensors
|
|
105
|
+
index_url = f"https://huggingface.co/{repo_id}/resolve/main/model.safetensors.index.json"
|
|
106
|
+
index_data = json.loads(_make_request(index_url).decode("utf-8"))
|
|
107
|
+
|
|
108
|
+
weight_map = index_data.get("weight_map", {})
|
|
109
|
+
unique_shards = list(set(weight_map.values()))
|
|
110
|
+
|
|
111
|
+
total_size = index_data.get("metadata", {}).get("total_size", 0.0)
|
|
112
|
+
|
|
113
|
+
def fetch_shard(shard: str):
|
|
114
|
+
return shard, _fetch_safetensors_header(repo_id, shard)
|
|
115
|
+
|
|
116
|
+
with concurrent.futures.ThreadPoolExecutor(max_workers=max(1, min(8, len(unique_shards)))) as executor:
|
|
117
|
+
future_to_shard = {executor.submit(fetch_shard, shard): shard for shard in unique_shards}
|
|
118
|
+
for future in concurrent.futures.as_completed(future_to_shard):
|
|
119
|
+
shard, shard_header = future.result()
|
|
120
|
+
for k, v in shard_header.items():
|
|
121
|
+
if k != "__metadata__":
|
|
122
|
+
tensors[k] = v
|
|
123
|
+
|
|
124
|
+
tensors["__metadata__"] = {
|
|
125
|
+
"missing_shards": 0,
|
|
126
|
+
"total_shards": len(unique_shards),
|
|
127
|
+
"is_sharded": True
|
|
128
|
+
}
|
|
129
|
+
format_name = "SafeTensors"
|
|
130
|
+
|
|
131
|
+
elif "model.safetensors" in filenames:
|
|
132
|
+
# Single SafeTensors
|
|
133
|
+
header = _fetch_safetensors_header(repo_id, "model.safetensors")
|
|
134
|
+
tensors = header
|
|
135
|
+
format_name = "SafeTensors"
|
|
136
|
+
|
|
137
|
+
# No index file means no total_size, so read it from Content-Length via a HEAD request; it stays 0 if that fails
|
|
138
|
+
req = urllib.request.Request(f"https://huggingface.co/{repo_id}/resolve/main/model.safetensors", method="HEAD")
|
|
139
|
+
token = _get_hf_token()
|
|
140
|
+
if token:
|
|
141
|
+
req.add_header("Authorization", f"Bearer {token}")
|
|
142
|
+
try:
|
|
143
|
+
with urllib.request.urlopen(req, timeout=10) as response:
|
|
144
|
+
total_size = int(response.headers.get("Content-Length", 0))
|
|
145
|
+
except Exception:
|
|
146
|
+
pass
|
|
147
|
+
|
|
148
|
+
else:
|
|
149
|
+
raise ValueError(f"Repository {repo_id} does not contain SafeTensors weights.")
|
|
150
|
+
|
|
151
|
+
return tensors, config, format_name, float(total_size)
|
|
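The single-roundtrip header read in `_fetch_safetensors_header` can be tried offline. The sketch below builds a SafeTensors-style blob in memory (an 8-byte little-endian length, then the JSON header, then payload) and shows both outcomes: the header fits in the first chunk, or a follow-up `Range` value is needed. `build_blob` and `split_header` are hypothetical helpers, not part of the package.

```python
import json
import struct
from typing import Any, Dict, Optional, Tuple


def build_blob(header: Dict[str, Any], payload: bytes = b"") -> bytes:
    """Assemble length prefix + JSON header + raw tensor bytes."""
    body = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(body)) + body + payload


def split_header(chunk: bytes) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (header, None) if the chunk holds the whole header,
    otherwise (None, range_value) describing the bytes still needed."""
    if len(chunk) < 8:
        raise ValueError("chunk is too small to contain a SafeTensors header")
    (header_size,) = struct.unpack("<Q", chunk[:8])
    end = 8 + header_size
    if end <= len(chunk):
        return json.loads(chunk[8:end]), None
    # HTTP byte ranges are inclusive, hence the -1.
    return None, f"bytes=8-{end - 1}"


header = {"w": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]}}
blob = build_blob(header, payload=b"\x00" * 8)

print(split_header(blob))       # header parsed from the first chunk
print(split_header(blob[:10]))  # chunk too short: a follow-up range is returned
```

Only the length prefix and the JSON header are ever read, which is why the tensor payload size does not affect the cost of inspection.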
@@ -43,6 +43,7 @@ def print_model_info(
|
|
|
43
43
|
footprint: Dict[str, Any],
|
|
44
44
|
disk_size: float,
|
|
45
45
|
context_length: int,
|
|
46
|
+
is_default_context: bool,
|
|
46
47
|
tensors: Dict[str, Any],
|
|
47
48
|
max_context: int | None = None
|
|
48
49
|
) -> None:
|
|
@@ -58,21 +59,38 @@ def print_model_info(
|
|
|
58
59
|
param_text = "[yellow]UNKNOWN (Missing Shards)[/yellow]"
|
|
59
60
|
disk_text = "[yellow]UNKNOWN (Missing Shards)[/yellow]"
|
|
60
61
|
vram_display = "[yellow]UNKNOWN (Missing Shards)[/yellow]"
|
|
62
|
+
elif footprint["base_memory_bytes"] == 0:
|
|
63
|
+
param_text = format_params(footprint["total_params"])
|
|
64
|
+
disk_text = format_bytes(disk_size)
|
|
65
|
+
vram_display = "[red]Unknown (Missing Tensor Shapes)[/red]"
|
|
61
66
|
else:
|
|
62
67
|
param_text = format_params(footprint["total_params"])
|
|
63
68
|
disk_text = format_bytes(disk_size)
|
|
64
69
|
vram_bytes = footprint["total_memory_bytes"]
|
|
65
70
|
vram_color = "green" if vram_bytes < 8 * 1024**3 else "yellow" if vram_bytes < 16 * 1024**3 else "red"
|
|
66
71
|
|
|
67
|
-
vram_text = f"~{format_bytes(vram_bytes)}"
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
72
|
+
vram_text = f"~{format_bytes(vram_bytes)} Total Minimum Required"
|
|
73
|
+
vram_display = f"[{vram_color}]{vram_text}[/{vram_color}]\n"
|
|
74
|
+
|
|
75
|
+
weights_bytes = footprint["base_memory_bytes"]
|
|
76
|
+
vram_display += f" ├─ Weights: {format_bytes(weights_bytes)}\n"
|
|
77
|
+
|
|
78
|
+
kv_cache_bytes = footprint["kv_cache_bytes"]
|
|
79
|
+
|
|
80
|
+
if footprint.get("kv_is_estimate"):
|
|
81
|
+
kv_note = " (Estimated KV Cache - Missing Config)"
|
|
82
|
+
elif is_default_context:
|
|
83
|
+
if max_context and max_context > context_length:
|
|
84
|
+
kv_note = f" (Default {context_length} tokens. Native limit: {max_context:,})"
|
|
71
85
|
else:
|
|
72
|
-
|
|
86
|
+
kv_note = f" (Default {context_length} tokens)"
|
|
73
87
|
else:
|
|
74
|
-
|
|
75
|
-
|
|
88
|
+
kv_note = f" ({context_length} tokens)"
|
|
89
|
+
|
|
90
|
+
vram_display += f" ├─ KV Cache: {format_bytes(kv_cache_bytes)}{kv_note}\n"
|
|
91
|
+
|
|
92
|
+
overhead_bytes = footprint.get("overhead_bytes", 600 * 1024 * 1024)
|
|
93
|
+
vram_display += f" └─ Overhead: {format_bytes(overhead_bytes)} (CUDA Context + Activations)"
|
|
76
94
|
|
|
77
95
|
summary.add_row("Format:", format_name)
|
|
78
96
|
summary.add_row("Architecture:", arch_name)
|
|
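The new breakdown and its KV cache annotation can be reproduced as plain text without `rich`. This is an illustrative sketch: `kv_note` and `render_breakdown` are hypothetical names, and `format_bytes` here is a stand-in whose output format is an assumption, since the package's own helper is not shown in this diff.

```python
from typing import Optional


def format_bytes(num_bytes: float) -> str:
    """Stand-in formatter using 1024-based units (assumed format)."""
    value = float(num_bytes)
    for unit in ("B", "KB", "MB"):
        if value < 1024:
            return f"{value:.2f} {unit}"
        value /= 1024
    return f"{value:.2f} GB"


def kv_note(context_length: int, is_default: bool,
            max_context: Optional[int] = None, kv_is_estimate: bool = False) -> str:
    """Mirror the annotation branches from the hunk above."""
    if kv_is_estimate:
        return " (Estimated KV Cache - Missing Config)"
    if is_default:
        if max_context and max_context > context_length:
            return f" (Default {context_length} tokens. Native limit: {max_context:,})"
        return f" (Default {context_length} tokens)"
    return f" ({context_length} tokens)"


def render_breakdown(weights: int, kv_cache: int, overhead: int, note: str = "") -> str:
    total = weights + kv_cache + overhead
    return "\n".join([
        f"~{format_bytes(total)} Total Minimum Required",
        f" ├─ Weights: {format_bytes(weights)}",
        f" ├─ KV Cache: {format_bytes(kv_cache)}{note}",
        f" └─ Overhead: {format_bytes(overhead)}",
    ])


print(render_breakdown(13 * 1024**3, 4 * 1024**3, 600 * 1024**2,
                       kv_note(8192, True, max_context=32768)))
```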
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: modelinfo-cli
|
|
3
|
-
Version: 1.1.0
|
|
3
|
+
Version: 1.2.0
|
|
4
4
|
Summary: A sub-100ms, zero-dependency CLI tool to inspect ML model checkpoints and dynamically calculate VRAM requirements.
|
|
5
5
|
Author: ModelInfo Contributors
|
|
6
6
|
License: MIT
|
|
@@ -29,11 +29,12 @@ It reads binary headers directly using the Python standard library. By bypassing
|
|
|
29
29
|
## Features
|
|
30
30
|
|
|
31
31
|
- **Zero-Dependency Parsing**: Reads the 8-byte JSON prefix of `.safetensors` files and the binary key-value metadata of `.gguf` directly via `struct` and `json`. Seamlessly reads adjacent `config.json` for robust fallback logic.
|
|
32
|
+
- **Remote Hugging Face Hub Inspection**: Inspect any public or gated model directly via its repo ID (e.g., `modelinfo meta-llama/Llama-2-7b-hf`) without downloading the 15GB checkpoint. Uses concurrent byte-range requests to pluck the binary headers directly off the CDN in under 2 seconds.
|
|
32
33
|
- **Sharded Model Support**: Transparently parses `model.safetensors.index.json` to detect multi-file checkpoint distributions, gracefully guarding against partial downloads without crashing.
|
|
33
|
-
- **Dynamic VRAM Estimation**: Extracts underlying model architecture (layers, heads, dimensions) to calculate exact VRAM limits, including dynamic KV cache footprints based on user-specified context lengths.
|
|
34
|
+
- **Dynamic VRAM Estimation**: Extracts the underlying model architecture (layers, heads, dimensions) to estimate VRAM requirements, including dynamic KV cache footprints based on user-specified context lengths. When no context length is given, defaults to 8192 tokens (capped at the model's native limit when that is smaller) to prevent unrealistic VRAM calculations, while still warning users if the requested context exceeds the model's native limit. Estimates include a standard 600MB CUDA context overhead.
|
|
34
35
|
- **Precise Block Quantization**: Factors in exact byte-scaling coefficients for GGUF formats (e.g., Q8, Q6, Q4) rather than naive averages, eliminating VRAM under-reporting.
|
|
35
36
|
- **Secure Pickling**: Inspects legacy `.pt` files without executing arbitrary code by using a highly restricted `pickle.Unpickler`.
|
|
36
|
-
- **Terminal UI**: Groups repetitive structural layers and color-codes VRAM heatmaps using `rich`.
|
|
37
|
+
- **Terminal UI**: Groups repetitive structural layers and color-codes VRAM heatmaps using `rich`. Breaks down memory footprints into Weights, KV Cache, and Overhead.
|
|
37
38
|
|
|
38
39
|
## Installation
|
|
39
40
|
|
|
@@ -67,12 +68,18 @@ pytest tests/ -v
|
|
|
67
68
|
|
|
68
69
|
## Usage
|
|
69
70
|
|
|
70
|
-
Inspect a model checkpoint:
|
|
71
|
+
Inspect a local model checkpoint:
|
|
71
72
|
|
|
72
73
|
```bash
|
|
73
74
|
modelinfo mistral-7b.safetensors
|
|
74
75
|
```
|
|
75
76
|
|
|
77
|
+
Inspect a remote model directly from the Hugging Face Hub:
|
|
78
|
+
|
|
79
|
+
```bash
|
|
80
|
+
modelinfo meta-llama/Llama-2-7b-hf
|
|
81
|
+
```
|
|
82
|
+
|
|
76
83
|
Calculate the memory footprint with a specific KV cache context window:
|
|
77
84
|
|
|
78
85
|
```bash
|
|
@@ -10,6 +10,7 @@ src/modelinfo/ui.py
|
|
|
10
10
|
src/modelinfo/parsers/__init__.py
|
|
11
11
|
src/modelinfo/parsers/base.py
|
|
12
12
|
src/modelinfo/parsers/gguf.py
|
|
13
|
+
src/modelinfo/parsers/huggingface.py
|
|
13
14
|
src/modelinfo/parsers/pytorch.py
|
|
14
15
|
src/modelinfo/parsers/safetensors.py
|
|
15
16
|
src/modelinfo_cli.egg-info/PKG-INFO
|
|
@@ -99,3 +99,14 @@ def test_kv_cache_is_fp16():
|
|
|
99
99
|
|
|
100
100
|
assert footprint["kv_cache_bytes"] == 33554432
|
|
101
101
|
assert footprint["primary_dtype"] == "Q4"
|
|
102
|
+
|
|
103
|
+
def test_framework_overhead_included():
|
|
104
|
+
"""Verify that CUDA context and activation overhead is correctly included."""
|
|
105
|
+
tensors = {
|
|
106
|
+
"model.layers.0.attn.weight": {"shape": [1024, 1024], "dtype": "F16"}
|
|
107
|
+
}
|
|
108
|
+
footprint = calculate_footprint(tensors)
|
|
109
|
+
|
|
110
|
+
assert "overhead_bytes" in footprint
|
|
111
|
+
assert footprint["overhead_bytes"] == 600 * 1024 * 1024
|
|
112
|
+
assert footprint["total_memory_bytes"] == footprint["base_memory_bytes"] + footprint["kv_cache_bytes"] + footprint["overhead_bytes"]
|
|
File without changes