vllm-wizard 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (82)
  1. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/API Reference.md +564 -0
  2. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/CLI Integration API.md +426 -0
  3. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Hardware Detection API.md +633 -0
  4. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Planning Engine API/Memory Calculation API.md +394 -0
  5. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Planning Engine API/Performance Estimation API.md +326 -0
  6. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Planning Engine API/Planner Core.md +455 -0
  7. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Planning Engine API/Planning Engine API.md +472 -0
  8. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Planning Engine API/Recommendation System API.md +444 -0
  9. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Rendering System API/Command Generation API.md +441 -0
  10. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Rendering System API/Kubernetes Deployment API.md +315 -0
  11. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Rendering System API/Profile Serialization API.md +453 -0
  12. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Rendering System API/Rendering System API.md +474 -0
  13. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Rendering System API/Report Generation API.md +423 -0
  14. vllm_wizard-0.1.0/.qoder/repowiki/en/content/API Reference/Schema Models.md +427 -0
  15. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Advanced Usage and Integration/Advanced Usage and Integration.md +483 -0
  16. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Advanced Usage and Integration/Automation and Scripting.md +536 -0
  17. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Advanced Usage and Integration/Custom Plugins and Extensions.md +499 -0
  18. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Advanced Usage and Integration/Integration Patterns.md +402 -0
  19. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Advanced Usage and Integration/Performance Tuning and Optimization.md +462 -0
  20. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Artifact Generation System/Artifact Generation System.md +371 -0
  21. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Artifact Generation System/Docker Artifact Generation.md +356 -0
  22. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Artifact Generation System/Kubernetes Artifact Generation.md +348 -0
  23. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Artifact Generation System/YAML Profile Serialization.md +384 -0
  24. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Artifact Generation System/vLLM Serve Command Generation.md +287 -0
  25. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Command Reference/Command Reference.md +487 -0
  26. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Command Reference/Detect Command.md +313 -0
  27. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Command Reference/Generate Command.md +381 -0
  28. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Command Reference/Plan Command.md +414 -0
  29. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Configuration Management.md +437 -0
  30. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Core Concepts/Core Concepts.md +441 -0
  31. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Core Concepts/GPU Detection and Hardware Compatibility.md +391 -0
  32. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Core Concepts/Model Metadata Processing.md +350 -0
  33. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Core Concepts/Performance Estimation Methodology.md +400 -0
  34. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Core Concepts/VRAM Memory Breakdown.md +409 -0
  35. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Developer Guide.md +558 -0
  36. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Installation and Setup.md +304 -0
  37. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Memory Calculation Framework/Feasibility Analysis and Risk Assessment.md +351 -0
  38. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Memory Calculation Framework/KV Cache Computation.md +318 -0
  39. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Memory Calculation Framework/Memory Calculation Framework.md +475 -0
  40. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Memory Calculation Framework/Overhead Estimation.md +306 -0
  41. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Memory Calculation Framework/Weights Memory Calculation.md +317 -0
  42. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Performance Estimation.md +324 -0
  43. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Project Overview/Architecture and Design.md +512 -0
  44. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Project Overview/Core Features Overview/Configuration Generation.md +428 -0
  45. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Project Overview/Core Features Overview/Core Features Overview.md +365 -0
  46. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Project Overview/Core Features Overview/GPU Detection System.md +434 -0
  47. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Project Overview/Core Features Overview/Performance Estimation.md +331 -0
  48. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Project Overview/Core Features Overview/VRAM Feasibility Analysis.md +465 -0
  49. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Project Overview/Installation and Setup.md +171 -0
  50. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Project Overview/Project Overview.md +483 -0
  51. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Project Overview/Quick Start Guide.md +509 -0
  52. vllm_wizard-0.1.0/.qoder/repowiki/en/content/Quick Start Guide.md +406 -0
  53. vllm_wizard-0.1.0/.qoder/repowiki/en/meta/repowiki-metadata.json +1 -0
  54. vllm_wizard-0.1.0/LICENSE +201 -0
  55. vllm_wizard-0.1.0/PKG-INFO +344 -0
  56. vllm_wizard-0.1.0/README.md +307 -0
  57. vllm_wizard-0.1.0/examples/profiles/sample.yaml +39 -0
  58. vllm_wizard-0.1.0/pyproject.toml +70 -0
  59. vllm_wizard-0.1.0/requirements.txt +504 -0
  60. vllm_wizard-0.1.0/src/vllm_wizard/__init__.py +3 -0
  61. vllm_wizard-0.1.0/src/vllm_wizard/cli.py +384 -0
  62. vllm_wizard-0.1.0/src/vllm_wizard/hardware/__init__.py +8 -0
  63. vllm_wizard-0.1.0/src/vllm_wizard/hardware/detect.py +228 -0
  64. vllm_wizard-0.1.0/src/vllm_wizard/models/__init__.py +13 -0
  65. vllm_wizard-0.1.0/src/vllm_wizard/models/metadata.py +254 -0
  66. vllm_wizard-0.1.0/src/vllm_wizard/planning/__init__.py +31 -0
  67. vllm_wizard-0.1.0/src/vllm_wizard/planning/memory.py +366 -0
  68. vllm_wizard-0.1.0/src/vllm_wizard/planning/perf.py +219 -0
  69. vllm_wizard-0.1.0/src/vllm_wizard/planning/planner.py +171 -0
  70. vllm_wizard-0.1.0/src/vllm_wizard/planning/recommend.py +332 -0
  71. vllm_wizard-0.1.0/src/vllm_wizard/render/__init__.py +19 -0
  72. vllm_wizard-0.1.0/src/vllm_wizard/render/commands.py +195 -0
  73. vllm_wizard-0.1.0/src/vllm_wizard/render/profile.py +172 -0
  74. vllm_wizard-0.1.0/src/vllm_wizard/render/report.py +271 -0
  75. vllm_wizard-0.1.0/src/vllm_wizard/schemas/__init__.py +49 -0
  76. vllm_wizard-0.1.0/src/vllm_wizard/schemas/inputs.py +109 -0
  77. vllm_wizard-0.1.0/src/vllm_wizard/schemas/outputs.py +117 -0
  78. vllm_wizard-0.1.0/src/vllm_wizard/schemas/profile.py +74 -0
  79. vllm_wizard-0.1.0/tests/conftest.py +125 -0
  80. vllm_wizard-0.1.0/tests/test_cli.py +191 -0
  81. vllm_wizard-0.1.0/tests/test_memory.py +264 -0
  82. vllm_wizard-0.1.0/tests/test_models.py +112 -0
@@ -0,0 +1,564 @@
1
+ # API Reference
2
+
3
+ <cite>
4
+ **Referenced Files in This Document**
5
+ - [__init__.py](file://src/vllm_wizard/__init__.py)
6
+ - [cli.py](file://src/vllm_wizard/cli.py)
7
+ - [schemas/inputs.py](file://src/vllm_wizard/schemas/inputs.py)
8
+ - [schemas/outputs.py](file://src/vllm_wizard/schemas/outputs.py)
9
+ - [schemas/profile.py](file://src/vllm_wizard/schemas/profile.py)
10
+ - [planning/__init__.py](file://src/vllm_wizard/planning/__init__.py)
11
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py)
12
+ - [planning/memory.py](file://src/vllm_wizard/planning/memory.py)
13
+ - [planning/perf.py](file://src/vllm_wizard/planning/perf.py)
14
+ - [planning/recommend.py](file://src/vllm_wizard/planning/recommend.py)
15
+ - [hardware/__init__.py](file://src/vllm_wizard/hardware/__init__.py)
16
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py)
17
+ - [models/metadata.py](file://src/vllm_wizard/models/metadata.py)
18
+ - [render/__init__.py](file://src/vllm_wizard/render/__init__.py)
19
+ - [render/commands.py](file://src/vllm_wizard/render/commands.py)
20
+ - [render/profile.py](file://src/vllm_wizard/render/profile.py)
21
+ - [render/report.py](file://src/vllm_wizard/render/report.py)
22
+ </cite>
23
+
24
+ ## Table of Contents
25
+ 1. [Introduction](#introduction)
26
+ 2. [Project Structure](#project-structure)
27
+ 3. [Core Components](#core-components)
28
+ 4. [Architecture Overview](#architecture-overview)
29
+ 5. [Detailed Component Analysis](#detailed-component-analysis)
30
+ 6. [Dependency Analysis](#dependency-analysis)
31
+ 7. [Performance Considerations](#performance-considerations)
32
+ 8. [Troubleshooting Guide](#troubleshooting-guide)
33
+ 9. [Conclusion](#conclusion)
34
+ 10. [Appendices](#appendices)
35
+
36
+ ## Introduction
37
+ This API reference documents the internal Python API of the vLLM Wizard project. It covers the planning engine, hardware detection, model metadata extraction, and rendering system. It also defines input and output schemas, enumerations, and public functions exposed by the package’s modules. Guidance is included for programmatic usage, integration patterns, error handling, validation, type annotations, and extension points.
38
+
39
+ ## Project Structure
40
+ The project is organized into cohesive modules:
41
+ - CLI entrypoints and commands
42
+ - Planning engine (memory feasibility, performance estimation, recommendations)
43
+ - Hardware detection and tensor-parallel recommendation
44
+ - Model metadata extraction from local or Hugging Face repositories
45
+ - Rendering of commands, profiles, and reports
46
+ - Pydantic-based input/output schemas
47
+
48
+ ```mermaid
49
+ graph TB
50
+ subgraph "CLI"
51
+ CLI["cli.py"]
52
+ end
53
+ subgraph "Planning"
54
+ PLAN_PKG["planning/__init__.py"]
55
+ PLAN_RUN["planning/planner.py"]
56
+ MEM["planning/memory.py"]
57
+ PERF["planning/perf.py"]
58
+ REC["planning/recommend.py"]
59
+ end
60
+ subgraph "Hardware"
61
+ HW_PKG["hardware/__init__.py"]
62
+ DETECT["hardware/detect.py"]
63
+ end
64
+ subgraph "Models"
65
+ META["models/metadata.py"]
66
+ end
67
+ subgraph "Render"
68
+ REND_PKG["render/__init__.py"]
69
+ CMDS["render/commands.py"]
70
+ PROF_R["render/profile.py"]
71
+ REP["render/report.py"]
72
+ end
73
+ subgraph "Schemas"
74
+ INP["schemas/inputs.py"]
75
+ OUT["schemas/outputs.py"]
76
+ PROF_S["schemas/profile.py"]
77
+ end
78
+ CLI --> PLAN_RUN
79
+ PLAN_RUN --> DETECT
80
+ PLAN_RUN --> META
81
+ PLAN_RUN --> MEM
82
+ PLAN_RUN --> PERF
83
+ PLAN_RUN --> REC
84
+ PLAN_RUN --> CMDS
85
+ CLI --> INP
86
+ CLI --> OUT
87
+ CLI --> PROF_R
88
+ CLI --> REP
89
+ PLAN_PKG --> MEM
90
+ PLAN_PKG --> PERF
91
+ PLAN_PKG --> REC
92
+ PLAN_PKG --> PLAN_RUN
93
+ HW_PKG --> DETECT
94
+ REND_PKG --> CMDS
95
+ REND_PKG --> PROF_R
96
+ REND_PKG --> REP
97
+ ```
98
+
99
+ **Diagram sources**
100
+ - [cli.py](file://src/vllm_wizard/cli.py#L1-L385)
101
+ - [planning/__init__.py](file://src/vllm_wizard/planning/__init__.py#L1-L32)
102
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L1-L172)
103
+ - [planning/memory.py](file://src/vllm_wizard/planning/memory.py#L1-L367)
104
+ - [planning/perf.py](file://src/vllm_wizard/planning/perf.py#L1-L220)
105
+ - [planning/recommend.py](file://src/vllm_wizard/planning/recommend.py#L1-L333)
106
+ - [hardware/__init__.py](file://src/vllm_wizard/hardware/__init__.py#L1-L9)
107
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L1-L229)
108
+ - [models/metadata.py](file://src/vllm_wizard/models/metadata.py#L1-L255)
109
+ - [render/__init__.py](file://src/vllm_wizard/render/__init__.py#L1-L20)
110
+ - [render/commands.py](file://src/vllm_wizard/render/commands.py#L1-L196)
111
+ - [render/profile.py](file://src/vllm_wizard/render/profile.py)
112
+ - [render/report.py](file://src/vllm_wizard/render/report.py)
113
+ - [schemas/inputs.py](file://src/vllm_wizard/schemas/inputs.py#L1-L110)
114
+ - [schemas/outputs.py](file://src/vllm_wizard/schemas/outputs.py#L1-L118)
115
+ - [schemas/profile.py](file://src/vllm_wizard/schemas/profile.py#L1-L75)
116
+
117
+ **Section sources**
118
+ - [cli.py](file://src/vllm_wizard/cli.py#L1-L385)
119
+ - [planning/__init__.py](file://src/vllm_wizard/planning/__init__.py#L1-L32)
120
+ - [hardware/__init__.py](file://src/vllm_wizard/hardware/__init__.py#L1-L9)
121
+ - [render/__init__.py](file://src/vllm_wizard/render/__init__.py#L1-L20)
122
+ - [schemas/inputs.py](file://src/vllm_wizard/schemas/inputs.py#L1-L110)
123
+ - [schemas/outputs.py](file://src/vllm_wizard/schemas/outputs.py#L1-L118)
124
+ - [schemas/profile.py](file://src/vllm_wizard/schemas/profile.py#L1-L75)
125
+
126
+ ## Core Components
127
+ This section summarizes the primary public APIs grouped by module.
128
+
129
+ - CLI Commands
130
+   - detect: Lists detected GPUs or prints as JSON
131
+   - plan: Builds a PlanRequest, runs planning, and renders a console report or JSON
132
+   - generate: Builds a PlanRequest, runs planning, and writes artifacts to disk
133
+
134
+ - Planning Engine
135
+   - run_plan: Orchestrates metadata loading, hardware resolution, memory feasibility, recommendations, performance estimation, and artifact generation
136
+   - compute_* helpers: Memory computations and feasibility checks
137
+   - estimate_performance: Heuristic performance estimation
138
+   - generate_recommendations: Produces a VLLMConfig with tuned parameters
139
+
140
+ - Hardware Detection
141
+   - detect_gpus: Queries nvidia-smi and enriches GPUInfo
142
+   - recommend_tensor_parallel: Recommends TP size
143
+   - get_gpu_by_name: Approximates VRAM for known GPU names
144
+
145
+ - Model Metadata
146
+   - load_model_metadata: Loads and parses config.json from local or HF Hub
147
+   - ModelMetadata: Dataclass with architecture fields and derived properties
148
+
149
+ - Rendering
150
+   - render_serve_command, render_docker_command, render_docker_compose, render_k8s_values: Generate runnable commands and deployment manifests
151
+   - load_profile, save_profile, profile_to_request, request_to_profile: Profile serialization and conversion
152
+   - render_console_report, render_json: Human-readable and JSON reporting
153
+
154
+ - Schemas
155
+   - Inputs: DType, Quantization, KVCacheDType, Interconnect, BatchingMode, ModelInput, HardwareInput, WorkloadInput, PolicyInput, PlanRequest
156
+   - Outputs: GPUInfo, FeasibilityReport, VLLMConfig, PerfEstimate, Artifacts, PlanResponse
157
+   - Profile: ProfileModel, ProfileHardware, ProfileWorkload, ProfilePolicy, ProfileOutputs, Profile
158
+
159
+ **Section sources**
160
+ - [cli.py](file://src/vllm_wizard/cli.py#L62-L381)
161
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L21-L135)
162
+ - [planning/memory.py](file://src/vllm_wizard/planning/memory.py#L31-L367)
163
+ - [planning/perf.py](file://src/vllm_wizard/planning/perf.py#L136-L220)
164
+ - [planning/recommend.py](file://src/vllm_wizard/planning/recommend.py#L167-L333)
165
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L10-L229)
166
+ - [models/metadata.py](file://src/vllm_wizard/models/metadata.py#L209-L255)
167
+ - [render/commands.py](file://src/vllm_wizard/render/commands.py#L6-L196)
168
+ - [schemas/inputs.py](file://src/vllm_wizard/schemas/inputs.py#L54-L110)
169
+ - [schemas/outputs.py](file://src/vllm_wizard/schemas/outputs.py#L17-L118)
170
+ - [schemas/profile.py](file://src/vllm_wizard/schemas/profile.py#L16-L75)
171
+
172
+ ## Architecture Overview
173
+ The planning pipeline integrates hardware detection, model metadata extraction, memory feasibility computation, recommendation engine, performance estimation, and artifact rendering.
174
+
175
+ ```mermaid
176
+ sequenceDiagram
177
+ participant User as "Caller"
178
+ participant CLI as "cli.plan()"
179
+ participant Planner as "planning.run_plan()"
180
+ participant Meta as "models.load_model_metadata()"
181
+ participant HW as "hardware.detect_gpus()"
182
+ participant Mem as "planning.compute_*"
183
+ participant Rec as "planning.generate_recommendations"
184
+ participant Perf as "planning.estimate_performance"
185
+ participant Cmd as "render.*_command"
186
+ User->>CLI : Invoke plan()
187
+ CLI->>Planner : Build PlanRequest and run_plan(request)
188
+ Planner->>Meta : Load model metadata
189
+ Planner->>HW : Resolve hardware (detect or configured)
190
+ Planner->>Mem : Weights/KV/Overhead/Feasibility
191
+ Planner->>Rec : Generate VLLMConfig
192
+ Planner->>Perf : Estimate performance ranges
193
+ Planner->>Cmd : Render serve/docker/compose/k8s
194
+ Planner-->>CLI : PlanResponse
195
+ CLI-->>User : Report or JSON
196
+ ```
197
+
198
+ **Diagram sources**
199
+ - [cli.py](file://src/vllm_wizard/cli.py#L82-L213)
200
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L21-L135)
201
+ - [models/metadata.py](file://src/vllm_wizard/models/metadata.py#L209-L255)
202
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L10-L72)
203
+ - [planning/memory.py](file://src/vllm_wizard/planning/memory.py#L31-L271)
204
+ - [planning/recommend.py](file://src/vllm_wizard/planning/recommend.py#L167-L333)
205
+ - [planning/perf.py](file://src/vllm_wizard/planning/perf.py#L136-L220)
206
+ - [render/commands.py](file://src/vllm_wizard/render/commands.py#L6-L196)
207
+
208
+ ## Detailed Component Analysis
209
+
210
+ ### CLI API
211
+ Public entrypoints and their responsibilities:
212
+ - detect: Detects GPUs and prints a formatted list or JSON
213
+ - plan: Builds PlanRequest from CLI options or profile, runs planning, prints report or JSON
214
+ - generate: Builds PlanRequest, runs planning, writes artifacts to output directory
215
+
216
+ Key behaviors:
217
+ - Input parsing and validation via Typer options
218
+ - Error handling with colored console output and exit codes
219
+ - JSON vs human-readable output toggles
220
+
221
+ Programmatic usage patterns:
222
+ - Construct PlanRequest programmatically and call run_plan
223
+ - Use load_profile/save_profile for persistent configuration snapshots
224
+
225
+ **Section sources**
226
+ - [cli.py](file://src/vllm_wizard/cli.py#L62-L381)
227
+
228
+ ### Planning Engine API
229
+ - run_plan(request: PlanRequest) -> PlanResponse
230
+   - Orchestrates metadata loading, hardware resolution, memory feasibility, recommendations, performance estimation, and artifact rendering
231
+   - Raises ValueError if no GPUs are detected or specified
232
+ - compute_weights_memory(params_b, dtype, quantization) -> int (size in bytes)
233
+ - compute_kv_cache_memory(metadata, context_len, concurrency, kv_dtype, dtype, fragmentation_factor) -> int (size in bytes)
234
+ - compute_overhead(vram_total_bytes, tp_size, fixed_overhead_gb) -> int (size in bytes)
235
+ - compute_feasibility(...) -> FeasibilityReport
236
+ - compute_max_concurrency_at_context(...)
237
+ - compute_max_context_at_concurrency(...)
238
+ - estimate_performance(gpu_name, params_b, tp_size, context_len, prompt_tokens, quantization, interconnect, num_gpus) -> PerfEstimate
239
+ - generate_recommendations(request, metadata, gpus, vram_total_bytes) -> VLLMConfig
240
+
241
+ Integration patterns:
242
+ - Call run_plan with a populated PlanRequest
243
+ - Use compute_* functions independently for custom logic
244
+ - Combine estimate_performance with feasibility results for richer reports
245
+
246
+ **Section sources**
247
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L21-L172)
248
+ - [planning/memory.py](file://src/vllm_wizard/planning/memory.py#L31-L367)
249
+ - [planning/perf.py](file://src/vllm_wizard/planning/perf.py#L136-L220)
250
+ - [planning/recommend.py](file://src/vllm_wizard/planning/recommend.py#L167-L333)
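
To make the memory helpers above more concrete, here is a minimal sketch of the usual back-of-the-envelope arithmetic for weights and KV cache sizes. It is not the package's implementation: the function names, the `BYTES_PER_PARAM` table, and the 0.5 bytes-per-parameter figure for 4-bit formats (which ignores scales and zero points) are all assumptions made for illustration.

```python
# Illustrative sketch only: names and constants are assumptions,
# not the code in planning/memory.py.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "fp8": 1.0,
    "awq": 0.5,   # 4-bit weights, ignoring quantization metadata
    "gptq": 0.5,  # 4-bit weights, ignoring quantization metadata
}


def weights_bytes(params_b: float, precision: str) -> int:
    """Approximate weight memory: parameter count times bytes per parameter."""
    return int(params_b * 1e9 * BYTES_PER_PARAM[precision])


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    concurrency: int,
    bytes_per_elem: float = 2.0,
    fragmentation_factor: float = 1.0,
) -> int:
    """Approximate KV cache memory for `concurrency` sequences of `context_len` tokens."""
    # One key vector and one value vector per layer and per KV head for every token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return int(per_token * context_len * concurrency * fragmentation_factor)
```

Raising `fragmentation_factor` above 1.0 inflates the KV estimate, which is the conservative direction when paging or allocator overhead is unknown.
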
251
+
252
+ ### Hardware Detection API
253
+ - detect_gpus() -> list[GPUInfo]
254
+   - Executes nvidia-smi and enriches each result with driver/CUDA versions and compute capability
255
+   - Returns an empty list on failure
256
+ - recommend_tensor_parallel(gpus: list[GPUInfo]) -> int
257
+   - Largest power-of-two tensor-parallel size ≤ number of GPUs
258
+ - get_gpu_by_name(name: str) -> Optional[GPUInfo]
259
+   - Approximates VRAM for known GPU names
260
+
261
+ Usage:
262
+ - Call detect_gpus() when hardware.gpu == "auto"
263
+ - Use recommend_tensor_parallel to auto-size TP
264
+ - Fallback to get_gpu_by_name for manual VRAM estimation
265
+
266
+ **Section sources**
267
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L10-L229)
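
The "largest power of two ≤ number of GPUs" rule described for recommend_tensor_parallel can be sketched as below. The helper name `largest_power_of_two_at_most` is hypothetical, it takes a GPU count rather than a list of GPUInfo, and rejecting counts below 1 is an assumption about edge-case handling that the reference does not specify.

```python
def largest_power_of_two_at_most(num_gpus: int) -> int:
    """Return the largest power of two that does not exceed num_gpus."""
    if num_gpus < 1:
        # Assumption: the sketch rejects empty hardware instead of guessing a size.
        raise ValueError("num_gpus must be at least 1")
    # bit_length() - 1 is the index of the highest set bit, i.e. floor(log2(n)).
    return 1 << (num_gpus.bit_length() - 1)
```

With this rule, 6 GPUs map to a tensor-parallel size of 4 and 8 GPUs map to 8.
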
268
+
269
+ ### Model Metadata API
270
+ - load_model_metadata(model_id_or_path, revision, trust_remote_code, params_b) -> ModelMetadata
271
+   - Loads config.json from local path or HF Hub
272
+   - Parses architecture fields and derives head_dim
273
+   - Estimates or sets num_params
274
+ - ModelMetadata fields and properties
275
+   - num_hidden_layers, hidden_size, num_attention_heads, num_key_value_heads, vocab_size, max_position_embeddings, model_type, intermediate_size, num_params
276
+   - head_dim property
277
+   - params_billions property
278
+
279
+ Validation and errors:
280
+ - Raises FileNotFoundError if local config not found
281
+ - Raises ValueError for missing required fields or invalid config structure
282
+
283
+ **Section sources**
284
+ - [models/metadata.py](file://src/vllm_wizard/models/metadata.py#L209-L255)
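
The two derived properties listed above can be illustrated with a small stand-in dataclass. `ModelShape` is a hypothetical name that carries only three of the documented fields; deriving head_dim as hidden_size divided by num_attention_heads is the common convention and is assumed here rather than confirmed from the source.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical stand-in for a subset of the ModelMetadata fields."""

    hidden_size: int
    num_attention_heads: int
    num_params: int

    @property
    def head_dim(self) -> int:
        # Assumed convention: the hidden size is split evenly across attention heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def params_billions(self) -> float:
        # Parameter count expressed in billions.
        return self.num_params / 1e9
```
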
285
+
286
+ ### Rendering API
287
+ - render_serve_command(config: VLLMConfig) -> str
288
+ - render_docker_command(config: VLLMConfig) -> str
289
+ - render_docker_compose(config: VLLMConfig) -> str
290
+ - render_k8s_values(config: VLLMConfig) -> str
291
+ - load_profile(path: Path) -> Profile
292
+ - save_profile(profile: Profile, path: Path) -> None
293
+ - profile_to_request(profile: Profile) -> PlanRequest
294
+ - request_to_profile(request: PlanRequest, emit: list[str]) -> Profile
295
+ - render_console_report(response: PlanResponse, console) -> None
296
+ - render_json(response: PlanResponse) -> str
297
+
298
+ Integration patterns:
299
+ - Use render_* functions to produce runnable commands and deployment manifests
300
+ - Serialize/deserialize configurations via profile functions
301
+
302
+ **Section sources**
303
+ - [render/commands.py](file://src/vllm_wizard/render/commands.py#L6-L196)
304
+ - [render/profile.py](file://src/vllm_wizard/render/profile.py)
305
+ - [render/report.py](file://src/vllm_wizard/render/report.py)
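
As a rough picture of what rendering a serve command involves, the sketch below turns a small config object into a `vllm serve` command line. `ServeConfig` and `build_serve_command` are hypothetical names that mirror only a few VLLMConfig fields, and the model ID in the usage note is a placeholder; the actual render_serve_command may emit different flags or ordering.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServeConfig:
    """Hypothetical subset of the VLLMConfig fields."""

    model: str
    tensor_parallel_size: int = 1
    gpu_memory_utilization: float = 0.9
    max_model_len: Optional[int] = None
    enforce_eager: bool = False


def build_serve_command(cfg: ServeConfig) -> str:
    """Assemble a shell-safe `vllm serve` command string from the config."""
    args = [
        "vllm", "serve", cfg.model,
        "--tensor-parallel-size", str(cfg.tensor_parallel_size),
        "--gpu-memory-utilization", str(cfg.gpu_memory_utilization),
    ]
    # Optional settings are emitted only when they are set.
    if cfg.max_model_len is not None:
        args += ["--max-model-len", str(cfg.max_model_len)]
    if cfg.enforce_eager:
        args.append("--enforce-eager")
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)
```

Calling `build_serve_command(ServeConfig(model="org/model", tensor_parallel_size=2))` yields a single string that can be pasted into a shell or embedded in a Docker or Kubernetes manifest.
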
306
+
307
+ ### Schemas and Data Models
308
+ Input schemas (enumerations and Pydantic BaseModel subclasses):
309
+ - DType: AUTO, FP16, BF16, FP32
310
+ - Quantization: NONE, AWQ, GPTQ, INT8, FP8
311
+ - KVCacheDType: AUTO, FP16, BF16, FP8_E4M3FN, FP8_E5M2
312
+ - Interconnect: PCIE, NVLINK, UNKNOWN
313
+ - BatchingMode: THROUGHPUT, LATENCY, BALANCED
314
+ - ModelInput: model, revision, trust_remote_code, dtype, quantization, kv_cache_dtype, max_model_len, tokenizer, params_b
315
+ - HardwareInput: gpu, gpus, vram_gb, interconnect, tensor_parallel_size
316
+ - WorkloadInput: prompt_tokens, gen_tokens, concurrency, target_latency_ms, streaming, batching_mode
317
+ - PolicyInput: gpu_memory_utilization, overhead_gb, fragmentation_factor, headroom_gb
318
+ - PlanRequest: model, hardware, workload, policy, explain
319
+
320
+ Output schemas:
321
+ - GPUInfo: name, vram_mib, compute_capability, driver_version, cuda_version (+ vram_gb, vram_gib properties)
322
+ - FeasibilityReport: fits, oom_risk, vram_total_gb, vram_target_alloc_gb, weights_gb, kv_cache_gb, overhead_gb, headroom_gb, max_concurrency_at_context, max_context_at_concurrency, warnings
323
+ - VLLMConfig: model, tensor_parallel_size, dtype, gpu_memory_utilization, max_model_len, kv_cache_dtype, quantization, swap_space, enforce_eager, max_num_seqs, max_num_batched_tokens, trust_remote_code, explanations
324
+ - PerfEstimate: decode_toks_per_s_range, prefill_toks_per_s_range, ttft_ms_range, assumptions
325
+ - Artifacts: serve_command, docker_command, docker_compose, k8s_values
326
+ - PlanResponse: feasibility, config, performance, artifacts (+ model_dump_json_pretty, to_dict helpers)
327
+
328
+ Profile schema:
329
+ - ProfileModel, ProfileHardware, ProfileWorkload, ProfilePolicy, ProfileOutputs
330
+ - Profile: profile_version, model, hardware, workload, policy, outputs
331
+
332
+ Validation:
333
+ - Pydantic field constraints (gt/ge/le ranges, enums, required fields)
334
+ - Runtime exceptions raised on invalid inputs
335
+
336
+ **Section sources**
337
+ - [schemas/inputs.py](file://src/vllm_wizard/schemas/inputs.py#L9-L110)
338
+ - [schemas/outputs.py](file://src/vllm_wizard/schemas/outputs.py#L17-L118)
339
+ - [schemas/profile.py](file://src/vllm_wizard/schemas/profile.py#L16-L75)
340
+
341
+ ### Extension Points and Plugin Interfaces
342
+ - Hardware detection extensibility:
343
+ - Implement alternative detection backends by returning GPUInfo-compatible structures
344
+ - Replace recommend_tensor_parallel with custom logic
345
+ - Model metadata extensibility:
346
+ - Extend load_model_metadata to support additional model sources or caching layers
347
+ - Rendering extensibility:
348
+ - Add new render_* functions for additional platforms or formats
349
+ - Extend ProfileOutputs to include custom runtime arguments
350
+ - Planning extensibility:
351
+ - Provide custom compute_* functions or integrate external memory calculators
352
+ - Swap estimate_performance with ML-based estimators
353
+
354
+ [No sources needed since this section provides general guidance]
355
+
356
+ ## Dependency Analysis
357
+ Module-level dependencies and coupling:
358
+ - CLI depends on planning, hardware, render, and schemas modules
359
+ - Planning orchestrator depends on hardware, models, memory, perf, recommend, and render
360
+ - Hardware detection depends on subprocess and schemas
361
+ - Rendering depends on schemas outputs
362
+ - Schemas are shared across modules and define contracts
363
+
364
+ ```mermaid
365
+ graph LR
366
+ CLI["cli.py"] --> PLAN_RUN["planning/planner.py"]
367
+ PLAN_RUN --> DETECT["hardware/detect.py"]
368
+ PLAN_RUN --> META["models/metadata.py"]
369
+ PLAN_RUN --> MEM["planning/memory.py"]
370
+ PLAN_RUN --> PERF["planning/perf.py"]
371
+ PLAN_RUN --> REC["planning/recommend.py"]
372
+ PLAN_RUN --> CMDS["render/commands.py"]
373
+ CLI --> INP["schemas/inputs.py"]
374
+ CLI --> OUT["schemas/outputs.py"]
375
+ CLI --> PROF_R["render/profile.py"]
376
+ CLI --> REP["render/report.py"]
377
+ ```
378
+
379
+ **Diagram sources**
380
+ - [cli.py](file://src/vllm_wizard/cli.py#L1-L385)
381
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L1-L172)
382
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L1-L229)
383
+ - [models/metadata.py](file://src/vllm_wizard/models/metadata.py#L1-L255)
384
+ - [planning/memory.py](file://src/vllm_wizard/planning/memory.py#L1-L367)
385
+ - [planning/perf.py](file://src/vllm_wizard/planning/perf.py#L1-L220)
386
+ - [planning/recommend.py](file://src/vllm_wizard/planning/recommend.py#L1-L333)
387
+ - [render/commands.py](file://src/vllm_wizard/render/commands.py#L1-L196)
388
+ - [schemas/inputs.py](file://src/vllm_wizard/schemas/inputs.py#L1-L110)
389
+ - [schemas/outputs.py](file://src/vllm_wizard/schemas/outputs.py#L1-L118)
390
+ - [render/profile.py](file://src/vllm_wizard/render/profile.py)
391
+ - [render/report.py](file://src/vllm_wizard/render/report.py)
392
+
393
+ **Section sources**
394
+ - [cli.py](file://src/vllm_wizard/cli.py#L1-L385)
395
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L1-L172)
396
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L1-L229)
397
+ - [models/metadata.py](file://src/vllm_wizard/models/metadata.py#L1-L255)
398
+ - [planning/memory.py](file://src/vllm_wizard/planning/memory.py#L1-L367)
399
+ - [planning/perf.py](file://src/vllm_wizard/planning/perf.py#L1-L220)
400
+ - [planning/recommend.py](file://src/vllm_wizard/planning/recommend.py#L1-L333)
401
+ - [render/commands.py](file://src/vllm_wizard/render/commands.py#L1-L196)
402
+ - [schemas/inputs.py](file://src/vllm_wizard/schemas/inputs.py#L1-L110)
403
+ - [schemas/outputs.py](file://src/vllm_wizard/schemas/outputs.py#L1-L118)
404
+ - [render/profile.py](file://src/vllm_wizard/render/profile.py)
405
+ - [render/report.py](file://src/vllm_wizard/render/report.py)
406
+
407
+ ## Performance Considerations
408
+ - Memory calculations use conservative fragmentation factors and utilization targets; adjust PolicyInput fragmentation_factor and gpu_memory_utilization to tune aggressiveness
409
+ - Performance estimates are heuristic and include ranges; actual performance depends on vLLM version, drivers, and runtime conditions
410
+ - Tensor-parallel scaling efficiency depends on interconnect type; NVLINK yields better scaling than PCIe
411
+ - KV cache pressure thresholds guide KV cache dtype recommendations; consider FP8 KV cache on supported GPUs when pressure is high
412
+
413
+ [No sources needed since this section provides general guidance]
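
The effect of gpu_memory_utilization on feasibility can be sketched with simple budget arithmetic. The function below is an assumption-laden illustration, not the logic of compute_feasibility: it reuses FeasibilityReport field names as dictionary keys, and the `min_headroom_gb` parameter is hypothetical.

```python
def memory_budget(
    vram_total_gb: float,
    gpu_memory_utilization: float,
    weights_gb: float,
    kv_cache_gb: float,
    overhead_gb: float,
    min_headroom_gb: float = 0.0,
) -> dict:
    """Compare estimated memory demand against the targeted VRAM allocation."""
    # Only a fraction of total VRAM is targeted for allocation.
    vram_target_alloc_gb = vram_total_gb * gpu_memory_utilization
    # Whatever remains after weights, KV cache, and overhead is headroom.
    headroom_gb = vram_target_alloc_gb - (weights_gb + kv_cache_gb + overhead_gb)
    return {
        "vram_target_alloc_gb": vram_target_alloc_gb,
        "headroom_gb": headroom_gb,
        "fits": headroom_gb >= min_headroom_gb,
    }
```

Lowering the utilization target shrinks the budget and can flip `fits` to False for the same model and workload, which is why that setting and fragmentation_factor are the main tuning knobs.
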
414
+
415
+ ## Troubleshooting Guide
416
+ Common exceptions and handling:
417
+ - ValueError: Raised for invalid inputs, missing model metadata, or insufficient VRAM configuration
418
+ - FileNotFoundError: Raised when local config.json is not found
419
+ - subprocess.TimeoutExpired and generic Exception: Caught inside hardware detection, which then returns an empty list instead of raising
420
+ - CLI catches ValueError, FileNotFoundError, and generic exceptions, printing colored error messages and exiting with non-zero status
421
+
422
+ Operational tips:
423
+ - Ensure nvidia-smi is installed and accessible when relying on auto-detection
424
+ - Provide explicit --gpu and --vram-gb when auto-detection fails
425
+ - Use --explain to include recommendation rationales in VLLMConfig.explanations
426
+ - Validate inputs against Pydantic constraints (e.g., positive integers, enum values)
427
+
428
+ **Section sources**
429
+ - [cli.py](file://src/vllm_wizard/cli.py#L204-L212)
430
+ - [cli.py](file://src/vllm_wizard/cli.py#L372-L380)
431
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L65-L71)
432
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L42-L45)
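
The "return an empty list on failure" behaviour described for hardware detection can be sketched as follows. This is an illustration under assumptions, not the code in hardware/detect.py: `query_gpus` and `parse_gpu_csv` are hypothetical names, and only the GPU name and total memory are queried.

```python
import subprocess


def parse_gpu_csv(text: str) -> list:
    """Parse 'name, memory_mib' CSV rows into (name, vram_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so any comma inside the name is preserved.
        name, _, mem = line.rpartition(",")
        if not name:
            continue
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus(timeout_s: float = 10.0) -> list:
    """Query nvidia-smi; return an empty list if it is missing, slow, or fails."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=name,memory.total",
        "--format=csv,noheader,nounits",  # memory.total is then a bare MiB number
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, timeout, or non-zero exit status all mean "no GPUs found".
        return []
    try:
        return parse_gpu_csv(result.stdout)
    except ValueError:
        # Unparseable memory values (for example "[N/A]") are also treated as failure.
        return []
```

Because failures collapse to an empty list, callers should check for that case and fall back to explicit --gpu and --vram-gb values.
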
433
+
434
+ ## Conclusion
435
+ The vLLM Wizard exposes a cohesive internal API centered around a planning pipeline that integrates hardware detection, model metadata extraction, memory feasibility, recommendations, and artifact rendering. Its schema-driven design ensures strong validation and clear contracts. The CLI provides convenient entrypoints, while the underlying modules enable programmatic integration and customization.
436
+
437
+ [No sources needed since this section summarizes without analyzing specific files]
438
+
439
+ ## Appendices
440
+
441
+ ### API Index
+ - CLI
+   - detect(json_output: bool) -> None
+   - plan(...options...) -> None
+   - generate(...options...) -> None
+ - Planning
+   - run_plan(request: PlanRequest) -> PlanResponse
+   - compute_weights_memory(params_b, dtype, quantization) -> int
+   - compute_kv_cache_memory(metadata, context_len, concurrency, kv_dtype, dtype, fragmentation_factor) -> int
+   - compute_overhead(vram_total_bytes, tp_size, fixed_overhead_gb) -> int
+   - compute_feasibility(...) -> FeasibilityReport
+   - compute_max_concurrency_at_context(...)
+   - compute_max_context_at_concurrency(...)
+   - estimate_performance(gpu_name, params_b, tp_size, context_len, prompt_tokens, quantization, interconnect, num_gpus) -> PerfEstimate
+   - generate_recommendations(request, metadata, gpus, vram_total_bytes) -> VLLMConfig
+ - Hardware
+   - detect_gpus() -> list[GPUInfo]
+   - recommend_tensor_parallel(gpus: list[GPUInfo]) -> int
+   - get_gpu_by_name(name: str) -> Optional[GPUInfo]
+ - Models
+   - load_model_metadata(model_id_or_path, revision, trust_remote_code, params_b) -> ModelMetadata
+ - Rendering
+   - render_serve_command(config: VLLMConfig) -> str
+   - render_docker_command(config: VLLMConfig) -> str
+   - render_docker_compose(config: VLLMConfig) -> str
+   - render_k8s_values(config: VLLMConfig) -> str
+   - load_profile(path: Path) -> Profile
+   - save_profile(profile: Profile, path: Path) -> None
+   - profile_to_request(profile: Profile) -> PlanRequest
+   - request_to_profile(request: PlanRequest, emit: list[str]) -> Profile
+   - render_console_report(response: PlanResponse, console) -> None
+   - render_json(response: PlanResponse) -> str
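To give a feel for the kind of arithmetic behind the memory functions listed under Planning, here is a rough sketch of the commonly used estimates for weight and KV-cache memory. It is an assumption that the package follows these formulas; the function names, the `BYTES_PER_PARAM` table, and the parameter lists below are illustrative and are not the package's signatures. Quantization metadata such as scales and zero points is ignored.

```python
# Approximate storage cost per parameter for a few common dtypes.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_bytes(params_b: float, dtype: str = "fp16") -> int:
    """Estimate weight memory: parameter count (in billions) times bytes each."""
    return int(params_b * 1e9 * BYTES_PER_PARAM[dtype])


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    concurrency: int,
    bytes_per_elem: float = 2.0,
    fragmentation_factor: float = 1.0,
) -> int:
    """Estimate KV-cache memory for `concurrency` sequences of `context_len` tokens."""
    # Each token stores one key and one value vector per layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return int(per_token * context_len * concurrency * fragmentation_factor)


# A 7B-parameter model in fp16 needs about 14 GB for weights alone.
print(weights_bytes(7, "fp16"))  # 14000000000
# 32 layers, 32 KV heads, head_dim 128, one 4096-token sequence: 2 GiB.
print(kv_cache_bytes(32, 32, 128, context_len=4096, concurrency=1))  # 2147483648
```

A feasibility check then amounts to comparing the sum of weights, KV cache, and overhead against the usable VRAM.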
+
+ **Section sources**
+ - [cli.py](file://src/vllm_wizard/cli.py#L62-L381)
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L21-L172)
+ - [planning/memory.py](file://src/vllm_wizard/planning/memory.py#L31-L367)
+ - [planning/perf.py](file://src/vllm_wizard/planning/perf.py#L136-L220)
+ - [planning/recommend.py](file://src/vllm_wizard/planning/recommend.py#L167-L333)
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L10-L229)
+ - [models/metadata.py](file://src/vllm_wizard/models/metadata.py#L209-L255)
+ - [render/commands.py](file://src/vllm_wizard/render/commands.py#L6-L196)
+ - [render/profile.py](file://src/vllm_wizard/render/profile.py)
+ - [render/report.py](file://src/vllm_wizard/render/report.py)
+
+ ### Programmatic Usage Examples
+ - Build a PlanRequest programmatically and run planning:
+   - Populate ModelInput, HardwareInput, WorkloadInput, PolicyInput
+   - Call run_plan(request) to obtain PlanResponse
+   - Access feasibility, config, performance, and artifacts
+ - Generate artifacts to disk:
+   - Run the generate command with --output-dir and the desired --emit options
+   - Save and load profiles via save_profile/load_profile and profile_to_request/request_to_profile
+ - Integrate hardware detection:
+   - Call detect_gpus() and use recommend_tensor_parallel for TP sizing
+   - Fall back to get_gpu_by_name for manual VRAM estimation
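The hardware-detection steps above suggest a resolution order: use detected GPUs if any, otherwise fall back to a name lookup, otherwise fail. The sketch below illustrates that order with stand-in types so it runs without the package installed. `GPU`, `KNOWN_GPUS`, `lookup_gpu`, and `resolve_hardware` are hypothetical names, and the precedence between explicit and detected values is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GPU:
    """Hypothetical stand-in for the package's GPUInfo."""
    name: str
    vram_gb: float


# Illustrative lookup table; the package's own table may differ.
KNOWN_GPUS = {
    "a100-80gb": GPU("NVIDIA A100 80GB", 80.0),
    "rtx 4090": GPU("NVIDIA GeForce RTX 4090", 24.0),
}


def lookup_gpu(name: str) -> Optional[GPU]:
    """Case-insensitive lookup, standing in for get_gpu_by_name."""
    return KNOWN_GPUS.get(name.strip().lower())


def resolve_hardware(
    detected: list[GPU],
    gpu_name: Optional[str] = None,
    vram_gb: Optional[float] = None,
    count: int = 1,
) -> list[GPU]:
    """Pick GPUs from detection, explicit values, or the lookup table."""
    if detected:
        return list(detected)
    if gpu_name and vram_gb is not None:
        # Explicit name and VRAM need no lookup.
        return [GPU(gpu_name, vram_gb)] * count
    if gpu_name:
        known = lookup_gpu(gpu_name)
        if known is not None:
            return [known] * count
    raise ValueError("Could not resolve hardware; pass a GPU name and VRAM size")
```

Raising `ValueError` in the last branch mirrors the documented behaviour that planning fails when hardware cannot be resolved.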
+
+ **Section sources**
+ - [cli.py](file://src/vllm_wizard/cli.py#L155-L203)
+ - [cli.py](file://src/vllm_wizard/cli.py#L275-L380)
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L21-L135)
+ - [render/profile.py](file://src/vllm_wizard/render/profile.py)
+
+ ### Error Handling and Validation Mechanisms
+ - Pydantic validation enforces field constraints and types
+ - CLI wraps planning in try/except blocks and prints user-friendly messages
+ - Hardware detection handles missing binaries and timeouts gracefully
+ - Planning raises ValueError when hardware cannot be resolved
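The CLI pattern described above, catching expected errors and turning them into a message plus a non-zero exit status, might look roughly like this. It is a simplified sketch: `run_cli_step` is a hypothetical helper, the message wording is invented, and plain stderr output is used here in place of the colored messages the real CLI prints.

```python
import sys
from typing import Callable


def run_cli_step(step: Callable[[], None]) -> int:
    """Run one CLI step and map failures to a process exit code."""
    try:
        step()
    except (ValueError, FileNotFoundError) as exc:
        # Expected user-facing failures: bad input or a missing config.json.
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    except Exception as exc:
        # Anything else is still reported instead of dumping a traceback.
        print(f"Unexpected error: {exc}", file=sys.stderr)
        return 1
    return 0
```

A command entry point would then end with something like `sys.exit(run_cli_step(do_plan))`, where `do_plan` is whatever callable performs the work.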
+
+ **Section sources**
+ - [cli.py](file://src/vllm_wizard/cli.py#L204-L212)
+ - [cli.py](file://src/vllm_wizard/cli.py#L372-L380)
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L65-L71)
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L42-L45)
+
+ ### Type Annotations and Data Flow Diagrams
+ - All public functions include type hints and return types
+ - Data flows from CLI through planning to rendering and schemas
+
+ ```mermaid
+ flowchart TD
+ Start(["PlanRequest"]) --> Meta["load_model_metadata()"]
+ Meta --> HW["detect_gpus()/get_gpu_by_name()"]
+ HW --> Mem["compute_weights_memory()<br/>compute_kv_cache_memory()<br/>compute_overhead()"]
+ Mem --> Feas["compute_feasibility()"]
+ Feas --> Rec["generate_recommendations()"]
+ Rec --> Perf["estimate_performance()"]
+ Perf --> Art["render_*_command()"]
+ Art --> Out["PlanResponse"]
+ ```
+
+ **Diagram sources**
+ - [planning/planner.py](file://src/vllm_wizard/planning/planner.py#L21-L135)
+ - [models/metadata.py](file://src/vllm_wizard/models/metadata.py#L209-L255)
+ - [hardware/detect.py](file://src/vllm_wizard/hardware/detect.py#L10-L229)
+ - [planning/memory.py](file://src/vllm_wizard/planning/memory.py#L31-L271)
+ - [planning/recommend.py](file://src/vllm_wizard/planning/recommend.py#L167-L333)
+ - [planning/perf.py](file://src/vllm_wizard/planning/perf.py#L136-L220)
+ - [render/commands.py](file://src/vllm_wizard/render/commands.py#L6-L196)
+ - [schemas/outputs.py](file://src/vllm_wizard/schemas/outputs.py#L103-L118)
+
+ ### Extension Points and Plugin Interfaces
+ - Implement custom hardware backends by returning GPUInfo-compatible structures
+ - Add new renderers by implementing render_* functions
+ - Extend model metadata loaders for additional sources
+ - Customize recommendation logic by replacing generate_recommendations
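As an illustration of the renderer extension point, a new `render_*` function only needs to turn a configuration object into a string. The sketch below emits a small shell script; `ServeConfig` is a hypothetical stand-in for `VLLMConfig` (whose real fields are not shown in this document), and `render_shell_script` is an invented name. The `vllm serve` flags used are standard vLLM options.

```python
import shlex
from dataclasses import dataclass


@dataclass
class ServeConfig:
    """Hypothetical stand-in for the package's VLLMConfig."""
    model: str
    tensor_parallel_size: int = 1
    max_model_len: int = 4096
    gpu_memory_utilization: float = 0.9


def render_shell_script(config: ServeConfig) -> str:
    """Render a launch script for `vllm serve` from the given config."""
    args = [
        "vllm", "serve", config.model,
        "--tensor-parallel-size", str(config.tensor_parallel_size),
        "--max-model-len", str(config.max_model_len),
        "--gpu-memory-utilization", str(config.gpu_memory_utilization),
    ]
    # shlex.join quotes any argument that needs it, e.g. paths with spaces.
    return "#!/usr/bin/env bash\nset -euo pipefail\n" + shlex.join(args) + "\n"
```

Keeping renderers as pure functions from config to string makes them easy to test and to write to disk from a `generate`-style command.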
+
+ [No sources needed since this section provides general guidance]
+
+ ### API Versioning and Compatibility
+ - Package version is defined in __init__.py
+ - Profile schema includes profile_version for future migrations
+ - Schemas use Pydantic BaseModel for backward-compatible field additions where possible
+
+ **Section sources**
+ - [__init__.py](file://src/vllm_wizard/__version__)
+ - [schemas/profile.py](file://src/vllm_wizard/schemas/profile.py#L69-L74)
+
+ ### Migration Guides and Deprecation Notices
+ - No explicit deprecations are present in the current codebase
+ - When extending or modifying schemas, increment profile_version and maintain backward compatibility where feasible
+ - Prefer using request_to_profile and profile_to_request for serialization to minimize breaking changes
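One way to act on the `profile_version` guidance is a chain of small upgrade functions applied when a profile is loaded. The sketch below works on plain dictionaries and is purely illustrative: the version numbers, the `gpu` to `gpu_name` rename, and all function names are hypothetical and do not describe an actual schema change in the package.

```python
from copy import deepcopy

CURRENT_PROFILE_VERSION = 2  # hypothetical current schema version


def _v1_to_v2(data: dict) -> dict:
    """Hypothetical upgrade: v2 renames hardware.gpu to hardware.gpu_name."""
    out = deepcopy(data)  # never mutate the caller's dictionary
    hardware = out.setdefault("hardware", {})
    if "gpu" in hardware:
        hardware["gpu_name"] = hardware.pop("gpu")
    out["profile_version"] = 2
    return out


# Maps a version to the function that upgrades it by one step.
MIGRATIONS = {1: _v1_to_v2}


def migrate_profile(data: dict) -> dict:
    """Upgrade a raw profile dict step by step to the current version."""
    version = data.get("profile_version", 1)
    if version > CURRENT_PROFILE_VERSION:
        raise ValueError(f"Profile version {version} is newer than supported")
    while version < CURRENT_PROFILE_VERSION:
        data = MIGRATIONS[version](data)
        version = data["profile_version"]
    return data
```

Running such a migration before schema validation lets older profile files keep loading after a field is renamed.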
+
+ [No sources needed since this section provides general guidance]