packwise-skills 1.0.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (53)
  1. package/.cursorrules +23 -23
  2. package/CLAUDE.md +25 -25
  3. package/LICENSE +21 -0
  4. package/README.md +404 -295
  5. package/audit.md +224 -224
  6. package/bin/packwise.js +322 -155
  7. package/install.sh +123 -0
  8. package/package.json +32 -31
  9. package/skill.md +944 -719
  10. package/sub-skills/ai/local-llm.md +183 -183
  11. package/sub-skills/ai/python-ml.md +164 -164
  12. package/sub-skills/backend/go-server.md +184 -184
  13. package/sub-skills/backend/java-spring.md +241 -241
  14. package/sub-skills/backend/node-server.md +164 -164
  15. package/sub-skills/backend/php-laravel.md +175 -175
  16. package/sub-skills/backend/python-server.md +164 -164
  17. package/sub-skills/backend/rust-backend.md +118 -118
  18. package/sub-skills/cli/python-cli.md +236 -236
  19. package/sub-skills/cli/sdk-library.md +497 -497
  20. package/sub-skills/cloud/ci-cd-pipelines.md +350 -350
  21. package/sub-skills/cloud/docker.md +191 -191
  22. package/sub-skills/cloud/kubernetes.md +277 -277
  23. package/sub-skills/cloud/payment-integration.md +307 -307
  24. package/sub-skills/cross-platform/multiplatform.md +252 -252
  25. package/sub-skills/desktop/electron.md +783 -783
  26. package/sub-skills/desktop/game-dev.md +443 -443
  27. package/sub-skills/desktop/native-app.md +123 -123
  28. package/sub-skills/desktop/scenarios.md +443 -443
  29. package/sub-skills/desktop/smart-platforms.md +324 -324
  30. package/sub-skills/desktop/tauri.md +428 -428
  31. package/sub-skills/desktop/vr-ar.md +252 -252
  32. package/sub-skills/desktop/web-to-desktop.md +153 -153
  33. package/sub-skills/embedded/car-infotainment.md +129 -129
  34. package/sub-skills/embedded/esp32.md +184 -184
  35. package/sub-skills/embedded/ros.md +150 -150
  36. package/sub-skills/embedded/stm32.md +160 -160
  37. package/sub-skills/mobile/android.md +322 -322
  38. package/sub-skills/mobile/capacitor.md +232 -232
  39. package/sub-skills/mobile/flutter-mobile.md +138 -138
  40. package/sub-skills/mobile/harmonyos.md +150 -150
  41. package/sub-skills/mobile/ios.md +245 -245
  42. package/sub-skills/mobile/react-native.md +443 -443
  43. package/sub-skills/mobile/wearables.md +230 -230
  44. package/sub-skills/plugins/browser-extension.md +308 -308
  45. package/sub-skills/plugins/jetbrains-plugin.md +226 -226
  46. package/sub-skills/plugins/vscode-extension.md +204 -204
  47. package/sub-skills/security/security-tools.md +174 -174
  48. package/sub-skills/web/monorepo.md +274 -274
  49. package/sub-skills/web/pwa.md +220 -220
  50. package/sub-skills/web/serverless-edge.md +295 -295
  51. package/sub-skills/web/spa.md +266 -266
  52. package/sub-skills/web/ssr.md +228 -228
  53. package/sub-skills/web/wasm.md +243 -243
@@ -1,183 +1,183 @@
1
- # Local LLM Application Build Sub-Skill
2
-
3
- Package and deploy local large language model applications (offline AI, private deployment, edge inference).
4
-
5
- **Current versions**: Ollama 0.4+ / llama.cpp b4000+ / vLLM 0.6+ (2025-2026)
6
-
7
- ## When to Use
8
-
9
- - Offline AI assistant (no internet required)
10
- - Privacy-sensitive AI applications (enterprise internal)
11
- - Edge AI deployment (Jetson, Raspberry Pi, local servers)
12
- - Cost optimization (avoid API fees)
13
- - Custom fine-tuned model serving
14
-
15
- ## Tech Stack Comparison
16
-
17
- | Framework | Language | GPU Support | Best For | Setup Complexity |
18
- |-----------|---------|-------------|---------|-----------------|
19
- | Ollama | Go | CUDA/Metal/ROCm | Simplest local LLM runtime | Lowest |
20
- | llama.cpp | C++ | CUDA/Metal/Vulkan/ROCm | CPU inference, maximum control | Medium |
21
- | vLLM | Python | CUDA only | High-throughput GPU serving | Medium |
22
- | LM Studio | Desktop app | CUDA/Metal | GUI-based model management | Lowest |
23
- | text-generation-inference | Rust/Python | CUDA | Production GPU serving (Hugging Face) | High |
24
- | LocalAI | Go | CUDA/Metal | OpenAI-compatible local API | Low |
25
-
26
- ---
27
-
28
- ## Ollama (Recommended for Getting Started)
29
-
30
- ### Install & Run
31
-
32
- ```bash
33
- # Install
34
- curl -fsSL https://ollama.ai/install.sh | sh # Linux
35
- brew install ollama # macOS
36
- # Windows: download from ollama.ai
37
-
38
- # Run model
39
- ollama run llama3.1 # 8B (default)
40
- ollama run llama3.1:70b # 70B (needs ~40GB VRAM)
41
- ollama run codellama # Code-specific model
42
- ollama run mistral # Mistral 7B
43
- ollama run phi3 # Microsoft Phi-3
44
-
45
- # API call (OpenAI-compatible)
46
- curl http://localhost:11434/v1/chat/completions -d '{
47
- "model": "llama3.1",
48
- "messages": [{"role": "user", "content": "Hello!"}]
49
- }'
50
-
51
- # List installed models
52
- ollama list
53
-
54
- # Pull model without running
55
- ollama pull llama3.1:70b
56
- ```
57
-
58
- ### Docker Deployment
59
-
60
- ```yaml
61
- # docker-compose.yml
62
- services:
63
-   ollama:
64
-     image: ollama/ollama
65
-     ports: ["11434:11434"]
66
-     volumes: ["ollama:/root/.ollama"]
67
-     deploy:
68
-       resources:
69
-         reservations:
70
-           devices:
71
-             - driver: nvidia
72
-               count: all
73
-               capabilities: [gpu]
74
-   open-webui:
75
-     image: ghcr.io/open-webui/open-webui
76
-     ports: ["3000:8080"]
77
-     environment:
78
-       OLLAMA_BASE_URL: http://ollama:11434
79
-     depends_on: [ollama]
80
- volumes:
81
-   ollama:
82
- ```
83
-
84
- ---
85
-
86
- ## llama.cpp (Maximum Control)
87
-
88
- ### Build & Run
89
-
90
- ```bash
91
- # Clone and build
92
- git clone https://github.com/ggerganov/llama.cpp
93
- cd llama.cpp
94
- make -j$(nproc) # CPU only
95
- make -j$(nproc) GGML_CUDA=1 # NVIDIA GPU
96
- make -j$(nproc) GGML_METAL=1 # Apple Silicon
97
-
98
- # Run (-ngl 99 offloads all layers to the GPU; -c 4096 sets the context size)
99
- ./llama-server -m models/llama-3.1-8b-q4_k_m.gguf \
100
-   --host 0.0.0.0 --port 8080 \
101
-   -ngl 99 \
102
-   -c 4096
103
-
104
- # Quantize model
105
- ./llama-quantize input.gguf output-q4_k_m.gguf Q4_K_M
106
- ```
107
-
108
- ### GGUF Quantization Levels
109
-
110
- | Quant | Size (8B) | Quality | Speed | Use Case |
111
- |-------|----------|---------|-------|----------|
112
- | Q2_K | ~3 GB | Low | Fastest | Maximum compression |
113
- | Q4_K_M | ~5 GB | Good | Fast | **Recommended default** |
114
- | Q5_K_M | ~6 GB | Better | Good | Quality-sensitive |
115
- | Q6_K | ~7 GB | Great | Slower | Near-lossless |
116
- | Q8_0 | ~8 GB | Excellent | Slower | Minimal quality loss |
117
- | F16 | ~16 GB | Lossless | Slowest | Research/evaluation |
118
-
119
- ---
120
-
121
- ## vLLM (High-Throughput GPU Serving)
122
-
123
- ```bash
124
- pip install vllm
125
-
126
- # Serve model (OpenAI-compatible API); --tensor-parallel-size is the number of GPUs
127
- vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
128
-   --host 0.0.0.0 --port 8000 \
129
-   --tensor-parallel-size 1 \
130
-   --max-model-len 4096 \
131
-   --quantization awq # Optional; requires an AWQ-quantized checkpoint
132
- ```
133
-
134
- ```python
135
- # Use as Python library
136
- from vllm import LLM, SamplingParams
137
-
138
- llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
139
- params = SamplingParams(temperature=0.7, max_tokens=512)
140
- outputs = llm.generate(["Hello, how are you?"], params)
141
- print(outputs[0].outputs[0].text)
142
- ```
143
-
144
- ---
145
-
146
- ## Memory Requirements (Approximate)
147
-
148
- | Model | Q4_K_M (Recommended) | FP16 (Full) | Min GPU VRAM |
149
- |-------|---------------------|-------------|-------------|
150
- | 1B | ~1 GB | ~2 GB | 2 GB |
151
- | 3B | ~2 GB | ~6 GB | 4 GB |
152
- | 7B/8B | ~5 GB | ~14 GB | 8 GB |
153
- | 13B | ~8 GB | ~26 GB | 16 GB |
154
- | 34B | ~20 GB | ~68 GB | 40 GB |
155
- | 70B | ~40 GB | ~140 GB | 2×40 GB |
156
-
157
- ---
158
-
159
- ## Hardware Recommendations
160
-
161
- | Use Case | Minimum | Recommended |
162
- |----------|---------|------------|
163
- | Chat (7B) | 16GB RAM (CPU) | 8GB VRAM GPU |
164
- | Chat (13B+) | 32GB RAM or 16GB VRAM | 24GB VRAM (RTX 4090) |
165
- | Code (7B) | 16GB RAM | 12GB VRAM |
166
- | Production serving | 24GB VRAM | A100 40GB / H100 |
167
- | Edge (Raspberry Pi) | 8GB RAM (very slow) | Jetson Orin 16GB |
168
- | Apple Silicon Mac | 16GB unified | 32GB+ unified (M2/M3 Pro/Max) |
169
-
170
- ---
171
-
172
- ## Common Pitfalls
173
-
174
- | Issue | Fix |
175
- |-------|-----|
176
- | Slow model download | Use Hugging Face mirror (`HF_ENDPOINT`); use `ollama pull` for Ollama |
177
- | GPU not detected | Check `nvidia-smi`; install NVIDIA Container Toolkit for Docker |
178
- | Out of memory | Use smaller quantization (Q4_K_M); reduce context length; use CPU offload |
179
- | Slow response | Use GPU; reduce model size; enable flash attention (llama.cpp: `--flash-attn`; Ollama: `OLLAMA_FLASH_ATTENTION=1`) |
180
- | CORS errors when calling API | Ollama: set `OLLAMA_ORIGINS=*`; llama.cpp: add `--cors *` |
181
- | Model hallucination | Use system prompt; lower temperature; use RAG for factual accuracy |
182
- | Docker GPU not working | Install `nvidia-container-toolkit`; restart Docker daemon |
183
- | Apple Silicon not using GPU | Use Metal-enabled build; Ollama uses Metal by default on macOS |
1
+ # Local LLM Application Build Sub-Skill
2
+
3
+ Package and deploy local large language model applications (offline AI, private deployment, edge inference).
4
+
5
+ **Current versions**: Ollama 0.4+ / llama.cpp b4000+ / vLLM 0.6+ (2025-2026)
6
+
7
+ ## When to Use
8
+
9
+ - Offline AI assistant (no internet required)
10
+ - Privacy-sensitive AI applications (enterprise internal)
11
+ - Edge AI deployment (Jetson, Raspberry Pi, local servers)
12
+ - Cost optimization (avoid API fees)
13
+ - Custom fine-tuned model serving
14
+
15
+ ## Tech Stack Comparison
16
+
17
+ | Framework | Language | GPU Support | Best For | Setup Complexity |
18
+ |-----------|---------|-------------|---------|-----------------|
19
+ | Ollama | Go | CUDA/Metal/ROCm | Simplest local LLM runtime | Lowest |
20
+ | llama.cpp | C++ | CUDA/Metal/Vulkan/ROCm | CPU inference, maximum control | Medium |
21
+ | vLLM | Python | CUDA only | High-throughput GPU serving | Medium |
22
+ | LM Studio | Desktop app | CUDA/Metal | GUI-based model management | Lowest |
23
+ | text-generation-inference | Rust/Python | CUDA | Production GPU serving (Hugging Face) | High |
24
+ | LocalAI | Go | CUDA/Metal | OpenAI-compatible local API | Low |
25
+
26
+ ---
27
+
28
+ ## Ollama (Recommended for Getting Started)
29
+
30
+ ### Install & Run
31
+
32
+ ```bash
33
+ # Install
34
+ curl -fsSL https://ollama.ai/install.sh | sh # Linux
35
+ brew install ollama # macOS
36
+ # Windows: download from ollama.ai
37
+
38
+ # Run model
39
+ ollama run llama3.1 # 8B (default)
40
+ ollama run llama3.1:70b # 70B (needs ~40GB VRAM)
41
+ ollama run codellama # Code-specific model
42
+ ollama run mistral # Mistral 7B
43
+ ollama run phi3 # Microsoft Phi-3
44
+
45
+ # API call (OpenAI-compatible)
46
+ curl http://localhost:11434/v1/chat/completions -d '{
47
+ "model": "llama3.1",
48
+ "messages": [{"role": "user", "content": "Hello!"}]
49
+ }'
50
+
51
+ # List installed models
52
+ ollama list
53
+
54
+ # Pull model without running
55
+ ollama pull llama3.1:70b
56
+ ```
57
+
58
+ ### Docker Deployment
59
+
60
+ ```yaml
61
+ # docker-compose.yml
62
+ services:
63
+   ollama:
64
+     image: ollama/ollama
65
+     ports: ["11434:11434"]
66
+     volumes: ["ollama:/root/.ollama"]
67
+     deploy:
68
+       resources:
69
+         reservations:
70
+           devices:
71
+             - driver: nvidia
72
+               count: all
73
+               capabilities: [gpu]
74
+   open-webui:
75
+     image: ghcr.io/open-webui/open-webui
76
+     ports: ["3000:8080"]
77
+     environment:
78
+       OLLAMA_BASE_URL: http://ollama:11434
79
+     depends_on: [ollama]
80
+ volumes:
81
+   ollama:
82
+ ```
83
+
84
+ ---
85
+
86
+ ## llama.cpp (Maximum Control)
87
+
88
+ ### Build & Run
89
+
90
+ ```bash
91
+ # Clone and build
92
+ git clone https://github.com/ggerganov/llama.cpp
93
+ cd llama.cpp
94
+ make -j$(nproc) # CPU only
95
+ make -j$(nproc) GGML_CUDA=1 # NVIDIA GPU
96
+ make -j$(nproc) GGML_METAL=1 # Apple Silicon
97
+
98
+ # Run (-ngl 99 offloads all layers to the GPU; -c 4096 sets the context size)
99
+ ./llama-server -m models/llama-3.1-8b-q4_k_m.gguf \
100
+   --host 0.0.0.0 --port 8080 \
101
+   -ngl 99 \
102
+   -c 4096
103
+
104
+ # Quantize model
105
+ ./llama-quantize input.gguf output-q4_k_m.gguf Q4_K_M
106
+ ```
107
+
108
+ ### GGUF Quantization Levels
109
+
110
+ | Quant | Size (8B) | Quality | Speed | Use Case |
111
+ |-------|----------|---------|-------|----------|
112
+ | Q2_K | ~3 GB | Low | Fastest | Maximum compression |
113
+ | Q4_K_M | ~5 GB | Good | Fast | **Recommended default** |
114
+ | Q5_K_M | ~6 GB | Better | Good | Quality-sensitive |
115
+ | Q6_K | ~7 GB | Great | Slower | Near-lossless |
116
+ | Q8_0 | ~8 GB | Excellent | Slower | Minimal quality loss |
117
+ | F16 | ~16 GB | Lossless | Slowest | Research/evaluation |
118
+
119
+ ---
120
+
121
+ ## vLLM (High-Throughput GPU Serving)
122
+
123
+ ```bash
124
+ pip install vllm
125
+
126
+ # Serve model (OpenAI-compatible API); --tensor-parallel-size is the number of GPUs
127
+ vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
128
+   --host 0.0.0.0 --port 8000 \
129
+   --tensor-parallel-size 1 \
130
+   --max-model-len 4096 \
131
+   --quantization awq # Optional; requires an AWQ-quantized checkpoint
132
+ ```
133
+
134
+ ```python
135
+ # Use as Python library
136
+ from vllm import LLM, SamplingParams
137
+
138
+ llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
139
+ params = SamplingParams(temperature=0.7, max_tokens=512)
140
+ outputs = llm.generate(["Hello, how are you?"], params)
141
+ print(outputs[0].outputs[0].text)
142
+ ```
143
+
144
+ ---
145
+
146
+ ## Memory Requirements (Approximate)
147
+
148
+ | Model | Q4_K_M (Recommended) | FP16 (Full) | Min GPU VRAM |
149
+ |-------|---------------------|-------------|-------------|
150
+ | 1B | ~1 GB | ~2 GB | 2 GB |
151
+ | 3B | ~2 GB | ~6 GB | 4 GB |
152
+ | 7B/8B | ~5 GB | ~14 GB | 8 GB |
153
+ | 13B | ~8 GB | ~26 GB | 16 GB |
154
+ | 34B | ~20 GB | ~68 GB | 40 GB |
155
+ | 70B | ~40 GB | ~140 GB | 2×40 GB |
156
+
157
+ ---
158
+
159
+ ## Hardware Recommendations
160
+
161
+ | Use Case | Minimum | Recommended |
162
+ |----------|---------|------------|
163
+ | Chat (7B) | 16GB RAM (CPU) | 8GB VRAM GPU |
164
+ | Chat (13B+) | 32GB RAM or 16GB VRAM | 24GB VRAM (RTX 4090) |
165
+ | Code (7B) | 16GB RAM | 12GB VRAM |
166
+ | Production serving | 24GB VRAM | A100 40GB / H100 |
167
+ | Edge (Raspberry Pi) | 8GB RAM (very slow) | Jetson Orin 16GB |
168
+ | Apple Silicon Mac | 16GB unified | 32GB+ unified (M2/M3 Pro/Max) |
169
+
170
+ ---
171
+
172
+ ## Common Pitfalls
173
+
174
+ | Issue | Fix |
175
+ |-------|-----|
176
+ | Slow model download | Use Hugging Face mirror (`HF_ENDPOINT`); use `ollama pull` for Ollama |
177
+ | GPU not detected | Check `nvidia-smi`; install NVIDIA Container Toolkit for Docker |
178
+ | Out of memory | Use smaller quantization (Q4_K_M); reduce context length; use CPU offload |
179
+ | Slow response | Use GPU; reduce model size; enable flash attention (llama.cpp: `--flash-attn`; Ollama: `OLLAMA_FLASH_ATTENTION=1`) |
180
+ | CORS errors when calling API | Ollama: set `OLLAMA_ORIGINS=*`; llama.cpp: add `--cors *` |
181
+ | Model hallucination | Use system prompt; lower temperature; use RAG for factual accuracy |
182
+ | Docker GPU not working | Install `nvidia-container-toolkit`; restart Docker daemon |
183
+ | Apple Silicon not using GPU | Use Metal-enabled build; Ollama uses Metal by default on macOS |
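
Note on the "Memory Requirements (Approximate)" table in `local-llm.md`: its weight sizes can be roughly reproduced from parameter count and bits per weight. The following is a minimal sketch, not part of the package. It assumes Q4_K_M averages about 4.8 bits per weight (an approximation, not a figure from the file), it ignores KV cache and runtime overhead, and the name `estimate_weight_gb` is hypothetical.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# Assumption: Q4_K_M averages ~4.8 bits per weight; F16 uses 16 bits.
# KV cache and runtime overhead are NOT included, so real usage is higher.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "F16": 16.0}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    # (params_billions * 1e9 weights) * (bits / 8 bytes) / (1e9 bytes per GB)
    return params_billions * bits / 8


if __name__ == "__main__":
    for size in (1, 3, 8, 13, 34, 70):
        q4 = estimate_weight_gb(size, "Q4_K_M")
        f16 = estimate_weight_gb(size, "F16")
        print(f"{size}B: Q4_K_M ~{q4:.1f} GB, F16 ~{f16:.0f} GB")
```

Under these assumptions an 8B model comes out near 4.8 GB at Q4_K_M and a 70B model near 42 GB, in line with the table's "~5 GB" and "~40 GB" entries. The "Min GPU VRAM" column is larger because context and runtime buffers need headroom on top of the weights.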