@yeongjaeyou/claude-code-config 0.22.0 → 0.24.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/agents/web-researcher.md +5 -5
- package/.claude/commands/gh/decompose-issue.md +40 -5
- package/.claude/guidelines/ml-guidelines.md +15 -0
- package/.claude/hooks/auto-format-python.py +53 -0
- package/.claude/settings.json +24 -0
- package/.claude/skills/gpu-parallel-pipeline/SKILL.md +99 -0
- package/.claude/skills/gpu-parallel-pipeline/references/architecture.md +194 -0
- package/.claude/skills/gpu-parallel-pipeline/references/single-gpu-patterns.md +225 -0
- package/.claude/skills/gpu-parallel-pipeline/references/troubleshooting.md +247 -0
- package/.claude/skills/gpu-parallel-pipeline/scripts/check_gpu_memory.py +80 -0
- package/.claude/skills/interview-methodology/SKILL.md +10 -1
- package/package.json +1 -1
- package/.mcp.json +0 -35
package/.claude/agents/web-researcher.md
CHANGED

@@ -60,12 +60,12 @@ findings:
   - title: "finding title"
     summary: "summary"
     url: "https://..."
-    date: "
+    date: "2026-01-15"
     reliability: high | medium | low
     sources:
       - url: "https://..."
         title: "source title"
-        date: "
+        date: "2026-01-15"
         platform: "github"
 confidence: 0.8 # 0.0-1.0 (information sufficiency)
 ```

@@ -80,7 +80,7 @@ conflicts:
   - "Reddit recommends A, but official docs recommend B"
 suggested_followups:
   - platform: "arxiv"
-    query: "benchmark comparison
+    query: "benchmark comparison 2026"
     reason: "Need performance data"
 ```

@@ -345,8 +345,8 @@ When conflicts remain unresolved:

 | Source | Position | Date | Reliability |
 |--------|----------|------|-------------|
-| [Official Docs](URL) | Recommends approach A |
-| [Reddit](URL) | Approach B more practical |
+| [Official Docs](URL) | Recommends approach A | 2026-01 | HIGH |
+| [Reddit](URL) | Approach B more practical | 2025-06 | MEDIUM |

 **Analysis**: Official docs recommend A, but community finds B more effective in practice. Choose based on your situation.
 ```

package/.claude/commands/gh/decompose-issue.md
CHANGED

@@ -19,7 +19,7 @@ Break down large work items into manageable, independent issues. Follow project
    - If non-code work (docs/infra) → Inform: "TDD not required. (Reason: Non-code work)"
    - If TDD selected: Add `<!-- TDD: enabled -->` marker to each issue body
 4. Analyze work: Understand core requirements and objectives
-5. Decompose work: Split major tasks into
+5. Decompose work: Split major tasks into **context-completable units** - each issue should be completable in a single Claude session without context switching. Group related features together rather than splitting by individual functions
 6. Analyze dependencies: Identify prerequisite tasks
 7. Suggest milestone name: Propose a milestone to group decomposed tasks
 8. Check related PRs (optional): Run `gh pr list --state closed --limit 20` for similar work references (skip if none)

@@ -33,6 +33,33 @@ Break down large work items into manageable, independent issues. Follow project
 - If project exists: Ask user via AskUserQuestion whether to add issues
   - If yes: Run `gh project item-add <project-number> --owner <owner> --url <issue-url>` for each issue

+## Issue Sizing Principle
+
+### Context-Completable Units
+Each issue should be designed to be **completable in a single Claude session**:
+
+- **Group related features** rather than splitting by individual functions
+- **Minimize context switching** - all necessary information should be within the issue
+- **Include implementation details** - specific enough that no external lookup is needed during execution
+
+### Sizing Guidelines
+
+| Good (Context-Completable) | Bad (Over-Fragmented) |
+|---------------------------|----------------------|
+| "Add user authentication with login/logout/session" | "Add login button", "Add logout button", "Add session handling" (3 separate issues) |
+| "Implement CRUD API for products" | "Add create endpoint", "Add read endpoint", "Add update endpoint", "Add delete endpoint" (4 separate issues) |
+| "Setup CI/CD pipeline with test and deploy stages" | "Add test stage", "Add deploy stage" (2 separate issues) |
+
+### Issue Content Depth
+Since issues are larger, content must be **more detailed**:
+
+1. **Implementation order** - numbered steps for execution sequence
+2. **File-by-file changes** - specific modifications per file
+3. **Code snippets** - key patterns or structures to implement
+4. **Edge cases** - known gotchas or considerations
+
+---
+
 ## Milestone Description Guidelines

 Milestone description must include:

@@ -59,12 +86,20 @@ Examples (vary by project, for reference only):

 **Purpose**: [Why this is needed]

-**
-
-
+**Implementation Steps** (in order):
+1. [ ] Step 1 - description with specific details
+2. [ ] Step 2 - description with specific details
+3. [ ] Step 3 - description with specific details

 **Files to modify**:
-- `path/filename` -
+- `path/filename` - Specific change (add/modify/remove what)
+- `path/filename2` - Specific change with code pattern if needed
+
+**Key Implementation Details**:
+```
+// Include code snippets, patterns, or structures when helpful
+// This reduces need for external lookup during execution
+```

 **Completion criteria**:
 - [ ] Implementation complete (all tasks checked)

package/.claude/guidelines/ml-guidelines.md
CHANGED

@@ -47,6 +47,21 @@ img = annotator.annotate(img, detections) # BGR colors on RGB image -> inverted
 plt.imshow(img)
 ```

+### Exception: Ultralytics YOLO
+
+Ultralytics YOLO handles BGR to RGB conversion internally. Do NOT manually convert before passing to YOLO.
+
+```python
+# Correct: Pass BGR directly to YOLO (handles conversion internally)
+img = cv2.imread(path)  # BGR
+results = yolo_model(img)  # YOLO converts BGR->RGB internally
+
+# Wrong: Manual conversion causes double conversion
+img = cv2.imread(path)
+img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Unnecessary
+results = yolo_model(img)  # YOLO converts again -> RGB->BGR (inverted)
+```
+
 ## Ultralytics WandB Integration

 Ultralytics YOLO has WandB disabled by default.

package/.claude/hooks/auto-format-python.py
ADDED

@@ -0,0 +1,53 @@
+#!/usr/bin/env python3
+"""Auto-fix Python files with ruff after Write/Edit operations."""
+
+import json
+import sys
+import subprocess
+import os
+
+
+def main():
+    """Main hook handler."""
+    try:
+        data = json.load(sys.stdin)
+    except json.JSONDecodeError:
+        sys.exit(0)
+
+    file_path = data.get("tool_input", {}).get("file_path")
+    if not file_path:
+        sys.exit(0)
+
+    # Only process Python files
+    if not file_path.endswith((".py", ".pyi")):
+        sys.exit(0)
+
+    # Skip if file doesn't exist
+    if not os.path.isfile(file_path):
+        sys.exit(0)
+
+    project_dir = os.environ.get("CLAUDE_PROJECT_DIR", os.getcwd())
+
+    try:
+        # Lint fix
+        subprocess.run(
+            ["uv", "run", "ruff", "check", "--fix", file_path],
+            cwd=project_dir,
+            capture_output=True,
+            timeout=30,
+        )
+        # Format
+        subprocess.run(
+            ["uv", "run", "ruff", "format", file_path],
+            cwd=project_dir,
+            capture_output=True,
+            timeout=30,
+        )
+    except (subprocess.TimeoutExpired, FileNotFoundError):
+        pass
+
+    sys.exit(0)
+
+
+if __name__ == "__main__":
+    main()
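The hook reads a JSON payload on stdin and only looks at `tool_input.file_path`, so it can be exercised locally without Claude Code. A minimal sketch, assuming the hook is installed at the path above and a hypothetical `example.py` exists in the project:

```python
# Feed the hook a hand-built PostToolUse-style payload; only
# tool_input.file_path is read by the script above.
import json
import subprocess

payload = json.dumps({"tool_input": {"file_path": "example.py"}})
subprocess.run(
    ["python3", ".claude/hooks/auto-format-python.py"],
    input=payload.encode(),
    check=True,  # the hook always exits 0, even when ruff is unavailable
)
```
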
package/.claude/settings.json
CHANGED

@@ -10,6 +10,30 @@
           }
         ]
       }
+    ],
+    "PreToolUse": [
+      {
+        "matcher": "Edit|Write",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "python3 -c \"import json,sys; p=json.load(sys.stdin).get('tool_input',{}).get('file_path',''); blocked=['.env','.env.local','package-lock.json','.git/','secrets']; sys.exit(2 if any(b in p for b in blocked) else 0)\"",
+            "timeout": 5
+          }
+        ]
+      }
+    ],
+    "PostToolUse": [
+      {
+        "matcher": "Edit|Write",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "python3 .claude/hooks/auto-format-python.py",
+            "timeout": 60
+          }
+        ]
+      }
     ]
   }
 }

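For readability, the inline PreToolUse command above expands to roughly the following standalone script (a sketch only; the package ships the compact `python3 -c` form). It exits 2 when the target path matches a blocked pattern and 0 otherwise:

```python
# Readable expansion of the inline PreToolUse check: refuse edits to
# sensitive paths by exiting with status 2, allow everything else with 0.
import json
import sys

BLOCKED = [".env", ".env.local", "package-lock.json", ".git/", "secrets"]

def main() -> None:
    path = json.load(sys.stdin).get("tool_input", {}).get("file_path", "")
    sys.exit(2 if any(blocked in path for blocked in BLOCKED) else 0)

if __name__ == "__main__":
    main()
```
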
package/.claude/skills/gpu-parallel-pipeline/SKILL.md
ADDED

@@ -0,0 +1,99 @@
+---
+name: gpu-parallel-pipeline
+description: Design and implement PyTorch GPU parallel processing pipelines for maximum throughput. Use when scaling workloads across multiple GPUs (ProcessPool, CUDA_VISIBLE_DEVICES isolation), optimizing single GPU utilization (CUDA Streams, async inference, model batching), or building I/O + compute pipelines (ThreadPool for loading, batch inference). Triggers on "multi-GPU", "GPU parallel", "batch inference", "CUDA isolation", "GPU utilization", "ProcessPool GPU", "PyTorch multi-GPU".
+---
+
+# GPU Parallel Pipeline
+
+## Overview
+
+This skill provides patterns for maximizing GPU throughput in data processing pipelines.
+
+**Three core patterns:**
+1. **Multi-GPU Distribution** - ProcessPool with GPU isolation via CUDA_VISIBLE_DEVICES
+2. **Single GPU Optimization** - CUDA Streams, async inference, model batching
+3. **I/O + Compute Pipeline** - ThreadPool for I/O parallelization + batch inference
+
+## Quick Reference
+
+| Pattern | Use Case | Speedup |
+|---------|----------|---------|
+| Multi-GPU ProcessPool | Large dataset, multiple GPUs | ~N x (N = GPU count) |
+| ThreadPool I/O + Batch | I/O bottleneck (image loading) | 2-5x |
+| CUDA Streams | Multiple models on single GPU | 1.5-3x |
+
+## Multi-GPU Architecture
+
+```
+Main Process (Coordinator)
+    |
+    +-- GPU 0: ProcessPool Worker (CUDA_VISIBLE_DEVICES=0)
+    |     +-- ThreadPool (I/O)
+    |     +-- Model batch inference
+    |
+    +-- GPU 1: ProcessPool Worker (CUDA_VISIBLE_DEVICES=1)
+    |     +-- ThreadPool (I/O)
+    |     +-- Model batch inference
+    |
+    +-- GPU N: ...
+```
+
+### Key Implementation Steps
+
+1. **Worker initialization with GPU isolation**
+```python
+def _worker_init_with_gpu(gpu_id: int) -> None:
+    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
+    # Initialize model here (once per worker)
+    global _model
+    _model = load_model()
+```
+
+2. **Spawn context (not fork)**
+```python
+ctx = mp.get_context("spawn")  # Required for CUDA
+with ProcessPoolExecutor(max_workers=n_gpus, mp_context=ctx) as executor:
+    ...
+```
+
+3. **Chunk distribution**
+```python
+chunk_size = (n_total + n_gpus - 1) // n_gpus
+chunks = [records[i*chunk_size:(i+1)*chunk_size] for i in range(n_gpus)]
+```
+
+## I/O + Compute Pipeline
+
+Separate I/O (disk read) from compute (GPU inference) using ThreadPool:
+
+```python
+def _load_images_parallel(paths: list[str], max_workers: int = 8) -> dict:
+    with ThreadPoolExecutor(max_workers=max_workers) as executor:
+        futures = {executor.submit(cv2.imread, p): p for p in paths}
+        return {futures[f]: f.result() for f in as_completed(futures)}
+
+def process_batch_hybrid(batch: list[dict]) -> list[dict]:
+    # 1. ThreadPool I/O
+    images = _load_images_parallel([r["path"] for r in batch])
+    # 2. GPU batch inference
+    results = model.predict_batch(list(images.values()))
+    return results
+```
+
+## Detailed References
+
+- **[architecture.md](references/architecture.md)**: Multi-GPU ProcessPool design, worker lifecycle, error handling
+- **[single-gpu-patterns.md](references/single-gpu-patterns.md)**: CUDA Streams, async inference, model parallelism
+- **[troubleshooting.md](references/troubleshooting.md)**: spawn vs fork, OOM, CUDA_VISIBLE_DEVICES issues
+
+## Memory Planning
+
+Before implementation, check GPU memory:
+```bash
+python scripts/check_gpu_memory.py
+```
+
+**Rule of thumb:**
+- Workers per GPU = GPU_Memory / Model_Memory
+- Example: 24GB GPU, 5GB model = 4 workers/GPU max
+- Leave 2-3GB headroom for CUDA overhead

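The rule of thumb above can be expressed directly; a small sketch (the 2.5 GB headroom value is an assumption within the stated 2-3 GB range):

```python
# Workers per GPU = (GPU memory - CUDA headroom) / model memory
def workers_per_gpu(gpu_memory_gb: float, model_memory_gb: float, headroom_gb: float = 2.5) -> int:
    usable_gb = gpu_memory_gb - headroom_gb
    return max(1, int(usable_gb // model_memory_gb))

print(workers_per_gpu(24, 5))  # 4, matching the 24GB GPU / 5GB model example
```
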
package/.claude/skills/gpu-parallel-pipeline/references/architecture.md
ADDED

@@ -0,0 +1,194 @@
+# Multi-GPU Architecture
+
+## Table of Contents
+- [ProcessPool with GPU Isolation](#processpool-with-gpu-isolation)
+- [Chunk Distribution Pattern](#chunk-distribution-pattern)
+- [Complete Multi-GPU Orchestration](#complete-multi-gpu-orchestration)
+- [Worker Lifecycle](#worker-lifecycle)
+- [Error Handling Strategy](#error-handling-strategy)
+- [Progress Tracking](#progress-tracking)
+- [Performance Considerations](#performance-considerations)
+
+## ProcessPool with GPU Isolation
+
+### Why ProcessPool over ThreadPool for GPU?
+
+Python's GIL doesn't affect GPU operations, but CUDA context initialization requires process isolation for reliable multi-GPU usage. Each process should own exactly one GPU.
+
+### CUDA_VISIBLE_DEVICES Isolation
+
+```python
+import os
+import multiprocessing as mp
+from concurrent.futures import ProcessPoolExecutor, as_completed
+
+# Process-local state
+_model = None
+_gpu_id = None
+
+def _worker_init_with_gpu(gpu_id: int) -> None:
+    """Initialize worker with GPU isolation.
+
+    Must be called at the start of each worker process.
+    CUDA_VISIBLE_DEVICES makes this GPU appear as device:0 to PyTorch/TF.
+    """
+    global _model, _gpu_id
+
+    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
+    _gpu_id = gpu_id
+
+    # Import ML framework AFTER setting CUDA_VISIBLE_DEVICES
+    import torch
+    _model = YourModel().cuda()  # Now on device:0 (the isolated GPU)
+```
+
+### Chunk Distribution Pattern
+
+```python
+def distribute_to_gpus(records: list, n_gpus: int) -> list[tuple]:
+    """Distribute records evenly across GPUs.
+
+    Returns list of (chunk, gpu_id, position) tuples.
+    """
+    if n_gpus < 1:
+        raise ValueError(f"n_gpus must be >= 1, got {n_gpus}")
+
+    n_total = len(records)
+    chunk_size = (n_total + n_gpus - 1) // n_gpus  # ceiling division
+
+    chunks = []
+    for i in range(n_gpus):
+        start = i * chunk_size
+        end = min(start + chunk_size, n_total)
+        if start < n_total:
+            chunks.append((records[start:end], i, i))  # (data, gpu_id, tqdm_position)
+
+    return chunks
+```
+
+### Complete Multi-GPU Orchestration
+
+```python
+def run_multi_gpu(
+    records: list[dict],
+    n_gpus: int = 4,
+    batch_size: int = 128,
+) -> list[dict]:
+    """Orchestrate multi-GPU parallel processing.
+
+    Args:
+        records: Data records to process
+        n_gpus: Number of GPUs to use
+        batch_size: Batch size per GPU
+
+    Returns:
+        Processed records with results
+    """
+    if not records:
+        return []
+
+    # Distribute data
+    chunks = distribute_to_gpus(records, n_gpus)
+    print(f"Distributing {len(records):,} items across {len(chunks)} GPUs")
+
+    # CRITICAL: Use spawn context for CUDA
+    ctx = mp.get_context("spawn")
+
+    # Track GPU assignments for error recovery
+    gpu_to_chunk = {gpu_id: chunk for chunk, gpu_id, _ in chunks}
+
+    all_results = []
+    failed_chunks = []
+
+    with ProcessPoolExecutor(max_workers=len(chunks), mp_context=ctx) as executor:
+        futures = {
+            executor.submit(_process_gpu_chunk, chunk, gpu_id, batch_size, pos): gpu_id
+            for chunk, gpu_id, pos in chunks
+        }
+
+        for future in as_completed(futures):
+            gpu_id = futures[future]
+            try:
+                results = future.result()
+                all_results.extend(results)
+            except Exception as e:
+                print(f"[ERROR] GPU {gpu_id} failed: {e}")
+                failed_chunks.append((gpu_id, gpu_to_chunk[gpu_id]))
+
+    # Handle failures gracefully (don't lose data)
+    if failed_chunks:
+        for gpu_id, chunk in failed_chunks:
+            for record in chunk:
+                record["_error"] = f"GPU {gpu_id} failed"
+                all_results.append(record)
+
+    return all_results
+```
+
+## Worker Lifecycle
+
+```
+spawn context creates new process
+        |
+        v
+_worker_init_with_gpu(gpu_id)
+  - Set CUDA_VISIBLE_DEVICES
+  - Import ML framework
+  - Load model to GPU
+        |
+        v
+Process batches in loop
+        |
+        v
+ProcessPool cleanup (model freed)
+```
+
+### Error Handling Strategy
+
+1. **Per-GPU failure isolation**: One GPU failure shouldn't crash others
+2. **Data preservation**: Failed chunks get marked, not dropped
+3. **Graceful degradation**: Continue with remaining GPUs
+
+```python
+# Track failures
+failed_chunks: list[tuple[int, list]] = []
+
+try:
+    results = future.result(timeout=300)  # 5-min timeout
+except Exception as e:
+    failed_chunks.append((gpu_id, original_chunk))
+
+# After all futures complete
+if failed_chunks:
+    print(f"[WARN] {len(failed_chunks)} GPU(s) failed")
+    # Add failed records with error markers
+```
+
+## Progress Tracking
+
+Use tqdm with position parameter for multi-bar display:
+
+```python
+from tqdm import tqdm
+
+def _process_gpu_chunk(records, gpu_id, batch_size, position):
+    _worker_init_with_gpu(gpu_id)
+
+    batches = [records[i:i+batch_size] for i in range(0, len(records), batch_size)]
+    results = []
+
+    for batch in tqdm(batches, desc=f"GPU {gpu_id}", position=position, leave=False):
+        batch_results = process_batch(batch)
+        results.extend(batch_results)
+
+    return results
+```
+
+## Performance Considerations
+
+| Factor | Recommendation |
+|--------|---------------|
+| Batch size | Start with 64-128, tune based on GPU memory |
+| Workers per GPU | Usually 1 for large models, 2-4 for small models |
+| I/O workers | 4-8 ThreadPool workers per GPU worker |
+| Chunk size | Balanced across GPUs (ceiling division) |

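A hypothetical end-to-end call of `run_multi_gpu` as defined above; the record contents and paths are illustrative, and `_process_gpu_chunk` plus the worker-side model loading must exist as shown earlier in the file:

```python
# Build example records, fan them out over 4 GPUs, then count error-marked rows.
records = [{"path": f"/data/images/{i:06d}.jpg"} for i in range(100_000)]

results = run_multi_gpu(records, n_gpus=4, batch_size=128)

failed = [r for r in results if "_error" in r]
print(f"{len(results):,} records returned, {len(failed):,} marked as failed")
```
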
package/.claude/skills/gpu-parallel-pipeline/references/single-gpu-patterns.md
ADDED

@@ -0,0 +1,225 @@
+# Single GPU Optimization Patterns
+
+## Table of Contents
+- [CUDA Streams for Concurrent Operations](#cuda-streams-for-concurrent-operations)
+- [Async Inference Pattern](#async-inference-pattern)
+- [Model Batching for Multiple Small Models](#model-batching-for-multiple-small-models)
+- [Dynamic Batching](#dynamic-batching)
+- [Memory Optimization](#memory-optimization)
+- [Throughput Measurement](#throughput-measurement)
+- [Best Practices Summary](#best-practices-summary)
+
+## CUDA Streams for Concurrent Operations
+
+CUDA streams allow overlapping data transfer and computation:
+
+```python
+import torch
+
+def process_with_streams(batches: list, model):
+    """Process batches using CUDA streams for overlap."""
+    streams = [torch.cuda.Stream() for _ in range(2)]
+    results = []
+
+    for i, batch in enumerate(batches):
+        stream = streams[i % 2]
+
+        with torch.cuda.stream(stream):
+            # Transfer to GPU
+            data = batch.cuda(non_blocking=True)
+            # Compute
+            output = model(data)
+            results.append(output)
+
+    # Synchronize all streams
+    torch.cuda.synchronize()
+    return results
+```
+
+## Async Inference Pattern
+
+For pipelines with I/O and compute stages:
+
+```python
+import asyncio
+from concurrent.futures import ThreadPoolExecutor
+
+class AsyncInferencePipeline:
+    def __init__(self, model, io_workers: int = 4):
+        self.model = model
+        self.io_executor = ThreadPoolExecutor(max_workers=io_workers)
+        self.batch_queue = asyncio.Queue(maxsize=2)  # Prefetch 2 batches
+
+    async def load_batch(self, paths: list[str]):
+        """Load batch in thread pool (non-blocking)."""
+        loop = asyncio.get_event_loop()
+        images = await loop.run_in_executor(
+            self.io_executor,
+            lambda: [load_image(p) for p in paths]
+        )
+        return torch.stack(images)
+
+    async def producer(self, all_paths: list[str], batch_size: int):
+        """Continuously load batches."""
+        for i in range(0, len(all_paths), batch_size):
+            batch_paths = all_paths[i:i+batch_size]
+            batch = await self.load_batch(batch_paths)
+            await self.batch_queue.put(batch)
+        await self.batch_queue.put(None)  # Signal end
+
+    async def consumer(self):
+        """Process batches as they arrive."""
+        results = []
+        while True:
+            batch = await self.batch_queue.get()
+            if batch is None:
+                break
+            with torch.no_grad():
+                output = self.model(batch.cuda())
+            results.append(output.cpu())
+        return results
+
+    async def run(self, paths: list[str], batch_size: int = 32):
+        producer_task = asyncio.create_task(self.producer(paths, batch_size))
+        results = await self.consumer()
+        await producer_task
+        return results
+```
+
+## Model Batching for Multiple Small Models
+
+Run multiple small models on single GPU:
+
+```python
+class MultiModelPipeline:
+    """Run multiple models efficiently on single GPU."""
+
+    def __init__(self, models: list):
+        self.models = [m.cuda() for m in models]
+        self.streams = [torch.cuda.Stream() for _ in models]
+
+    def forward_all(self, inputs: list[torch.Tensor]) -> list[torch.Tensor]:
+        """Run all models concurrently using streams."""
+        outputs = [None] * len(self.models)
+
+        # Launch all models
+        for i, (model, stream, x) in enumerate(zip(self.models, self.streams, inputs)):
+            with torch.cuda.stream(stream):
+                outputs[i] = model(x.cuda(non_blocking=True))
+
+        # Wait for all
+        torch.cuda.synchronize()
+        return outputs
+```
+
+## Dynamic Batching
+
+Maximize GPU utilization with variable batch sizes:
+
+```python
+class DynamicBatcher:
+    """Accumulate inputs until batch is full or timeout."""
+
+    def __init__(self, model, max_batch: int = 64, timeout_ms: int = 10):
+        self.model = model
+        self.max_batch = max_batch
+        self.timeout_ms = timeout_ms
+        self.pending = []
+        self.last_submit = time.time()
+
+    def add(self, item):
+        self.pending.append(item)
+
+        should_process = (
+            len(self.pending) >= self.max_batch or
+            (time.time() - self.last_submit) * 1000 > self.timeout_ms
+        )
+
+        if should_process and self.pending:
+            return self._process_batch()
+        return None
+
+    def _process_batch(self):
+        batch = torch.stack(self.pending[:self.max_batch])
+        self.pending = self.pending[self.max_batch:]
+        self.last_submit = time.time()
+
+        with torch.no_grad():
+            return self.model(batch.cuda())
+```
+
+## Memory Optimization
+
+### Gradient Checkpointing (Training)
+
+```python
+from torch.utils.checkpoint import checkpoint
+
+class EfficientModel(nn.Module):
+    def forward(self, x):
+        # Checkpoint intermediate layers to save memory
+        x = checkpoint(self.layer1, x)
+        x = checkpoint(self.layer2, x)
+        x = self.head(x)
+        return x
+```
+
+### Mixed Precision Inference
+
+```python
+with torch.cuda.amp.autocast():
+    output = model(input)  # Uses FP16 automatically
+```
+
+### Memory-Efficient Attention (for transformers)
+
+```python
+# Use torch.nn.functional.scaled_dot_product_attention (PyTorch 2.0+)
+# Automatically uses FlashAttention when available
+from torch.nn.functional import scaled_dot_product_attention
+
+attn_output = scaled_dot_product_attention(q, k, v, is_causal=True)
+```
+
+## Throughput Measurement
+
+```python
+import time
+import torch
+
+def benchmark_throughput(model, input_shape, n_iterations=100, warmup=10):
+    """Measure model throughput in samples/second."""
+    model.eval()
+    dummy_input = torch.randn(*input_shape).cuda()
+
+    # Warmup
+    for _ in range(warmup):
+        with torch.no_grad():
+            _ = model(dummy_input)
+
+    torch.cuda.synchronize()
+    start = time.perf_counter()
+
+    for _ in range(n_iterations):
+        with torch.no_grad():
+            _ = model(dummy_input)
+
+    torch.cuda.synchronize()
+    elapsed = time.perf_counter() - start
+
+    batch_size = input_shape[0]
+    throughput = (n_iterations * batch_size) / elapsed
+    print(f"Throughput: {throughput:.1f} samples/sec")
+    return throughput
+```
+
+## Best Practices Summary
+
+| Technique | When to Use | Memory Impact |
+|-----------|-------------|---------------|
+| CUDA Streams | Multiple independent ops | Minimal |
+| Async I/O | I/O bottleneck | Minimal |
+| Multi-model | Multiple small models | +1 model per stream |
+| Dynamic batching | Variable input rate | Configurable |
+| Mixed precision | Large models, Ampere+ GPU | -50% |
+| Checkpointing | Training large models | -60% (slower) |

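A hedged usage sketch for `benchmark_throughput` above; `torchvision`'s `resnet18` is only a stand-in model and a CUDA device is assumed to be available:

```python
# Benchmark a small stand-in model at batch size 32 on 224x224 RGB inputs.
import torch
from torchvision.models import resnet18

model = resnet18().cuda().eval()
samples_per_sec = benchmark_throughput(model, input_shape=(32, 3, 224, 224))
```
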
package/.claude/skills/gpu-parallel-pipeline/references/troubleshooting.md
ADDED

@@ -0,0 +1,247 @@
+# Troubleshooting Guide
+
+## Table of Contents
+- [spawn vs fork Context](#spawn-vs-fork-context)
+- [CUDA_VISIBLE_DEVICES Issues](#cuda_visible_devices-issues)
+- [GPU Memory OOM](#gpu-memory-oom)
+- [Pickling Errors](#pickling-errors)
+- [Process Hangs](#process-hangs)
+- [Debugging Checklist](#debugging-checklist)
+- [Quick Fixes](#quick-fixes)
+
+## spawn vs fork Context
+
+### Problem: Silent Failures with fork
+
+When using `fork` context with CUDA:
+- Worker processes inherit CUDA context from parent
+- Functions may fail to pickle correctly
+- Workers might return None silently instead of crashing
+
+### Symptom
+```
+# Processing completes in seconds instead of hours
+# All results are None
+# No error messages
+```
+
+### Solution: Always Use spawn
+
+```python
+import multiprocessing as mp
+
+# WRONG
+with ProcessPoolExecutor(max_workers=4) as executor:
+    ...
+
+# CORRECT
+ctx = mp.get_context("spawn")
+with ProcessPoolExecutor(max_workers=4, mp_context=ctx) as executor:
+    ...
+```
+
+### Why spawn works
+
+| Context | Behavior | CUDA Safe |
+|---------|----------|-----------|
+| fork | Copy parent process memory | No |
+| spawn | Start fresh process | Yes |
+| forkserver | Fork from server process | Partial |
+
+## CUDA_VISIBLE_DEVICES Issues
+
+### Problem: All Workers Use Same GPU
+
+Workers share parent's CUDA context if not isolated.
+
+### Solution: Set Early in Worker
+
+```python
+def _worker_init(gpu_id: int):
+    # MUST be first line - before any CUDA import
+    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
+
+    # NOW import PyTorch
+    import torch
+
+    # device:0 is now the isolated GPU
+    model = Model().to("cuda:0")
+```
+
+### Verification
+
+```python
+def _worker_init(gpu_id: int):
+    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
+    import torch
+
+    # Should print only 1 device
+    print(f"Worker {gpu_id}: {torch.cuda.device_count()} device(s)")
+    print(f"Device name: {torch.cuda.get_device_name(0)}")
+```
+
+### Common Mistake
+
+```python
+# WRONG: Setting after import
+import torch
+os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # Too late!
+
+# WRONG: Using device index directly
+model.to(f"cuda:{gpu_id}")  # Sees all GPUs, doesn't isolate
+```
+
+## GPU Memory OOM
+
+### Symptom
+```
+RuntimeError: CUDA out of memory. Tried to allocate X MiB
+```
+
+### Diagnosis
+
+```python
+def check_memory():
+    import torch
+    for i in range(torch.cuda.device_count()):
+        props = torch.cuda.get_device_properties(i)
+        total = props.total_memory / 1e9
+        reserved = torch.cuda.memory_reserved(i) / 1e9
+        allocated = torch.cuda.memory_allocated(i) / 1e9
+        print(f"GPU {i}: {allocated:.1f}GB allocated, {reserved:.1f}GB reserved, {total:.1f}GB total")
+```
+
+### Solutions
+
+1. **Reduce batch size**
+```python
+batch_size = 64  # Start small, increase until OOM
+```
+
+2. **Enable mixed precision**
+```python
+with torch.cuda.amp.autocast():
+    output = model(input)
+```
+
+3. **Clear cache between batches**
+```python
+torch.cuda.empty_cache()  # Use sparingly, has overhead
+```
+
+4. **Reduce workers per GPU**
+```python
+# If model uses 8GB on 24GB GPU
+workers_per_gpu = 24 // 8 - 1  # Leave headroom = 2 workers
+```
+
+### Memory Planning Formula
+
+```
+available_memory = total_gpu_memory - cuda_overhead (2-3GB)
+model_memory = model_size * precision_multiplier
+  - FP32: model_params * 4 bytes
+  - FP16: model_params * 2 bytes
+  - INT8: model_params * 1 byte
+
+workers_per_gpu = floor(available_memory / model_memory)
+```
+
+## Pickling Errors
+
+### Symptom
+```
+_pickle.PicklingError: Can't pickle <local object>
+```
+
+### Common Causes
+
+1. **Lambda functions**
+```python
+# WRONG
+executor.submit(lambda x: process(x), data)
+
+# CORRECT
+def process_wrapper(data):
+    return process(data)
+executor.submit(process_wrapper, data)
+```
+
+2. **Nested functions**
+```python
+# WRONG
+def outer():
+    def inner(x):
+        return x * 2
+    executor.submit(inner, data)
+
+# CORRECT: Define at module level
+def inner(x):
+    return x * 2
+```
+
+3. **CUDA tensors**
+```python
+# WRONG: Passing CUDA tensor to worker
+executor.submit(process, tensor.cuda())
+
+# CORRECT: Pass CPU tensor, move to GPU in worker
+executor.submit(process, tensor.cpu())
+```
+
+## Process Hangs
+
+### Symptom
+- Workers never complete
+- No progress bar updates
+- CPU/GPU utilization drops to 0
+
+### Diagnosis
+
+```python
+# Add timeout to futures
+for future in as_completed(futures, timeout=300):
+    try:
+        result = future.result(timeout=60)
+    except TimeoutError:
+        print(f"Worker timed out")
+```
+
+### Common Causes
+
+1. **Deadlock in worker**
+   - Check for locks that never release
+   - Ensure thread-safe data structures
+
+2. **CUDA synchronization hang**
+```python
+# Add sync points for debugging
+torch.cuda.synchronize()
+print("Sync point reached")
+```
+
+3. **I/O blocking**
+```python
+# Set timeouts on I/O operations
+img = cv2.imread(path)  # Can hang on network storage
+```
+
+## Debugging Checklist
+
+1. [ ] Using spawn context?
+2. [ ] CUDA_VISIBLE_DEVICES set before imports?
+3. [ ] Functions defined at module level (not nested)?
+4. [ ] No CUDA tensors passed between processes?
+5. [ ] Sufficient GPU memory for batch size?
+6. [ ] Timeouts set for futures?
+7. [ ] Progress tracking (tqdm) enabled?
+
+## Quick Fixes
+
+| Issue | Quick Fix |
+|-------|-----------|
+| Silent None returns | Add spawn context |
+| All workers on GPU 0 | Set CUDA_VISIBLE_DEVICES first |
+| OOM | Reduce batch_size by 50% |
+| Pickle error | Move function to module level |
+| Process hangs | Add timeout, check I/O |

package/.claude/skills/gpu-parallel-pipeline/scripts/check_gpu_memory.py
ADDED

@@ -0,0 +1,80 @@
+#!/usr/bin/env python
+"""GPU memory check utility for parallel pipeline planning.
+
+Reports available GPU memory and recommends workers per GPU based on model size.
+
+Usage:
+    python check_gpu_memory.py
+    python check_gpu_memory.py --model-memory 5.0  # Specify model memory in GB
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+
+
+def check_gpu_memory(model_memory_gb: float | None = None) -> None:
+    """Check GPU memory and recommend worker count.
+
+    Args:
+        model_memory_gb: Estimated model memory usage in GB (optional)
+    """
+    try:
+        import torch
+    except ImportError:
+        print("PyTorch not installed. Install with: pip install torch")
+        sys.exit(1)
+
+    if not torch.cuda.is_available():
+        print("CUDA not available")
+        sys.exit(1)
+
+    n_gpus = torch.cuda.device_count()
+    print(f"Found {n_gpus} GPU(s)\n")
+    print("=" * 60)
+
+    total_available = 0
+    cuda_overhead_gb = 2.5  # Reserved for CUDA context
+
+    for i in range(n_gpus):
+        props = torch.cuda.get_device_properties(i)
+        total_gb = props.total_memory / 1e9
+        available_gb = total_gb - cuda_overhead_gb
+
+        print(f"GPU {i}: {props.name}")
+        print(f"  Total memory: {total_gb:.1f} GB")
+        print(f"  Available (after CUDA overhead): {available_gb:.1f} GB")
+
+        if model_memory_gb:
+            workers = int(available_gb / model_memory_gb)
+            print(f"  Recommended workers (for {model_memory_gb}GB model): {workers}")
+
+        total_available += available_gb
+        print()
+
+    print("=" * 60)
+    print(f"Total available memory: {total_available:.1f} GB")
+
+    if model_memory_gb:
+        total_workers = int(total_available / model_memory_gb)
+        print(f"Total recommended workers: {total_workers}")
+        print(f"\nSuggested command:")
+        print(f"  --n-gpus {n_gpus} --batch-size 64")
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Check GPU memory for parallel pipeline")
+    parser.add_argument(
+        "--model-memory",
+        type=float,
+        default=None,
+        help="Estimated model memory usage in GB",
+    )
+    args = parser.parse_args()
+
+    check_gpu_memory(args.model_memory)
+
+
+if __name__ == "__main__":
+    main()

package/.claude/skills/interview-methodology/SKILL.md
CHANGED

@@ -140,11 +140,20 @@ Each option should explain implications, not just the choice itself.
 When you've gathered sufficient information:
 1. Summarize all requirements back to the user
 2. Ask for confirmation using AskUserQuestion
-3. Write the comprehensive spec to
+3. Write the comprehensive spec to `.claude/spec/{YYYY-MM-DD}-{feature-name}.md`
+   - Date: Interview completion date (ISO format)
+   - Feature name: kebab-case (e.g., `dark-mode`, `user-authentication`)
+   - Create `.claude/spec/` directory if it doesn't exist
 4. The spec should be detailed enough for implementation without further questions

 ## Spec Output Format

+**File path**: `.claude/spec/{YYYY-MM-DD}-{feature-name}.md`
+
+**Examples**:
+- `.claude/spec/2026-01-20-dark-mode.md`
+- `.claude/spec/2026-01-20-api-rate-limiting.md`
+
 After interview completion, write a spec file with:

 ```markdown

package/package.json
CHANGED
package/.mcp.json
DELETED

@@ -1,35 +0,0 @@
-{
-  "mcpServers": {
-    "mcpdocs": {
-      "type": "stdio",
-      "command": "npx",
-      "args": [
-        "@hapus/mcp-cache",
-        "uvx",
-        "--from",
-        "mcpdoc",
-        "mcpdoc",
-        "--urls",
-        "ClaudeCode:https://code.claude.com/docs/llms.txt",
-        "Cursor:https://docs.cursor.com/llms.txt",
-        "--allowed-domains",
-        "*"
-      ]
-    },
-    "deepwiki": {
-      "type": "http",
-      "url": "https://mcp.deepwiki.com/mcp"
-    },
-    "serena": {
-      "type": "stdio",
-      "command": "uvx",
-      "args": [
-        "--from",
-        "git+https://github.com/oraios/serena",
-        "serena-mcp-server",
-        "--enable-web-dashboard",
-        "false"
-      ]
-    }
-  }
-}