@yeongjaeyou/claude-code-config 0.21.2 → 0.23.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,99 @@
1
+ ---
2
+ name: gpu-parallel-pipeline
3
+ description: Design and implement PyTorch GPU parallel processing pipelines for maximum throughput. Use when scaling workloads across multiple GPUs (ProcessPool, CUDA_VISIBLE_DEVICES isolation), optimizing single GPU utilization (CUDA Streams, async inference, model batching), or building I/O + compute pipelines (ThreadPool for loading, batch inference). Triggers on "multi-GPU", "GPU parallel", "batch inference", "CUDA isolation", "GPU utilization", "ProcessPool GPU", "PyTorch multi-GPU".
4
+ ---
5
+
6
+ # GPU Parallel Pipeline
7
+
8
+ ## Overview
9
+
10
+ This skill provides patterns for maximizing GPU throughput in data processing pipelines.
11
+
12
+ **Three core patterns:**
13
+ 1. **Multi-GPU Distribution** - ProcessPool with GPU isolation via CUDA_VISIBLE_DEVICES
14
+ 2. **Single GPU Optimization** - CUDA Streams, async inference, model batching
15
+ 3. **I/O + Compute Pipeline** - ThreadPool for I/O parallelization + batch inference
16
+
17
+ ## Quick Reference
18
+
19
+ | Pattern | Use Case | Speedup |
20
+ |---------|----------|---------|
21
+ | Multi-GPU ProcessPool | Large dataset, multiple GPUs | ~N× (N = GPU count) |
22
+ | ThreadPool I/O + Batch | I/O bottleneck (image loading) | 2-5x |
23
+ | CUDA Streams | Multiple models on single GPU | 1.5-3x |
24
+
25
+ ## Multi-GPU Architecture
26
+
27
+ ```
28
+ Main Process (Coordinator)
29
+ |
30
+ +-- GPU 0: ProcessPool Worker (CUDA_VISIBLE_DEVICES=0)
31
+ | +-- ThreadPool (I/O)
32
+ | +-- Model batch inference
33
+ |
34
+ +-- GPU 1: ProcessPool Worker (CUDA_VISIBLE_DEVICES=1)
35
+ | +-- ThreadPool (I/O)
36
+ | +-- Model batch inference
37
+ |
38
+ +-- GPU N: ...
39
+ ```
40
+
41
+ ### Key Implementation Steps
42
+
43
+ 1. **Worker initialization with GPU isolation**
44
+ ```python
45
+ import os
+ 
+ def _worker_init_with_gpu(gpu_id: int) -> None:
+     os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
+     # Initialize model here (once per worker)
+     global _model
+     _model = load_model()
50
+ ```
51
+
52
+ 2. **Spawn context (not fork)**
53
+ ```python
54
+ import multiprocessing as mp
+ from concurrent.futures import ProcessPoolExecutor
+ 
+ ctx = mp.get_context("spawn")  # Required for CUDA
+ with ProcessPoolExecutor(max_workers=n_gpus, mp_context=ctx) as executor:
+     ...
57
+ ```
58
+
59
+ 3. **Chunk distribution**
60
+ ```python
61
+ chunk_size = (n_total + n_gpus - 1) // n_gpus
62
+ chunks = [records[i*chunk_size:(i+1)*chunk_size] for i in range(n_gpus)]
63
+ ```
64
+
65
+ ## I/O + Compute Pipeline
66
+
67
+ Separate I/O (disk read) from compute (GPU inference) using ThreadPool:
68
+
69
+ ```python
70
+ from concurrent.futures import ThreadPoolExecutor, as_completed
+ 
+ import cv2
+ 
+ def _load_images_parallel(paths: list[str], max_workers: int = 8) -> dict:
+     with ThreadPoolExecutor(max_workers=max_workers) as executor:
+         futures = {executor.submit(cv2.imread, p): p for p in paths}
+         return {futures[f]: f.result() for f in as_completed(futures)}
74
+
75
+ def process_batch_hybrid(batch: list[dict]) -> list[dict]:
+     # 1. ThreadPool I/O
+     images = _load_images_parallel([r["path"] for r in batch])
+     # 2. GPU batch inference (reorder by batch, since as_completed yields in completion order)
+     ordered = [images[r["path"]] for r in batch]
+     results = model.predict_batch(ordered)
+     return results
81
+ ```
82
+
83
+ ## Detailed References
84
+
85
+ - **[architecture.md](references/architecture.md)**: Multi-GPU ProcessPool design, worker lifecycle, error handling
86
+ - **[single-gpu-patterns.md](references/single-gpu-patterns.md)**: CUDA Streams, async inference, model parallelism
87
+ - **[troubleshooting.md](references/troubleshooting.md)**: spawn vs fork, OOM, CUDA_VISIBLE_DEVICES issues
88
+
89
+ ## Memory Planning
90
+
91
+ Before implementation, check GPU memory:
92
+ ```bash
93
+ python scripts/check_gpu_memory.py
94
+ ```
95
+
96
+ **Rule of thumb:**
97
+ - Workers per GPU = (GPU_Memory - CUDA overhead) / Model_Memory
+ - Leave 2-3 GB headroom for CUDA context overhead
+ - Example: 24 GB GPU, 5 GB model → (24 - 2.5) / 5 ≈ 4 workers/GPU max
@@ -0,0 +1,194 @@
1
+ # Multi-GPU Architecture
2
+
3
+ ## Table of Contents
4
+ - [ProcessPool with GPU Isolation](#processpool-with-gpu-isolation)
5
+ - [Chunk Distribution Pattern](#chunk-distribution-pattern)
6
+ - [Complete Multi-GPU Orchestration](#complete-multi-gpu-orchestration)
7
+ - [Worker Lifecycle](#worker-lifecycle)
8
+ - [Error Handling Strategy](#error-handling-strategy)
9
+ - [Progress Tracking](#progress-tracking)
10
+ - [Performance Considerations](#performance-considerations)
11
+
12
+ ## ProcessPool with GPU Isolation
13
+
14
+ ### Why ProcessPool over ThreadPool for GPU?
15
+
16
+ The GIL is released while CUDA kernels execute, so GPU compute itself is not throttled by Python threads. However, CUDA contexts and the CUDA_VISIBLE_DEVICES setting are per-process, and the Python-side batching and pre/post-processing still contend for the GIL. Giving each GPU a dedicated process provides clean isolation: every worker sees exactly one device.
17
+
18
+ ### CUDA_VISIBLE_DEVICES Isolation
19
+
20
+ ```python
21
+ import os
22
+ import multiprocessing as mp
23
+ from concurrent.futures import ProcessPoolExecutor, as_completed
24
+
25
+ # Process-local state
26
+ _model = None
27
+ _gpu_id = None
28
+
29
+ def _worker_init_with_gpu(gpu_id: int) -> None:
30
+ """Initialize worker with GPU isolation.
31
+
32
+ Must be called at the start of each worker process.
33
+ CUDA_VISIBLE_DEVICES makes this GPU appear as device:0 to PyTorch/TF.
34
+ """
35
+ global _model, _gpu_id
36
+
37
+ os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
38
+ _gpu_id = gpu_id
39
+
40
+ # Import ML framework AFTER setting CUDA_VISIBLE_DEVICES
41
+ import torch
42
+ _model = YourModel().cuda() # Now on device:0 (the isolated GPU)
43
+ ```
44
+
45
+ ### Chunk Distribution Pattern
46
+
47
+ ```python
48
+ def distribute_to_gpus(records: list, n_gpus: int) -> list[tuple]:
49
+ """Distribute records evenly across GPUs.
50
+
51
+ Returns list of (chunk, gpu_id, position) tuples.
52
+ """
53
+ if n_gpus < 1:
54
+ raise ValueError(f"n_gpus must be >= 1, got {n_gpus}")
55
+
56
+ n_total = len(records)
57
+ chunk_size = (n_total + n_gpus - 1) // n_gpus # ceiling division
58
+
59
+ chunks = []
60
+ for i in range(n_gpus):
61
+ start = i * chunk_size
62
+ end = min(start + chunk_size, n_total)
63
+ if start < n_total:
64
+ chunks.append((records[start:end], i, i)) # (data, gpu_id, tqdm_position)
65
+
66
+ return chunks
67
+ ```
68
+
69
+ ### Complete Multi-GPU Orchestration
70
+
71
+ ```python
72
+ def run_multi_gpu(
73
+ records: list[dict],
74
+ n_gpus: int = 4,
75
+ batch_size: int = 128,
76
+ ) -> list[dict]:
77
+ """Orchestrate multi-GPU parallel processing.
78
+
79
+ Args:
80
+ records: Data records to process
81
+ n_gpus: Number of GPUs to use
82
+ batch_size: Batch size per GPU
83
+
84
+ Returns:
85
+ Processed records with results
86
+ """
87
+ if not records:
88
+ return []
89
+
90
+ # Distribute data
91
+ chunks = distribute_to_gpus(records, n_gpus)
92
+ print(f"Distributing {len(records):,} items across {len(chunks)} GPUs")
93
+
94
+ # CRITICAL: Use spawn context for CUDA
95
+ ctx = mp.get_context("spawn")
96
+
97
+ # Track GPU assignments for error recovery
98
+ gpu_to_chunk = {gpu_id: chunk for chunk, gpu_id, _ in chunks}
99
+
100
+ all_results = []
101
+ failed_chunks = []
102
+
103
+ with ProcessPoolExecutor(max_workers=len(chunks), mp_context=ctx) as executor:
104
+ futures = {
105
+ executor.submit(_process_gpu_chunk, chunk, gpu_id, batch_size, pos): gpu_id
106
+ for chunk, gpu_id, pos in chunks
107
+ }
108
+
109
+ for future in as_completed(futures):
110
+ gpu_id = futures[future]
111
+ try:
112
+ results = future.result()
113
+ all_results.extend(results)
114
+ except Exception as e:
115
+ print(f"[ERROR] GPU {gpu_id} failed: {e}")
116
+ failed_chunks.append((gpu_id, gpu_to_chunk[gpu_id]))
117
+
118
+ # Handle failures gracefully (don't lose data)
119
+ if failed_chunks:
120
+ for gpu_id, chunk in failed_chunks:
121
+ for record in chunk:
122
+ record["_error"] = f"GPU {gpu_id} failed"
123
+ all_results.append(record)
124
+
125
+ return all_results
126
+ ```
127
+
128
+ ## Worker Lifecycle
129
+
130
+ ```
131
+ spawn context creates new process
132
+ |
133
+ v
134
+ _worker_init_with_gpu(gpu_id)
135
+ - Set CUDA_VISIBLE_DEVICES
136
+ - Import ML framework
137
+ - Load model to GPU
138
+ |
139
+ v
140
+ Process batches in loop
141
+ |
142
+ v
143
+ ProcessPool cleanup (model freed)
144
+ ```
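+ 
+ A minimal sketch of this lifecycle as one worker function, reusing `_worker_init_with_gpu` from above and a user-supplied `process_batch` (placeholder name). The explicit teardown is optional, since GPU memory is released when the worker process exits:
+ 
+ ```python
+ def _gpu_worker(records: list[dict], gpu_id: int, batch_size: int) -> list[dict]:
+     # 1. Init: isolate the GPU and load the model once per process
+     _worker_init_with_gpu(gpu_id)
+ 
+     # 2. Process batches in a loop
+     results = []
+     for i in range(0, len(records), batch_size):
+         results.extend(process_batch(records[i:i + batch_size]))
+ 
+     # 3. Optional explicit cleanup (process exit frees the GPU anyway)
+     import torch
+     torch.cuda.empty_cache()
+     return results
+ ```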
145
+
146
+ ### Error Handling Strategy
147
+
148
+ 1. **Per-GPU failure isolation**: One GPU failure shouldn't crash others
149
+ 2. **Data preservation**: Failed chunks get marked, not dropped
150
+ 3. **Graceful degradation**: Continue with remaining GPUs
151
+
152
+ ```python
153
+ # Track failures
154
+ failed_chunks: list[tuple[int, list]] = []
155
+
156
+ try:
157
+ results = future.result(timeout=300) # 5-min timeout
158
+ except Exception as e:
159
+ failed_chunks.append((gpu_id, original_chunk))
160
+
161
+ # After all futures complete
162
+ if failed_chunks:
163
+ print(f"[WARN] {len(failed_chunks)} GPU(s) failed")
164
+ # Add failed records with error markers
165
+ ```
166
+
167
+ ## Progress Tracking
168
+
169
+ Use tqdm with position parameter for multi-bar display:
170
+
171
+ ```python
172
+ from tqdm import tqdm
173
+
174
+ def _process_gpu_chunk(records, gpu_id, batch_size, position):
175
+ _worker_init_with_gpu(gpu_id)
176
+
177
+ batches = [records[i:i+batch_size] for i in range(0, len(records), batch_size)]
178
+ results = []
179
+
180
+ for batch in tqdm(batches, desc=f"GPU {gpu_id}", position=position, leave=False):
181
+ batch_results = process_batch(batch)
182
+ results.extend(batch_results)
183
+
184
+ return results
185
+ ```
186
+
187
+ ## Performance Considerations
188
+
189
+ | Factor | Recommendation |
190
+ |--------|---------------|
191
+ | Batch size | Start with 64-128, tune based on GPU memory |
192
+ | Workers per GPU | Usually 1 for large models, 2-4 for small models |
193
+ | I/O workers | 4-8 ThreadPool workers per GPU worker |
194
+ | Chunk size | Balanced across GPUs (ceiling division) |
@@ -0,0 +1,225 @@
1
+ # Single GPU Optimization Patterns
2
+
3
+ ## Table of Contents
4
+ - [CUDA Streams for Concurrent Operations](#cuda-streams-for-concurrent-operations)
5
+ - [Async Inference Pattern](#async-inference-pattern)
6
+ - [Model Batching for Multiple Small Models](#model-batching-for-multiple-small-models)
7
+ - [Dynamic Batching](#dynamic-batching)
8
+ - [Memory Optimization](#memory-optimization)
9
+ - [Throughput Measurement](#throughput-measurement)
10
+ - [Best Practices Summary](#best-practices-summary)
11
+
12
+ ## CUDA Streams for Concurrent Operations
13
+
14
+ CUDA streams allow overlapping data transfer and computation:
15
+
16
+ ```python
17
+ import torch
18
+
19
+ def process_with_streams(batches: list, model):
20
+ """Process batches using CUDA streams for overlap."""
21
+ streams = [torch.cuda.Stream() for _ in range(2)]
22
+ results = []
23
+
24
+ for i, batch in enumerate(batches):
25
+ stream = streams[i % 2]
26
+
27
+ with torch.cuda.stream(stream):
28
+ # Transfer to GPU
29
+ data = batch.cuda(non_blocking=True)
30
+ # Compute
31
+ output = model(data)
32
+ results.append(output)
33
+
34
+ # Synchronize all streams
35
+ torch.cuda.synchronize()
36
+ return results
37
+ ```
38
+
39
+ ## Async Inference Pattern
40
+
41
+ For pipelines with I/O and compute stages:
42
+
43
+ ```python
44
+ import asyncio
+ from concurrent.futures import ThreadPoolExecutor
+ 
+ import torch
46
+
47
+ class AsyncInferencePipeline:
48
+ def __init__(self, model, io_workers: int = 4):
49
+ self.model = model
50
+ self.io_executor = ThreadPoolExecutor(max_workers=io_workers)
51
+ self.batch_queue = asyncio.Queue(maxsize=2) # Prefetch 2 batches
52
+
53
+ async def load_batch(self, paths: list[str]):
54
+ """Load batch in thread pool (non-blocking)."""
55
+ loop = asyncio.get_event_loop()
56
+ images = await loop.run_in_executor(
57
+ self.io_executor,
58
+ lambda: [load_image(p) for p in paths]
59
+ )
60
+ return torch.stack(images)
61
+
62
+ async def producer(self, all_paths: list[str], batch_size: int):
63
+ """Continuously load batches."""
64
+ for i in range(0, len(all_paths), batch_size):
65
+ batch_paths = all_paths[i:i+batch_size]
66
+ batch = await self.load_batch(batch_paths)
67
+ await self.batch_queue.put(batch)
68
+ await self.batch_queue.put(None) # Signal end
69
+
70
+ async def consumer(self):
71
+ """Process batches as they arrive."""
72
+ results = []
73
+ while True:
74
+ batch = await self.batch_queue.get()
75
+ if batch is None:
76
+ break
77
+ with torch.no_grad():
78
+ output = self.model(batch.cuda())
79
+ results.append(output.cpu())
80
+ return results
81
+
82
+ async def run(self, paths: list[str], batch_size: int = 32):
83
+ producer_task = asyncio.create_task(self.producer(paths, batch_size))
84
+ results = await self.consumer()
85
+ await producer_task
86
+ return results
87
+ ```
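+ 
+ A brief usage sketch (hypothetical driver; `model` and `image_paths` are placeholders, and `load_image` is assumed to return a CPU tensor as in the class above):
+ 
+ ```python
+ import asyncio
+ 
+ pipeline = AsyncInferencePipeline(model, io_workers=4)
+ outputs = asyncio.run(pipeline.run(image_paths, batch_size=32))
+ print(f"Collected {len(outputs)} output batches")
+ ```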
88
+
89
+ ## Model Batching for Multiple Small Models
90
+
91
+ Run multiple small models on single GPU:
92
+
93
+ ```python
94
+ class MultiModelPipeline:
95
+ """Run multiple models efficiently on single GPU."""
96
+
97
+ def __init__(self, models: list):
98
+ self.models = [m.cuda() for m in models]
99
+ self.streams = [torch.cuda.Stream() for _ in models]
100
+
101
+ def forward_all(self, inputs: list[torch.Tensor]) -> list[torch.Tensor]:
102
+ """Run all models concurrently using streams."""
103
+ outputs = [None] * len(self.models)
104
+
105
+ # Launch all models
106
+ for i, (model, stream, x) in enumerate(zip(self.models, self.streams, inputs)):
107
+ with torch.cuda.stream(stream):
108
+ outputs[i] = model(x.cuda(non_blocking=True))
109
+
110
+ # Wait for all
111
+ torch.cuda.synchronize()
112
+ return outputs
113
+ ```
114
+
115
+ ## Dynamic Batching
116
+
117
+ Maximize GPU utilization with variable batch sizes:
118
+
119
+ ```python
120
+ import time
+ 
+ import torch
+ 
+ class DynamicBatcher:
121
+ """Accumulate inputs until batch is full or timeout."""
122
+
123
+ def __init__(self, model, max_batch: int = 64, timeout_ms: int = 10):
124
+ self.model = model
125
+ self.max_batch = max_batch
126
+ self.timeout_ms = timeout_ms
127
+ self.pending = []
128
+ self.last_submit = time.time()
129
+
130
+ def add(self, item):
131
+ self.pending.append(item)
132
+
133
+ should_process = (
134
+ len(self.pending) >= self.max_batch or
135
+ (time.time() - self.last_submit) * 1000 > self.timeout_ms
136
+ )
137
+
138
+ if should_process and self.pending:
139
+ return self._process_batch()
140
+ return None
141
+
142
+ def _process_batch(self):
143
+ batch = torch.stack(self.pending[:self.max_batch])
144
+ self.pending = self.pending[self.max_batch:]
145
+ self.last_submit = time.time()
146
+
147
+ with torch.no_grad():
148
+ return self.model(batch.cuda())
149
+ ```
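+ 
+ A hedged usage sketch, assuming the items passed to `add()` are same-shaped CPU tensors (the batcher stacks them before moving to GPU); `my_model` and `incoming_tensors` are placeholders:
+ 
+ ```python
+ batcher = DynamicBatcher(my_model, max_batch=64, timeout_ms=10)
+ 
+ outputs = []
+ for item in incoming_tensors:      # e.g. preprocessed CPU tensors
+     result = batcher.add(item)     # returns a batch of outputs when it flushes, else None
+     if result is not None:
+         outputs.append(result)
+ 
+ # Flush whatever is still pending at the end of the stream
+ if batcher.pending:
+     outputs.append(batcher._process_batch())
+ ```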
150
+
151
+ ## Memory Optimization
152
+
153
+ ### Gradient Checkpointing (Training)
154
+
155
+ ```python
156
+ import torch.nn as nn
+ from torch.utils.checkpoint import checkpoint
157
+
158
+ class EfficientModel(nn.Module):
159
+ def forward(self, x):
160
+ # Checkpoint intermediate layers to save memory
161
+ x = checkpoint(self.layer1, x)
162
+ x = checkpoint(self.layer2, x)
163
+ x = self.head(x)
164
+ return x
165
+ ```
166
+
167
+ ### Mixed Precision Inference
168
+
169
+ ```python
170
+ with torch.cuda.amp.autocast():
171
+ output = model(input) # Uses FP16 automatically
172
+ ```
173
+
174
+ ### Memory-Efficient Attention (for transformers)
175
+
176
+ ```python
177
+ # Use torch.nn.functional.scaled_dot_product_attention (PyTorch 2.0+)
178
+ # Automatically uses FlashAttention when available
179
+ from torch.nn.functional import scaled_dot_product_attention
180
+
181
+ attn_output = scaled_dot_product_attention(q, k, v, is_causal=True)
182
+ ```
183
+
184
+ ## Throughput Measurement
185
+
186
+ ```python
187
+ import time
188
+ import torch
189
+
190
+ def benchmark_throughput(model, input_shape, n_iterations=100, warmup=10):
191
+ """Measure model throughput in samples/second."""
192
+ model.eval()
193
+ dummy_input = torch.randn(*input_shape).cuda()
194
+
195
+ # Warmup
196
+ for _ in range(warmup):
197
+ with torch.no_grad():
198
+ _ = model(dummy_input)
199
+
200
+ torch.cuda.synchronize()
201
+ start = time.perf_counter()
202
+
203
+ for _ in range(n_iterations):
204
+ with torch.no_grad():
205
+ _ = model(dummy_input)
206
+
207
+ torch.cuda.synchronize()
208
+ elapsed = time.perf_counter() - start
209
+
210
+ batch_size = input_shape[0]
211
+ throughput = (n_iterations * batch_size) / elapsed
212
+ print(f"Throughput: {throughput:.1f} samples/sec")
213
+ return throughput
214
+ ```
215
+
216
+ ## Best Practices Summary
217
+
218
+ | Technique | When to Use | Memory Impact |
219
+ |-----------|-------------|---------------|
220
+ | CUDA Streams | Multiple independent ops | Minimal |
221
+ | Async I/O | I/O bottleneck | Minimal |
222
+ | Multi-model | Multiple small models | +1 model per stream |
223
+ | Dynamic batching | Variable input rate | Configurable |
224
+ | Mixed precision | Large models, Ampere+ GPU | -50% |
225
+ | Checkpointing | Training large models | -60% (slower) |
@@ -0,0 +1,247 @@
1
+ # Troubleshooting Guide
2
+
3
+ ## Table of Contents
4
+ - [spawn vs fork Context](#spawn-vs-fork-context)
5
+ - [CUDA_VISIBLE_DEVICES Issues](#cuda_visible_devices-issues)
6
+ - [GPU Memory OOM](#gpu-memory-oom)
7
+ - [Pickling Errors](#pickling-errors)
8
+ - [Process Hangs](#process-hangs)
9
+ - [Debugging Checklist](#debugging-checklist)
10
+ - [Quick Fixes](#quick-fixes)
11
+
12
+ ## spawn vs fork Context
13
+
14
+ ### Problem: Silent Failures with fork
15
+
16
+ When using `fork` context with CUDA:
17
+ - Worker processes inherit CUDA context from parent
18
+ - Functions may fail to pickle correctly
19
+ - Workers might return None silently instead of crashing
20
+
21
+ ### Symptom
22
+ ```
23
+ # Processing completes in seconds instead of hours
24
+ # All results are None
25
+ # No error messages
26
+ ```
27
+
28
+ ### Solution: Always Use spawn
29
+
30
+ ```python
31
+ import multiprocessing as mp
32
+
33
+ # WRONG
34
+ with ProcessPoolExecutor(max_workers=4) as executor:
35
+ ...
36
+
37
+ # CORRECT
38
+ ctx = mp.get_context("spawn")
39
+ with ProcessPoolExecutor(max_workers=4, mp_context=ctx) as executor:
40
+ ...
41
+ ```
42
+
43
+ ### Why spawn works
44
+
45
+ | Context | Behavior | CUDA Safe |
46
+ |---------|----------|-----------|
47
+ | fork | Copy parent process memory | No |
48
+ | spawn | Start fresh process | Yes |
49
+ | forkserver | Fork from server process | Partial |
50
+
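+ If the start method is set globally instead of per-executor, it must sit under an `if __name__ == "__main__":` guard, because spawn re-imports the main module in every worker — a minimal sketch:
+ 
+ ```python
+ import multiprocessing as mp
+ from concurrent.futures import ProcessPoolExecutor
+ 
+ def work(x: int) -> int:
+     return x * 2
+ 
+ if __name__ == "__main__":
+     # Without this guard, spawn's re-import of __main__ would try to start
+     # workers recursively and raise a bootstrapping error.
+     mp.set_start_method("spawn", force=True)
+     with ProcessPoolExecutor(max_workers=4) as executor:
+         print(list(executor.map(work, range(8))))
+ ```
+ 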
51
+ ## CUDA_VISIBLE_DEVICES Issues
52
+
53
+ ### Problem: All Workers Use Same GPU
54
+
55
+ Without isolation, every worker sees all GPUs and defaults to the same device (typically GPU 0).
56
+
57
+ ### Solution: Set Early in Worker
58
+
59
+ ```python
60
+ import os
+ 
+ def _worker_init(gpu_id: int):
+     # MUST run before any CUDA/ML framework import
+     os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
63
+
64
+ # NOW import PyTorch
65
+ import torch
66
+
67
+ # device:0 is now the isolated GPU
68
+ model = Model().to("cuda:0")
69
+ ```
70
+
71
+ ### Verification
72
+
73
+ ```python
74
+ def _worker_init(gpu_id: int):
75
+ os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
76
+ import torch
77
+
78
+ # Should print only 1 device
79
+ print(f"Worker {gpu_id}: {torch.cuda.device_count()} device(s)")
80
+ print(f"Device name: {torch.cuda.get_device_name(0)}")
81
+ ```
82
+
83
+ ### Common Mistake
84
+
85
+ ```python
86
+ # WRONG: Setting after import
87
+ import torch
88
+ os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id) # Too late!
89
+
90
+ # WRONG: Using device index directly
91
+ model.to(f"cuda:{gpu_id}") # Sees all GPUs, doesn't isolate
92
+ ```
93
+
94
+ ## GPU Memory OOM
95
+
96
+ ### Symptom
97
+ ```
98
+ RuntimeError: CUDA out of memory. Tried to allocate X MiB
99
+ ```
100
+
101
+ ### Diagnosis
102
+
103
+ ```python
104
+ def check_memory():
105
+ import torch
106
+ for i in range(torch.cuda.device_count()):
107
+ props = torch.cuda.get_device_properties(i)
108
+ total = props.total_memory / 1e9
109
+ reserved = torch.cuda.memory_reserved(i) / 1e9
110
+ allocated = torch.cuda.memory_allocated(i) / 1e9
111
+ print(f"GPU {i}: {allocated:.1f}GB allocated, {reserved:.1f}GB reserved, {total:.1f}GB total")
112
+ ```
113
+
114
+ ### Solutions
115
+
116
+ 1. **Reduce batch size**
117
+ ```python
118
+ batch_size = 64 # Start small, increase until OOM
119
+ ```
120
+
121
+ 2. **Enable mixed precision**
122
+ ```python
123
+ with torch.cuda.amp.autocast():
124
+ output = model(input)
125
+ ```
126
+
127
+ 3. **Clear cache between batches**
128
+ ```python
129
+ torch.cuda.empty_cache() # Use sparingly, has overhead
130
+ ```
131
+
132
+ 4. **Reduce workers per GPU**
133
+ ```python
134
+ # If model uses 8GB on 24GB GPU
135
+ workers_per_gpu = 24 // 8 - 1 # Leave headroom = 2 workers
136
+ ```
137
+
138
+ ### Memory Planning Formula
139
+
140
+ ```
141
+ available_memory = total_gpu_memory - cuda_overhead (2-3GB)
142
+ model_memory = model_size * precision_multiplier
143
+ - FP32: model_params * 4 bytes
144
+ - FP16: model_params * 2 bytes
145
+ - INT8: model_params * 1 byte
146
+
147
+ workers_per_gpu = floor(available_memory / model_memory)
148
+ ```
149
+
150
+ ## Pickling Errors
151
+
152
+ ### Symptom
153
+ ```
154
+ _pickle.PicklingError: Can't pickle <local object>
155
+ ```
156
+
157
+ ### Common Causes
158
+
159
+ 1. **Lambda functions**
160
+ ```python
161
+ # WRONG
162
+ executor.submit(lambda x: process(x), data)
163
+
164
+ # CORRECT
165
+ def process_wrapper(data):
166
+ return process(data)
167
+ executor.submit(process_wrapper, data)
168
+ ```
169
+
170
+ 2. **Nested functions**
171
+ ```python
172
+ # WRONG
173
+ def outer():
174
+ def inner(x):
175
+ return x * 2
176
+ executor.submit(inner, data)
177
+
178
+ # CORRECT: Define at module level
179
+ def inner(x):
180
+ return x * 2
181
+ ```
182
+
183
+ 3. **CUDA tensors**
184
+ ```python
185
+ # WRONG: Passing CUDA tensor to worker
186
+ executor.submit(process, tensor.cuda())
187
+
188
+ # CORRECT: Pass CPU tensor, move to GPU in worker
189
+ executor.submit(process, tensor.cpu())
190
+ ```
191
+
192
+ ## Process Hangs
193
+
194
+ ### Symptom
195
+ - Workers never complete
196
+ - No progress bar updates
197
+ - CPU/GPU utilization drops to 0
198
+
199
+ ### Diagnosis
200
+
201
+ ```python
202
+ from concurrent.futures import TimeoutError as FuturesTimeout
+ 
+ # Add timeouts so a hung worker surfaces as an error instead of blocking forever
+ for future in as_completed(futures, timeout=300):
+     try:
+         result = future.result(timeout=60)
+     except FuturesTimeout:  # not the builtin TimeoutError before Python 3.11
+         print("Worker timed out")
208
+ ```
209
+
210
+ ### Common Causes
211
+
212
+ 1. **Deadlock in worker**
213
+ - Check for locks that never release
214
+ - Ensure thread-safe data structures
215
+
216
+ 2. **CUDA synchronization hang**
217
+ ```python
218
+ # Add sync points for debugging
219
+ torch.cuda.synchronize()
220
+ print("Sync point reached")
221
+ ```
222
+
223
+ 3. **I/O blocking**
224
+ ```python
225
+ # cv2.imread has no timeout and can hang on network storage;
+ # wrap blocking reads with a timeout (see the sketch below)
+ img = cv2.imread(path)
227
+ ```
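+ 
+ One way to enforce such a timeout is to run the blocking read in a thread pool and give up after a deadline — a sketch (the `read_with_timeout` helper is hypothetical, not part of OpenCV):
+ 
+ ```python
+ from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout
+ 
+ import cv2
+ 
+ _io_pool = ThreadPoolExecutor(max_workers=8)
+ 
+ def read_with_timeout(path: str, timeout_s: float = 10.0):
+     """Return the decoded image, or None if the read fails or exceeds timeout_s."""
+     future = _io_pool.submit(cv2.imread, path)
+     try:
+         return future.result(timeout=timeout_s)
+     except FuturesTimeout:
+         future.cancel()  # best effort; the underlying read may still be running
+         return None
+ ```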
228
+
229
+ ## Debugging Checklist
230
+
231
+ 1. [ ] Using spawn context?
232
+ 2. [ ] CUDA_VISIBLE_DEVICES set before imports?
233
+ 3. [ ] Functions defined at module level (not nested)?
234
+ 4. [ ] No CUDA tensors passed between processes?
235
+ 5. [ ] Sufficient GPU memory for batch size?
236
+ 6. [ ] Timeouts set for futures?
237
+ 7. [ ] Progress tracking (tqdm) enabled?
238
+
239
+ ## Quick Fixes
240
+
241
+ | Issue | Quick Fix |
242
+ |-------|-----------|
243
+ | Silent None returns | Add spawn context |
244
+ | All workers on GPU 0 | Set CUDA_VISIBLE_DEVICES first |
245
+ | OOM | Reduce batch_size by 50% |
246
+ | Pickle error | Move function to module level |
247
+ | Process hangs | Add timeout, check I/O |
@@ -0,0 +1,80 @@
1
+ #!/usr/bin/env python
2
+ """GPU memory check utility for parallel pipeline planning.
3
+
4
+ Reports available GPU memory and recommends workers per GPU based on model size.
5
+
6
+ Usage:
7
+ python check_gpu_memory.py
8
+ python check_gpu_memory.py --model-memory 5.0 # Specify model memory in GB
9
+ """
10
+
11
+ from __future__ import annotations
12
+
13
+ import argparse
14
+ import sys
15
+
16
+
17
+ def check_gpu_memory(model_memory_gb: float | None = None) -> None:
18
+ """Check GPU memory and recommend worker count.
19
+
20
+ Args:
21
+ model_memory_gb: Estimated model memory usage in GB (optional)
22
+ """
23
+ try:
24
+ import torch
25
+ except ImportError:
26
+ print("PyTorch not installed. Install with: pip install torch")
27
+ sys.exit(1)
28
+
29
+ if not torch.cuda.is_available():
30
+ print("CUDA not available")
31
+ sys.exit(1)
32
+
33
+ n_gpus = torch.cuda.device_count()
34
+ print(f"Found {n_gpus} GPU(s)\n")
35
+ print("=" * 60)
36
+
37
+ total_available = 0
38
+ cuda_overhead_gb = 2.5 # Reserved for CUDA context
39
+
40
+ for i in range(n_gpus):
41
+ props = torch.cuda.get_device_properties(i)
42
+ total_gb = props.total_memory / 1e9
43
+ available_gb = total_gb - cuda_overhead_gb
44
+
45
+ print(f"GPU {i}: {props.name}")
46
+ print(f" Total memory: {total_gb:.1f} GB")
47
+ print(f" Available (after CUDA overhead): {available_gb:.1f} GB")
48
+
49
+ if model_memory_gb:
50
+ workers = int(available_gb / model_memory_gb)
51
+ print(f" Recommended workers (for {model_memory_gb}GB model): {workers}")
52
+
53
+ total_available += available_gb
54
+ print()
55
+
56
+ print("=" * 60)
57
+ print(f"Total available memory: {total_available:.1f} GB")
58
+
59
+ if model_memory_gb:
60
+ total_workers = int(total_available / model_memory_gb)
61
+ print(f"Total recommended workers: {total_workers}")
62
+ print(f"\nSuggested command:")
63
+ print(f" --n-gpus {n_gpus} --batch-size 64")
64
+
65
+
66
+ def main():
67
+ parser = argparse.ArgumentParser(description="Check GPU memory for parallel pipeline")
68
+ parser.add_argument(
69
+ "--model-memory",
70
+ type=float,
71
+ default=None,
72
+ help="Estimated model memory usage in GB",
73
+ )
74
+ args = parser.parse_args()
75
+
76
+ check_gpu_memory(args.model_memory)
77
+
78
+
79
+ if __name__ == "__main__":
80
+ main()
@@ -0,0 +1,194 @@
1
+ ---
2
+ name: translate-web-article
3
+ description: Convert web pages to Korean markdown documents. Fetches page via firecrawl, translates text to Korean, analyzes images with VLM for Korean captions, preserves code/tables with explanations. Use for tech blogs, papers, documentation. Triggers on "translate web page", "blog to Korean", "translate this article".
4
+ ---
5
+
6
+ # Web Article Translator
7
+
8
+ Converts web pages to Korean markdown while analyzing images with VLM to generate context-aware Korean captions.
9
+
10
+ ## Workflow
11
+
12
+ ```
13
+ URL Input
14
+ |
15
+ +-- Fetch page via firecrawl (markdown + links)
16
+ |
17
+ +-- Ask user options via AskUserQuestion
18
+ | +-- Output directory
19
+ | +-- Download images locally or not
20
+ |
21
+ +-- Process content
22
+ | +-- Text: Translate to Korean (keep tech terms)
23
+ | +-- Images: Download -> VLM analysis -> Korean caption
24
+ | +-- Code/Tables: Keep original + add explanation
25
+ |
26
+ +-- Generate markdown file
27
+ ```
28
+
29
+ ## Step 1: Fetch Web Page
30
+
31
+ Use firecrawl MCP:
32
+
33
+ ```
34
+ mcp__firecrawl__firecrawl_scrape
35
+ - url: target URL
36
+ - formats: ["markdown", "links"]
37
+ - onlyMainContent: true
38
+ ```
39
+
40
+ Return error for inaccessible pages:
41
+ - Login required
42
+ - Paywall content
43
+ - Blocked sites
44
+
45
+ ## Step 2: User Options
46
+
47
+ Use AskUserQuestion to confirm:
48
+
49
+ 1. **Output directory**: Where to save translated markdown
50
+ 2. **Download images**: Save locally or keep URL references
51
+
52
+ ## Step 3: Translation Rules
53
+
54
+ ### General Text
55
+
56
+ Translate to natural Korean.
57
+
58
+ ### Technical Terms
59
+
60
+ Keep original English. See `references/tech-terms.md`.
61
+
62
+ ```
63
+ Transformer, Fine-tuning, API, GPU, CUDA, Tokenizer,
64
+ Embedding, Attention, Backbone, Checkpoint, Epoch,
65
+ Batch Size, Learning Rate, Loss, Gradient, Weight...
66
+ ```
67
+
68
+ ### Code Blocks
69
+
70
+ Keep original + add Korean explanation below:
71
+
72
+ ````markdown
73
+ ```python
74
+ def train(model, data):
75
+ optimizer.zero_grad()
76
+ loss = model(data)
77
+ loss.backward()
78
+ optimizer.step()
79
+ ```
80
+ > 이 코드는 모델 학습의 한 스텝을 수행합니다. gradient 초기화, forward pass, backward pass, weight 업데이트 순으로 진행됩니다.
81
+ ````
82
+
83
+ ### Tables
84
+
85
+ Keep original + add Korean explanation below:
86
+
87
+ ```markdown
88
+ | Model | Params | Score |
89
+ |-------|--------|-------|
90
+ | BERT | 110M | 89.3 |
91
+ | GPT-2 | 1.5B | 91.2 |
92
+
93
+ > 이 테이블은 모델별 파라미터 수와 성능 점수를 비교합니다.
94
+ ```
95
+
96
+ ### Links
97
+
98
+ Keep URL, translate link text only:
99
+
100
+ ```markdown
101
+ 자세한 내용은 [공식 문서](https://example.com/docs)를 참고하세요.
102
+ ```
103
+
104
+ ## Step 4: Image Processing
105
+
106
+ ### Process Flow
107
+
108
+ 1. Extract image URLs from markdown
109
+ 2. Download to `/tmp` (use scripts/download_image.sh)
110
+ 3. Analyze with Read tool (VLM auto-applied)
111
+ 4. Generate Korean caption considering surrounding context
112
+ 5. Add VLM analysis as blockquote below image (alt text is hidden in preview)
113
+
114
+ ### Caption Guidelines
115
+
116
+ - Around 2 sentences
117
+ - Describe image meaning and role
118
+ - Reflect surrounding context
119
+ - Use blockquote format for visibility in markdown preview
120
+
121
+ Example:
122
+ ```markdown
123
+ ![Transformer 아키텍처](image_url)
124
+ *원문 캡션*
125
+
126
+ > Transformer 아키텍처의 전체 구조를 보여주는 다이어그램입니다. Encoder와 Decoder가 병렬로 배치되어 있으며, Multi-Head Attention 레이어가 핵심 구성요소입니다.
127
+ ```
128
+
129
+ ### Error Handling
130
+
131
+ When image load fails:
132
+
133
+ ```markdown
134
+ ![이미지 로드 실패](original_url)
135
+ > [경고] 이미지를 불러올 수 없습니다: {error_message}
136
+ ```
137
+
138
+ Show warning and continue translation.
139
+
140
+ ## Step 5: Output Generation
141
+
142
+ ### File Structure
143
+
144
+ ```
145
+ {output_dir}/
146
+ ├── {article_name}.md # Translated markdown
147
+ └── images/ # Downloaded images (if selected)
148
+ ├── image_001.png
149
+ └── image_002.png
150
+ ```
151
+
152
+ ### Markdown Header
153
+
154
+ ```markdown
155
+ # 번역된 제목
156
+
157
+ 원문: {original_url}
158
+ 번역일: {YYYY-MM-DD}
159
+
160
+ ---
161
+
162
+ (Body starts here)
163
+ ```
164
+
165
+ ## Edge Cases
166
+
167
+ | Scenario | Handling |
168
+ |----------|----------|
169
+ | Image URL inaccessible | Show warning, keep original URL, continue |
170
+ | Login/Paywall | Return error, stop processing |
171
+ | Document > 10,000 chars | Chunk by sections, process sequentially |
172
+ | No images | Translate text only |
173
+ | Non-English source | Translate from that language to Korean |
174
+
175
+ ## Scripts
176
+
177
+ ### download_image.sh
178
+
179
+ Downloads image URL to /tmp:
180
+
181
+ ```bash
182
+ scripts/download_image.sh "https://example.com/image.png"
183
+ # Output: /tmp/img_<hash>.png
184
+ ```
185
+
186
+ ## References
187
+
188
+ - `references/tech-terms.md` - Technical terms to keep in English
189
+
190
+ ## Limitations
191
+
192
+ - Cannot process PDF directly
193
+ - Cannot process video content
194
+ - Dynamic JS-rendered content may be missed when firecrawl cannot render it
@@ -0,0 +1,176 @@
1
+ # Technical Terms (Keep Original)
2
+
3
+ List of technical terms that should remain in English when translating to Korean.
4
+
5
+ ## Machine Learning / Deep Learning
6
+
7
+ - Transformer
8
+ - Attention
9
+ - Multi-Head Attention
10
+ - Self-Attention
11
+ - Cross-Attention
12
+ - Encoder
13
+ - Decoder
14
+ - Embedding
15
+ - Tokenizer
16
+ - Fine-tuning
17
+ - Pre-training
18
+ - Transfer Learning
19
+ - Zero-shot
20
+ - Few-shot
21
+ - In-context Learning
22
+ - Prompt
23
+ - Prompt Engineering
24
+
25
+ ## Model Architecture
26
+
27
+ - CNN (Convolutional Neural Network)
28
+ - RNN (Recurrent Neural Network)
29
+ - LSTM (Long Short-Term Memory)
30
+ - GRU (Gated Recurrent Unit)
31
+ - ResNet
32
+ - BERT
33
+ - GPT
34
+ - T5
35
+ - ViT (Vision Transformer)
36
+ - CLIP
37
+ - Diffusion
38
+ - VAE (Variational Autoencoder)
39
+ - GAN (Generative Adversarial Network)
40
+
41
+ ## Training
42
+
43
+ - Loss
44
+ - Gradient
45
+ - Backpropagation
46
+ - Optimizer
47
+ - SGD (Stochastic Gradient Descent)
48
+ - Adam
49
+ - AdamW
50
+ - Learning Rate
51
+ - Batch Size
52
+ - Epoch
53
+ - Iteration
54
+ - Checkpoint
55
+ - Early Stopping
56
+ - Regularization
57
+ - Dropout
58
+ - Batch Normalization
59
+ - Layer Normalization
60
+
61
+ ## Data
62
+
63
+ - Dataset
64
+ - Dataloader
65
+ - Preprocessing
66
+ - Augmentation
67
+ - Normalization
68
+ - Train/Val/Test Split
69
+ - Cross-validation
70
+ - Overfitting
71
+ - Underfitting
72
+ - Generalization
73
+
74
+ ## Evaluation
75
+
76
+ - Accuracy
77
+ - Precision
78
+ - Recall
79
+ - F1 Score
80
+ - AUC
81
+ - ROC
82
+ - BLEU
83
+ - ROUGE
84
+ - Perplexity
85
+ - Benchmark
86
+
87
+ ## Infrastructure
88
+
89
+ - GPU
90
+ - CUDA
91
+ - TPU
92
+ - CPU
93
+ - VRAM
94
+ - Distributed Training
95
+ - Data Parallel
96
+ - Model Parallel
97
+ - Mixed Precision
98
+ - FP16
99
+ - BF16
100
+ - Quantization
101
+
102
+ ## Frameworks & Libraries
103
+
104
+ - PyTorch
105
+ - TensorFlow
106
+ - JAX
107
+ - Hugging Face
108
+ - Transformers
109
+ - Diffusers
110
+ - Accelerate
111
+ - DeepSpeed
112
+ - FSDP
113
+ - vLLM
114
+ - TensorRT
115
+
116
+ ## APIs & Services
117
+
118
+ - API
119
+ - REST
120
+ - gRPC
121
+ - SDK
122
+ - CLI
123
+ - Endpoint
124
+ - Inference
125
+ - Serving
126
+ - Deployment
127
+
128
+ ## LLM Specific
129
+
130
+ - Context Window
131
+ - Token
132
+ - BPE (Byte Pair Encoding)
133
+ - SentencePiece
134
+ - RLHF (Reinforcement Learning from Human Feedback)
135
+ - DPO (Direct Preference Optimization)
136
+ - RAG (Retrieval Augmented Generation)
137
+ - Chain-of-Thought
138
+ - Reasoning
139
+ - Hallucination
140
+ - Grounding
141
+
142
+ ## Computer Vision
143
+
144
+ - Backbone
145
+ - Feature Extraction
146
+ - Object Detection
147
+ - Segmentation
148
+ - Classification
149
+ - Bounding Box
150
+ - IoU (Intersection over Union)
151
+ - mAP (mean Average Precision)
152
+ - OCR
153
+
154
+ ## NLP
155
+
156
+ - NER (Named Entity Recognition)
157
+ - POS Tagging
158
+ - Dependency Parsing
159
+ - Sentiment Analysis
160
+ - Text Classification
161
+ - Summarization
162
+ - Translation
163
+ - Question Answering
164
+
165
+ ## Usage Note
166
+
167
+ Keep these terms in English when translating.
168
+
169
+ Good example:
170
+ - "Transformer 모델을 Fine-tuning하여..." (O)
171
+
172
+ Bad example:
173
+ - "변환기 모델을 미세조정하여..." (X)
174
+
175
+ When context requires explanation, add Korean in parentheses:
176
+ - "Attention(주의 메커니즘)을 통해..."
@@ -0,0 +1,45 @@
1
+ #!/bin/bash
2
+ # Download image from URL to /tmp directory
3
+ # Usage: download_image.sh <image_url> [output_dir]
4
+ # Output: Prints the local file path
5
+
6
+ set -e
7
+
8
+ IMAGE_URL="$1"
9
+ OUTPUT_DIR="${2:-/tmp}"
10
+
11
+ if [ -z "$IMAGE_URL" ]; then
12
+ echo "Usage: download_image.sh <image_url> [output_dir]" >&2
13
+ exit 1
14
+ fi
15
+
16
+ # Generate hash from URL for unique filename
17
+ URL_HASH=$(echo -n "$IMAGE_URL" | md5sum | cut -d' ' -f1 | head -c 12)
18
+
19
+ # Extract extension from URL (default to png)
20
+ EXT=$(echo "$IMAGE_URL" | grep -oE '\.(png|jpe?g|gif|webp|svg)' | tail -1)
21
+ if [ -z "$EXT" ]; then
22
+ EXT=".png"
23
+ fi
24
+
25
+ # Create output directory if needed
26
+ mkdir -p "$OUTPUT_DIR"
27
+
28
+ # Generate output filename
29
+ OUTPUT_FILE="${OUTPUT_DIR}/img_${URL_HASH}${EXT}"
30
+
31
+ # Download image
32
+ if curl -sL -o "$OUTPUT_FILE" "$IMAGE_URL"; then
33
+ # Verify file is not empty
34
+ if [ -s "$OUTPUT_FILE" ]; then
35
+ echo "$OUTPUT_FILE"
36
+ exit 0
37
+ else
38
+ echo "Error: Downloaded file is empty" >&2
39
+ rm -f "$OUTPUT_FILE"
40
+ exit 1
41
+ fi
42
+ else
43
+ echo "Error: Failed to download image from $IMAGE_URL" >&2
44
+ exit 1
45
+ fi
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@yeongjaeyou/claude-code-config",
3
- "version": "0.21.2",
3
+ "version": "0.23.0",
4
4
  "description": "Claude Code CLI custom commands, agents, and skills",
5
5
  "bin": {
6
6
  "claude-code-config": "./bin/cli.js"