claude-autopm 2.8.2 → 2.8.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +399 -637
- package/package.json +2 -1
- package/packages/plugin-ai/LICENSE +21 -0
- package/packages/plugin-ai/README.md +316 -0
- package/packages/plugin-ai/agents/anthropic-claude-expert.md +579 -0
- package/packages/plugin-ai/agents/azure-openai-expert.md +1411 -0
- package/packages/plugin-ai/agents/gemini-api-expert.md +880 -0
- package/packages/plugin-ai/agents/google-a2a-expert.md +1445 -0
- package/packages/plugin-ai/agents/huggingface-expert.md +2131 -0
- package/packages/plugin-ai/agents/langchain-expert.md +1427 -0
- package/packages/plugin-ai/agents/langgraph-workflow-expert.md +520 -0
- package/packages/plugin-ai/agents/openai-python-expert.md +1087 -0
- package/packages/plugin-ai/commands/a2a-setup.md +886 -0
- package/packages/plugin-ai/commands/ai-model-deployment.md +481 -0
- package/packages/plugin-ai/commands/anthropic-optimize.md +793 -0
- package/packages/plugin-ai/commands/huggingface-deploy.md +789 -0
- package/packages/plugin-ai/commands/langchain-optimize.md +807 -0
- package/packages/plugin-ai/commands/llm-optimize.md +348 -0
- package/packages/plugin-ai/commands/openai-optimize.md +863 -0
- package/packages/plugin-ai/commands/rag-optimize.md +841 -0
- package/packages/plugin-ai/commands/rag-setup-scaffold.md +382 -0
- package/packages/plugin-ai/package.json +66 -0
- package/packages/plugin-ai/plugin.json +519 -0
- package/packages/plugin-ai/rules/ai-model-standards.md +449 -0
- package/packages/plugin-ai/rules/prompt-engineering-standards.md +509 -0
- package/packages/plugin-ai/scripts/examples/huggingface-inference-example.py +145 -0
- package/packages/plugin-ai/scripts/examples/langchain-rag-example.py +366 -0
- package/packages/plugin-ai/scripts/examples/mlflow-tracking-example.py +224 -0
- package/packages/plugin-ai/scripts/examples/openai-chat-example.py +425 -0
- package/packages/plugin-cloud/README.md +268 -0
- package/packages/plugin-cloud/agents/README.md +55 -0
- package/packages/plugin-cloud/agents/aws-cloud-architect.md +521 -0
- package/packages/plugin-cloud/agents/azure-cloud-architect.md +436 -0
- package/packages/plugin-cloud/agents/gcp-cloud-architect.md +385 -0
- package/packages/plugin-cloud/agents/gcp-cloud-functions-engineer.md +306 -0
- package/packages/plugin-cloud/agents/gemini-api-expert.md +880 -0
- package/packages/plugin-cloud/agents/kubernetes-orchestrator.md +566 -0
- package/packages/plugin-cloud/agents/openai-python-expert.md +1087 -0
- package/packages/plugin-cloud/agents/terraform-infrastructure-expert.md +454 -0
- package/packages/plugin-cloud/commands/cloud-cost-optimize.md +243 -0
- package/packages/plugin-cloud/commands/cloud-validate.md +196 -0
- package/packages/plugin-cloud/commands/infra-deploy.md +38 -0
- package/packages/plugin-cloud/commands/k8s-deploy.md +37 -0
- package/packages/plugin-cloud/commands/ssh-security.md +65 -0
- package/packages/plugin-cloud/commands/traefik-setup.md +65 -0
- package/packages/plugin-cloud/hooks/pre-cloud-deploy.js +456 -0
- package/packages/plugin-cloud/package.json +64 -0
- package/packages/plugin-cloud/plugin.json +338 -0
- package/packages/plugin-cloud/rules/cloud-security-compliance.md +313 -0
- package/packages/plugin-cloud/rules/infrastructure-pipeline.md +128 -0
- package/packages/plugin-cloud/scripts/examples/aws-validate.sh +30 -0
- package/packages/plugin-cloud/scripts/examples/azure-setup.sh +33 -0
- package/packages/plugin-cloud/scripts/examples/gcp-setup.sh +39 -0
- package/packages/plugin-cloud/scripts/examples/k8s-validate.sh +40 -0
- package/packages/plugin-cloud/scripts/examples/terraform-init.sh +26 -0
- package/packages/plugin-core/README.md +274 -0
- package/packages/plugin-core/agents/core/agent-manager.md +296 -0
- package/packages/plugin-core/agents/core/code-analyzer.md +131 -0
- package/packages/plugin-core/agents/core/file-analyzer.md +162 -0
- package/packages/plugin-core/agents/core/test-runner.md +200 -0
- package/packages/plugin-core/commands/code-rabbit.md +128 -0
- package/packages/plugin-core/commands/prompt.md +9 -0
- package/packages/plugin-core/commands/re-init.md +9 -0
- package/packages/plugin-core/hooks/context7-reminder.md +29 -0
- package/packages/plugin-core/hooks/enforce-agents.js +125 -0
- package/packages/plugin-core/hooks/enforce-agents.sh +35 -0
- package/packages/plugin-core/hooks/pre-agent-context7.js +224 -0
- package/packages/plugin-core/hooks/pre-command-context7.js +229 -0
- package/packages/plugin-core/hooks/strict-enforce-agents.sh +39 -0
- package/packages/plugin-core/hooks/test-hook.sh +21 -0
- package/packages/plugin-core/hooks/unified-context7-enforcement.sh +38 -0
- package/packages/plugin-core/package.json +45 -0
- package/packages/plugin-core/plugin.json +387 -0
- package/packages/plugin-core/rules/agent-coordination.md +549 -0
- package/packages/plugin-core/rules/agent-mandatory.md +170 -0
- package/packages/plugin-core/rules/ai-integration-patterns.md +219 -0
- package/packages/plugin-core/rules/command-pipelines.md +208 -0
- package/packages/plugin-core/rules/context-optimization.md +176 -0
- package/packages/plugin-core/rules/context7-enforcement.md +327 -0
- package/packages/plugin-core/rules/datetime.md +122 -0
- package/packages/plugin-core/rules/definition-of-done.md +272 -0
- package/packages/plugin-core/rules/development-environments.md +19 -0
- package/packages/plugin-core/rules/development-workflow.md +198 -0
- package/packages/plugin-core/rules/framework-path-rules.md +180 -0
- package/packages/plugin-core/rules/frontmatter-operations.md +64 -0
- package/packages/plugin-core/rules/git-strategy.md +237 -0
- package/packages/plugin-core/rules/golden-rules.md +181 -0
- package/packages/plugin-core/rules/naming-conventions.md +111 -0
- package/packages/plugin-core/rules/no-pr-workflow.md +183 -0
- package/packages/plugin-core/rules/performance-guidelines.md +403 -0
- package/packages/plugin-core/rules/pipeline-mandatory.md +109 -0
- package/packages/plugin-core/rules/security-checklist.md +318 -0
- package/packages/plugin-core/rules/standard-patterns.md +197 -0
- package/packages/plugin-core/rules/strip-frontmatter.md +85 -0
- package/packages/plugin-core/rules/tdd.enforcement.md +103 -0
- package/packages/plugin-core/rules/use-ast-grep.md +113 -0
- package/packages/plugin-core/scripts/lib/datetime-utils.sh +254 -0
- package/packages/plugin-core/scripts/lib/frontmatter-utils.sh +294 -0
- package/packages/plugin-core/scripts/lib/github-utils.sh +221 -0
- package/packages/plugin-core/scripts/lib/logging-utils.sh +199 -0
- package/packages/plugin-core/scripts/lib/validation-utils.sh +339 -0
- package/packages/plugin-core/scripts/mcp/add.sh +7 -0
- package/packages/plugin-core/scripts/mcp/disable.sh +12 -0
- package/packages/plugin-core/scripts/mcp/enable.sh +12 -0
- package/packages/plugin-core/scripts/mcp/list.sh +7 -0
- package/packages/plugin-core/scripts/mcp/sync.sh +8 -0
- package/packages/plugin-data/README.md +315 -0
- package/packages/plugin-data/agents/airflow-orchestration-expert.md +158 -0
- package/packages/plugin-data/agents/kedro-pipeline-expert.md +304 -0
- package/packages/plugin-data/agents/langgraph-workflow-expert.md +530 -0
- package/packages/plugin-data/commands/airflow-dag-scaffold.md +413 -0
- package/packages/plugin-data/commands/kafka-pipeline-scaffold.md +503 -0
- package/packages/plugin-data/package.json +66 -0
- package/packages/plugin-data/plugin.json +294 -0
- package/packages/plugin-data/rules/data-quality-standards.md +373 -0
- package/packages/plugin-data/rules/etl-pipeline-standards.md +255 -0
- package/packages/plugin-data/scripts/examples/airflow-dag-example.py +245 -0
- package/packages/plugin-data/scripts/examples/dbt-transform-example.sql +238 -0
- package/packages/plugin-data/scripts/examples/kafka-streaming-example.py +257 -0
- package/packages/plugin-data/scripts/examples/pandas-etl-example.py +332 -0
- package/packages/plugin-databases/README.md +330 -0
- package/packages/plugin-databases/agents/README.md +50 -0
- package/packages/plugin-databases/agents/bigquery-expert.md +401 -0
- package/packages/plugin-databases/agents/cosmosdb-expert.md +375 -0
- package/packages/plugin-databases/agents/mongodb-expert.md +407 -0
- package/packages/plugin-databases/agents/postgresql-expert.md +329 -0
- package/packages/plugin-databases/agents/redis-expert.md +74 -0
- package/packages/plugin-databases/commands/db-optimize.md +612 -0
- package/packages/plugin-databases/package.json +60 -0
- package/packages/plugin-databases/plugin.json +237 -0
- package/packages/plugin-databases/rules/database-management-strategy.md +146 -0
- package/packages/plugin-databases/rules/database-pipeline.md +316 -0
- package/packages/plugin-databases/scripts/examples/bigquery-cost-analyze.sh +160 -0
- package/packages/plugin-databases/scripts/examples/cosmosdb-ru-optimize.sh +163 -0
- package/packages/plugin-databases/scripts/examples/mongodb-shard-check.sh +120 -0
- package/packages/plugin-databases/scripts/examples/postgres-index-analyze.sh +95 -0
- package/packages/plugin-databases/scripts/examples/redis-cache-stats.sh +121 -0
- package/packages/plugin-devops/README.md +367 -0
- package/packages/plugin-devops/agents/README.md +52 -0
- package/packages/plugin-devops/agents/azure-devops-specialist.md +308 -0
- package/packages/plugin-devops/agents/docker-containerization-expert.md +298 -0
- package/packages/plugin-devops/agents/github-operations-specialist.md +335 -0
- package/packages/plugin-devops/agents/mcp-context-manager.md +319 -0
- package/packages/plugin-devops/agents/observability-engineer.md +574 -0
- package/packages/plugin-devops/agents/ssh-operations-expert.md +1093 -0
- package/packages/plugin-devops/agents/traefik-proxy-expert.md +444 -0
- package/packages/plugin-devops/commands/ci-pipeline-create.md +581 -0
- package/packages/plugin-devops/commands/docker-optimize.md +493 -0
- package/packages/plugin-devops/commands/workflow-create.md +42 -0
- package/packages/plugin-devops/hooks/pre-docker-build.js +472 -0
- package/packages/plugin-devops/package.json +61 -0
- package/packages/plugin-devops/plugin.json +302 -0
- package/packages/plugin-devops/rules/ci-cd-kubernetes-strategy.md +25 -0
- package/packages/plugin-devops/rules/devops-troubleshooting-playbook.md +450 -0
- package/packages/plugin-devops/rules/docker-first-development.md +404 -0
- package/packages/plugin-devops/rules/github-operations.md +92 -0
- package/packages/plugin-devops/scripts/examples/docker-build-multistage.sh +43 -0
- package/packages/plugin-devops/scripts/examples/docker-compose-validate.sh +74 -0
- package/packages/plugin-devops/scripts/examples/github-workflow-validate.sh +48 -0
- package/packages/plugin-devops/scripts/examples/prometheus-health-check.sh +58 -0
- package/packages/plugin-devops/scripts/examples/ssh-key-setup.sh +74 -0
- package/packages/plugin-frameworks/README.md +309 -0
- package/packages/plugin-frameworks/agents/README.md +64 -0
- package/packages/plugin-frameworks/agents/e2e-test-engineer.md +579 -0
- package/packages/plugin-frameworks/agents/nats-messaging-expert.md +254 -0
- package/packages/plugin-frameworks/agents/react-frontend-engineer.md +393 -0
- package/packages/plugin-frameworks/agents/react-ui-expert.md +226 -0
- package/packages/plugin-frameworks/agents/tailwindcss-expert.md +1021 -0
- package/packages/plugin-frameworks/agents/ux-design-expert.md +244 -0
- package/packages/plugin-frameworks/commands/app-scaffold.md +50 -0
- package/packages/plugin-frameworks/commands/nextjs-optimize.md +692 -0
- package/packages/plugin-frameworks/commands/react-optimize.md +583 -0
- package/packages/plugin-frameworks/commands/tailwind-system.md +64 -0
- package/packages/plugin-frameworks/package.json +59 -0
- package/packages/plugin-frameworks/plugin.json +224 -0
- package/packages/plugin-frameworks/rules/performance-guidelines.md +403 -0
- package/packages/plugin-frameworks/rules/ui-development-standards.md +281 -0
- package/packages/plugin-frameworks/rules/ui-framework-rules.md +151 -0
- package/packages/plugin-frameworks/scripts/examples/react-component-perf.sh +34 -0
- package/packages/plugin-frameworks/scripts/examples/tailwind-optimize.sh +44 -0
- package/packages/plugin-frameworks/scripts/examples/vue-composition-check.sh +41 -0
- package/packages/plugin-languages/README.md +333 -0
- package/packages/plugin-languages/agents/README.md +50 -0
- package/packages/plugin-languages/agents/bash-scripting-expert.md +541 -0
- package/packages/plugin-languages/agents/javascript-frontend-engineer.md +197 -0
- package/packages/plugin-languages/agents/nodejs-backend-engineer.md +226 -0
- package/packages/plugin-languages/agents/python-backend-engineer.md +214 -0
- package/packages/plugin-languages/agents/python-backend-expert.md +289 -0
- package/packages/plugin-languages/commands/javascript-optimize.md +636 -0
- package/packages/plugin-languages/commands/nodejs-api-scaffold.md +341 -0
- package/packages/plugin-languages/commands/nodejs-optimize.md +689 -0
- package/packages/plugin-languages/commands/python-api-scaffold.md +261 -0
- package/packages/plugin-languages/commands/python-optimize.md +593 -0
- package/packages/plugin-languages/package.json +65 -0
- package/packages/plugin-languages/plugin.json +265 -0
- package/packages/plugin-languages/rules/code-quality-standards.md +496 -0
- package/packages/plugin-languages/rules/testing-standards.md +768 -0
- package/packages/plugin-languages/scripts/examples/bash-production-script.sh +520 -0
- package/packages/plugin-languages/scripts/examples/javascript-es6-patterns.js +291 -0
- package/packages/plugin-languages/scripts/examples/nodejs-async-iteration.js +360 -0
- package/packages/plugin-languages/scripts/examples/python-async-patterns.py +289 -0
- package/packages/plugin-languages/scripts/examples/typescript-patterns.ts +432 -0
- package/packages/plugin-ml/README.md +430 -0
- package/packages/plugin-ml/agents/automl-expert.md +326 -0
- package/packages/plugin-ml/agents/computer-vision-expert.md +550 -0
- package/packages/plugin-ml/agents/gradient-boosting-expert.md +455 -0
- package/packages/plugin-ml/agents/neural-network-architect.md +1228 -0
- package/packages/plugin-ml/agents/nlp-transformer-expert.md +584 -0
- package/packages/plugin-ml/agents/pytorch-expert.md +412 -0
- package/packages/plugin-ml/agents/reinforcement-learning-expert.md +2088 -0
- package/packages/plugin-ml/agents/scikit-learn-expert.md +228 -0
- package/packages/plugin-ml/agents/tensorflow-keras-expert.md +509 -0
- package/packages/plugin-ml/agents/time-series-expert.md +303 -0
- package/packages/plugin-ml/commands/ml-automl.md +572 -0
- package/packages/plugin-ml/commands/ml-train-optimize.md +657 -0
- package/packages/plugin-ml/package.json +52 -0
- package/packages/plugin-ml/plugin.json +338 -0
- package/packages/plugin-pm/README.md +368 -0
- package/packages/plugin-pm/claudeautopm-plugin-pm-2.0.0.tgz +0 -0
- package/packages/plugin-pm/commands/azure/COMMANDS.md +107 -0
- package/packages/plugin-pm/commands/azure/COMMAND_MAPPING.md +252 -0
- package/packages/plugin-pm/commands/azure/INTEGRATION_FIX.md +103 -0
- package/packages/plugin-pm/commands/azure/README.md +246 -0
- package/packages/plugin-pm/commands/azure/active-work.md +198 -0
- package/packages/plugin-pm/commands/azure/aliases.md +143 -0
- package/packages/plugin-pm/commands/azure/blocked-items.md +287 -0
- package/packages/plugin-pm/commands/azure/clean.md +93 -0
- package/packages/plugin-pm/commands/azure/docs-query.md +48 -0
- package/packages/plugin-pm/commands/azure/feature-decompose.md +380 -0
- package/packages/plugin-pm/commands/azure/feature-list.md +61 -0
- package/packages/plugin-pm/commands/azure/feature-new.md +115 -0
- package/packages/plugin-pm/commands/azure/feature-show.md +205 -0
- package/packages/plugin-pm/commands/azure/feature-start.md +130 -0
- package/packages/plugin-pm/commands/azure/fix-integration-example.md +93 -0
- package/packages/plugin-pm/commands/azure/help.md +150 -0
- package/packages/plugin-pm/commands/azure/import-us.md +269 -0
- package/packages/plugin-pm/commands/azure/init.md +211 -0
- package/packages/plugin-pm/commands/azure/next-task.md +262 -0
- package/packages/plugin-pm/commands/azure/search.md +160 -0
- package/packages/plugin-pm/commands/azure/sprint-status.md +235 -0
- package/packages/plugin-pm/commands/azure/standup.md +260 -0
- package/packages/plugin-pm/commands/azure/sync-all.md +99 -0
- package/packages/plugin-pm/commands/azure/task-analyze.md +186 -0
- package/packages/plugin-pm/commands/azure/task-close.md +329 -0
- package/packages/plugin-pm/commands/azure/task-edit.md +145 -0
- package/packages/plugin-pm/commands/azure/task-list.md +263 -0
- package/packages/plugin-pm/commands/azure/task-new.md +84 -0
- package/packages/plugin-pm/commands/azure/task-reopen.md +79 -0
- package/packages/plugin-pm/commands/azure/task-show.md +126 -0
- package/packages/plugin-pm/commands/azure/task-start.md +301 -0
- package/packages/plugin-pm/commands/azure/task-status.md +65 -0
- package/packages/plugin-pm/commands/azure/task-sync.md +67 -0
- package/packages/plugin-pm/commands/azure/us-edit.md +164 -0
- package/packages/plugin-pm/commands/azure/us-list.md +202 -0
- package/packages/plugin-pm/commands/azure/us-new.md +265 -0
- package/packages/plugin-pm/commands/azure/us-parse.md +253 -0
- package/packages/plugin-pm/commands/azure/us-show.md +188 -0
- package/packages/plugin-pm/commands/azure/us-status.md +320 -0
- package/packages/plugin-pm/commands/azure/validate.md +86 -0
- package/packages/plugin-pm/commands/azure/work-item-sync.md +47 -0
- package/packages/plugin-pm/commands/blocked.md +28 -0
- package/packages/plugin-pm/commands/clean.md +119 -0
- package/packages/plugin-pm/commands/context-create.md +136 -0
- package/packages/plugin-pm/commands/context-prime.md +170 -0
- package/packages/plugin-pm/commands/context-update.md +292 -0
- package/packages/plugin-pm/commands/context.md +28 -0
- package/packages/plugin-pm/commands/epic-close.md +86 -0
- package/packages/plugin-pm/commands/epic-decompose.md +370 -0
- package/packages/plugin-pm/commands/epic-edit.md +83 -0
- package/packages/plugin-pm/commands/epic-list.md +30 -0
- package/packages/plugin-pm/commands/epic-merge.md +222 -0
- package/packages/plugin-pm/commands/epic-oneshot.md +119 -0
- package/packages/plugin-pm/commands/epic-refresh.md +119 -0
- package/packages/plugin-pm/commands/epic-show.md +28 -0
- package/packages/plugin-pm/commands/epic-split.md +120 -0
- package/packages/plugin-pm/commands/epic-start.md +195 -0
- package/packages/plugin-pm/commands/epic-status.md +28 -0
- package/packages/plugin-pm/commands/epic-sync-modular.md +338 -0
- package/packages/plugin-pm/commands/epic-sync-original.md +473 -0
- package/packages/plugin-pm/commands/epic-sync.md +486 -0
- package/packages/plugin-pm/commands/github/workflow-create.md +42 -0
- package/packages/plugin-pm/commands/help.md +28 -0
- package/packages/plugin-pm/commands/import.md +115 -0
- package/packages/plugin-pm/commands/in-progress.md +28 -0
- package/packages/plugin-pm/commands/init.md +28 -0
- package/packages/plugin-pm/commands/issue-analyze.md +202 -0
- package/packages/plugin-pm/commands/issue-close.md +119 -0
- package/packages/plugin-pm/commands/issue-edit.md +93 -0
- package/packages/plugin-pm/commands/issue-reopen.md +87 -0
- package/packages/plugin-pm/commands/issue-show.md +41 -0
- package/packages/plugin-pm/commands/issue-start.md +234 -0
- package/packages/plugin-pm/commands/issue-status.md +95 -0
- package/packages/plugin-pm/commands/issue-sync.md +411 -0
- package/packages/plugin-pm/commands/next.md +28 -0
- package/packages/plugin-pm/commands/prd-edit.md +82 -0
- package/packages/plugin-pm/commands/prd-list.md +28 -0
- package/packages/plugin-pm/commands/prd-new.md +55 -0
- package/packages/plugin-pm/commands/prd-parse.md +42 -0
- package/packages/plugin-pm/commands/prd-status.md +28 -0
- package/packages/plugin-pm/commands/search.md +28 -0
- package/packages/plugin-pm/commands/standup.md +28 -0
- package/packages/plugin-pm/commands/status.md +28 -0
- package/packages/plugin-pm/commands/sync.md +99 -0
- package/packages/plugin-pm/commands/test-reference-update.md +151 -0
- package/packages/plugin-pm/commands/validate.md +28 -0
- package/packages/plugin-pm/commands/what-next.md +28 -0
- package/packages/plugin-pm/package.json +57 -0
- package/packages/plugin-pm/plugin.json +503 -0
- package/packages/plugin-pm/scripts/pm/analytics.js +425 -0
- package/packages/plugin-pm/scripts/pm/blocked.js +164 -0
- package/packages/plugin-pm/scripts/pm/blocked.sh +78 -0
- package/packages/plugin-pm/scripts/pm/clean.js +464 -0
- package/packages/plugin-pm/scripts/pm/context-create.js +216 -0
- package/packages/plugin-pm/scripts/pm/context-prime.js +335 -0
- package/packages/plugin-pm/scripts/pm/context-update.js +344 -0
- package/packages/plugin-pm/scripts/pm/context.js +338 -0
- package/packages/plugin-pm/scripts/pm/epic-close.js +347 -0
- package/packages/plugin-pm/scripts/pm/epic-edit.js +382 -0
- package/packages/plugin-pm/scripts/pm/epic-list.js +273 -0
- package/packages/plugin-pm/scripts/pm/epic-list.sh +109 -0
- package/packages/plugin-pm/scripts/pm/epic-show.js +291 -0
- package/packages/plugin-pm/scripts/pm/epic-show.sh +105 -0
- package/packages/plugin-pm/scripts/pm/epic-split.js +522 -0
- package/packages/plugin-pm/scripts/pm/epic-start/epic-start.js +183 -0
- package/packages/plugin-pm/scripts/pm/epic-start/epic-start.sh +94 -0
- package/packages/plugin-pm/scripts/pm/epic-status.js +291 -0
- package/packages/plugin-pm/scripts/pm/epic-status.sh +104 -0
- package/packages/plugin-pm/scripts/pm/epic-sync/README.md +208 -0
- package/packages/plugin-pm/scripts/pm/epic-sync/create-epic-issue.sh +77 -0
- package/packages/plugin-pm/scripts/pm/epic-sync/create-task-issues.sh +86 -0
- package/packages/plugin-pm/scripts/pm/epic-sync/update-epic-file.sh +79 -0
- package/packages/plugin-pm/scripts/pm/epic-sync/update-references.sh +89 -0
- package/packages/plugin-pm/scripts/pm/epic-sync.sh +137 -0
- package/packages/plugin-pm/scripts/pm/help.js +92 -0
- package/packages/plugin-pm/scripts/pm/help.sh +90 -0
- package/packages/plugin-pm/scripts/pm/in-progress.js +178 -0
- package/packages/plugin-pm/scripts/pm/in-progress.sh +93 -0
- package/packages/plugin-pm/scripts/pm/init.js +321 -0
- package/packages/plugin-pm/scripts/pm/init.sh +178 -0
- package/packages/plugin-pm/scripts/pm/issue-close.js +232 -0
- package/packages/plugin-pm/scripts/pm/issue-edit.js +310 -0
- package/packages/plugin-pm/scripts/pm/issue-show.js +272 -0
- package/packages/plugin-pm/scripts/pm/issue-start.js +181 -0
- package/packages/plugin-pm/scripts/pm/issue-sync/format-comment.sh +468 -0
- package/packages/plugin-pm/scripts/pm/issue-sync/gather-updates.sh +460 -0
- package/packages/plugin-pm/scripts/pm/issue-sync/post-comment.sh +330 -0
- package/packages/plugin-pm/scripts/pm/issue-sync/preflight-validation.sh +348 -0
- package/packages/plugin-pm/scripts/pm/issue-sync/update-frontmatter.sh +387 -0
- package/packages/plugin-pm/scripts/pm/lib/README.md +85 -0
- package/packages/plugin-pm/scripts/pm/lib/epic-discovery.js +119 -0
- package/packages/plugin-pm/scripts/pm/lib/logger.js +78 -0
- package/packages/plugin-pm/scripts/pm/next.js +189 -0
- package/packages/plugin-pm/scripts/pm/next.sh +72 -0
- package/packages/plugin-pm/scripts/pm/optimize.js +407 -0
- package/packages/plugin-pm/scripts/pm/pr-create.js +337 -0
- package/packages/plugin-pm/scripts/pm/pr-list.js +257 -0
- package/packages/plugin-pm/scripts/pm/prd-list.js +242 -0
- package/packages/plugin-pm/scripts/pm/prd-list.sh +103 -0
- package/packages/plugin-pm/scripts/pm/prd-new.js +684 -0
- package/packages/plugin-pm/scripts/pm/prd-parse.js +547 -0
- package/packages/plugin-pm/scripts/pm/prd-status.js +152 -0
- package/packages/plugin-pm/scripts/pm/prd-status.sh +63 -0
- package/packages/plugin-pm/scripts/pm/release.js +460 -0
- package/packages/plugin-pm/scripts/pm/search.js +192 -0
- package/packages/plugin-pm/scripts/pm/search.sh +89 -0
- package/packages/plugin-pm/scripts/pm/standup.js +362 -0
- package/packages/plugin-pm/scripts/pm/standup.sh +95 -0
- package/packages/plugin-pm/scripts/pm/status.js +148 -0
- package/packages/plugin-pm/scripts/pm/status.sh +59 -0
- package/packages/plugin-pm/scripts/pm/sync-batch.js +337 -0
- package/packages/plugin-pm/scripts/pm/sync.js +343 -0
- package/packages/plugin-pm/scripts/pm/template-list.js +141 -0
- package/packages/plugin-pm/scripts/pm/template-new.js +366 -0
- package/packages/plugin-pm/scripts/pm/validate.js +274 -0
- package/packages/plugin-pm/scripts/pm/validate.sh +106 -0
- package/packages/plugin-pm/scripts/pm/what-next.js +660 -0
- package/packages/plugin-testing/README.md +401 -0
- package/packages/plugin-testing/agents/frontend-testing-engineer.md +768 -0
- package/packages/plugin-testing/commands/jest-optimize.md +800 -0
- package/packages/plugin-testing/commands/playwright-optimize.md +887 -0
- package/packages/plugin-testing/commands/test-coverage.md +512 -0
- package/packages/plugin-testing/commands/test-performance.md +1041 -0
- package/packages/plugin-testing/commands/test-setup.md +414 -0
- package/packages/plugin-testing/package.json +40 -0
- package/packages/plugin-testing/plugin.json +197 -0
- package/packages/plugin-testing/rules/test-coverage-requirements.md +581 -0
- package/packages/plugin-testing/rules/testing-standards.md +529 -0
- package/packages/plugin-testing/scripts/examples/react-testing-example.test.jsx +460 -0
- package/packages/plugin-testing/scripts/examples/vitest-config-example.js +352 -0
- package/packages/plugin-testing/scripts/examples/vue-testing-example.test.js +586 -0
@@ -0,0 +1,793 @@

---
allowed-tools: Bash, Read, Write, LS
---

# anthropic:optimize

Optimize Anthropic Claude API usage with Context7-verified prompt caching, model selection, streaming, and cost reduction strategies.

## Description

Comprehensive Claude API optimization following official Anthropic best practices:
- Prompt caching for 70-90% cost reduction
- Optimal model selection (Opus, Sonnet, Haiku)
- Streaming responses for better UX
- Extended thinking for complex reasoning
- Batch processing strategies
- Token usage optimization
- Rate limiting and retry logic

## Required Documentation Access

**MANDATORY:** Before optimization, query Context7 for Anthropic best practices:

**Documentation Queries:**
- `mcp://context7/anthropic/anthropic-sdk-python` - Python SDK patterns and best practices
- `mcp://context7/anthropic/prompt-caching` - Prompt caching for 70-90% cost reduction
- `mcp://context7/anthropic/streaming` - Streaming response optimization
- `mcp://context7/anthropic/extended-thinking` - Extended thinking mode for reasoning
- `mcp://context7/anthropic/model-selection` - Choosing Opus, Sonnet, or Haiku
- `mcp://context7/anthropic/batching` - Batch processing strategies

**Why This is Required:**
- Ensures optimization follows official Anthropic documentation
- Applies proven prompt caching patterns for massive cost reduction
- Validates model selection strategies
- Prevents API quota exhaustion
- Optimizes token usage and costs
- Implements proper streaming for better UX

## Usage

```bash
/anthropic:optimize [options]
```

## Options

- `--scope <caching|model|streaming|thinking|all>` - Optimization scope (default: all)
- `--analyze-only` - Analyze without applying changes
- `--output <file>` - Write optimization report
- `--model <opus|sonnet|haiku>` - Target model for optimization

## Examples

### Full Anthropic Optimization
```bash
/anthropic:optimize
```

### Prompt Caching Only
```bash
/anthropic:optimize --scope caching
```

### Model Selection Optimization
```bash
/anthropic:optimize --scope model
```

### Analyze Current Usage
```bash
/anthropic:optimize --analyze-only --output anthropic-report.md
```

## Optimization Categories

### 1. Prompt Caching (Context7-Verified)

**Pattern from Context7 (/anthropic/anthropic-sdk-python):**

#### Basic Prompt Caching
```python
import anthropic

client = anthropic.Anthropic()

# Cache static context for reuse (book_text is loaded elsewhere)
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works."
        },
        {
            "type": "text",
            "text": "Here is the full text of Pride and Prejudice:\n<book>" + book_text + "</book>",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{
        "role": "user",
        "content": "Analyze the relationship between Mr. Darcy and Elizabeth"
    }]
)

print(response.usage)
# Usage(input_tokens=150, cache_creation_input_tokens=50000, cache_read_input_tokens=0, output_tokens=300)

# Second request reuses the cache
response2 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works."
        },
        {
            "type": "text",
            "text": "Here is the full text of Pride and Prejudice:\n<book>" + book_text + "</book>",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{
        "role": "user",
        "content": "What are the major themes in this book?"
    }]
)

print(response2.usage)
# Usage(input_tokens=150, cache_creation_input_tokens=0, cache_read_input_tokens=50000, output_tokens=250)
```

**Cost Impact:**
- Input tokens: $3 per MTok (Sonnet 3.5)
- Cache writes: $3.75 per MTok (25% markup)
- Cache reads: $0.30 per MTok (90% discount)

**Example Calculation:**
- First request: 150 input + 50,000 cache write = $0.00045 + $0.1875 = $0.18795
- Second request: 150 input + 50,000 cache read = $0.00045 + $0.015 = $0.01545
- Savings per cached request: 92% reduction
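The arithmetic above can be checked with a short script. This is a minimal sketch using the per-MTok rates listed under Cost Impact; the `request_cost` helper is illustrative, not part of any SDK, and output token costs are deliberately ignored.

```python
# Published Claude 3.5 Sonnet rates, in dollars per million tokens (MTok)
INPUT_PER_MTOK = 3.00        # uncached input tokens
CACHE_WRITE_PER_MTOK = 3.75  # 25% markup on cache creation
CACHE_READ_PER_MTOK = 0.30   # 90% discount on cache hits

def request_cost(input_tokens, cache_write_tokens=0, cache_read_tokens=0):
    """Dollar cost of one request's input side (output tokens excluded)."""
    return (input_tokens * INPUT_PER_MTOK
            + cache_write_tokens * CACHE_WRITE_PER_MTOK
            + cache_read_tokens * CACHE_READ_PER_MTOK) / 1_000_000

first = request_cost(150, cache_write_tokens=50_000)   # cache is created
repeat = request_cost(150, cache_read_tokens=50_000)   # cache is reused
print(f"first=${first:.5f} repeat=${repeat:.5f} saving={1 - repeat / first:.0%}")
# first=$0.18795 repeat=$0.01545 saving=92%
```

The 92% figure compares the cached repeat against the cache-creating first request; the saving versus sending the 50,000 tokens uncached every time is close to 90%, matching the cache-read discount.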
|
|
145
|
+
|
|
146
|
+
**Performance Impact:**
|
|
147
|
+
- Cache lifetime: 5 minutes of inactivity
|
|
148
|
+
- Minimum cacheable tokens: 1024 tokens
|
|
149
|
+
- Maximum cache size: 4 cache breakpoints per request
|
|
150
|
+
- 90% cost reduction for cache reads
|
|
151
|
+
|
|
152
|
+
#### Multi-Turn Conversation Caching
```python
# Agentic tool use with caching
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather in a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    },
    # ... 20+ more tools
]

message_list = [
    {"role": "user", "content": "What's the weather in San Francisco?"}
]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=message_list,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to weather and other tools.",
            "cache_control": {"type": "ephemeral"}
        }
    ]
)

# Tools and system prompt are cached
# Subsequent tool calls reuse cache → 90% cheaper
```

**Benefits:**
- Agentic workflows: Cache tool definitions across turns
- Long documents: Cache book/docs once, query many times
- Detailed instructions: Cache complex system prompts
- Code repositories: Cache codebase context

**Best Practices:**
- Place `cache_control` at the END of the content to cache
- Cache blocks of ≥1024 tokens for cost effectiveness
- Structure prompts with static content first
- Use for repetitive queries against the same context
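
These practices are easiest to verify empirically. A tiny helper, sketched here against the token fields shown in the earlier `usage` output (the field names are taken from that example, not from a guaranteed SDK schema), reports how much of a prompt was served from cache:

```python
def cache_hit_ratio(input_tokens: int, cache_read_tokens: int) -> float:
    """Fraction of prompt tokens served from the cache (0.0 if nothing cached)."""
    total = input_tokens + cache_read_tokens
    return cache_read_tokens / total if total else 0.0

# Second request from the earlier example: 150 uncached + 50,000 cached tokens
print(round(cache_hit_ratio(150, 50_000), 3))  # → 0.997
```

A ratio well below 1.0 on repeat requests usually means the cache breakpoint sits before the bulk of the static content.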

### 2. Model Selection (Context7-Verified)

**Pattern from Context7 (/anthropic/anthropic-sdk-python):**

#### Intelligent Model Routing
```python
# Model characteristics (as of 2025)
MODELS = {
    "claude-opus-4-20250514": {
        "cost_input": 15.00,   # per MTok
        "cost_output": 75.00,  # per MTok
        "intelligence": "highest",
        "speed": "slowest",
        "use_cases": ["complex reasoning", "advanced coding", "nuanced analysis"]
    },
    "claude-3-5-sonnet-20241022": {
        "cost_input": 3.00,
        "cost_output": 15.00,
        "intelligence": "high",
        "speed": "fast",
        "use_cases": ["general tasks", "code generation", "balanced performance"]
    },
    "claude-3-5-haiku-20241022": {
        "cost_input": 0.80,
        "cost_output": 4.00,
        "intelligence": "good",
        "speed": "fastest",
        "use_cases": ["simple tasks", "classification", "high-volume processing"]
    }
}

def select_model(task_complexity: str, volume: str) -> str:
    """Route to the optimal model based on task requirements."""
    if task_complexity == "high" and volume == "low":
        return "claude-opus-4-20250514"  # Best quality
    elif task_complexity == "medium" or volume == "medium":
        return "claude-3-5-sonnet-20241022"  # Balanced
    elif task_complexity == "low" or volume == "high":
        return "claude-3-5-haiku-20241022"  # Fast and cheap
    # Default to Sonnet for unknown cases
    return "claude-3-5-sonnet-20241022"

# Example usage
task = {
    "complexity": "low",  # Simple classification
    "volume": "high",     # 10,000 requests/day
    "type": "sentiment_analysis"
}

model = select_model(task["complexity"], task["volume"])
# Returns: claude-3-5-haiku-20241022

response = client.messages.create(
    model=model,
    max_tokens=50,
    messages=[{
        "role": "user",
        "content": "Classify sentiment: 'I love this product!'"
    }]
)
```

**Cost Comparison** (10,000 requests/day, ~100 input + ~50 output tokens each):
- Opus: $15.00 input + $37.50 output = $52.50/day
- Sonnet: $3.00 + $7.50 = $10.50/day
- Haiku: $0.80 + $2.00 = $2.80/day

**Recommendation:** Use Haiku for the 80% of tasks that are simple → ~94% cost reduction vs Opus
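
The comparison follows directly from the per-MTok rates in the `MODELS` table; the per-request token counts (~100 input, ~50 output) are assumptions. A quick sketch of the arithmetic:

```python
def daily_cost(requests: int, in_tok: int, out_tok: int,
               rate_in: float, rate_out: float) -> float:
    """USD per day; rates are per million tokens (from the MODELS table above)."""
    return requests * (in_tok * rate_in + out_tok * rate_out) / 1_000_000

# 10,000 requests/day, ~100 input + ~50 output tokens each
for name, r_in, r_out in [("Opus", 15.00, 75.00),
                          ("Sonnet", 3.00, 15.00),
                          ("Haiku", 0.80, 4.00)]:
    print(f"{name}: ${daily_cost(10_000, 100, 50, r_in, r_out):.2f}/day")
# → Opus: $52.50/day, Sonnet: $10.50/day, Haiku: $2.80/day
```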

### 3. Streaming Optimization (Context7-Verified)

**Pattern from Context7 (/anthropic/anthropic-sdk-python):**

#### Text Streaming
```python
from anthropic import Anthropic

client = Anthropic()

# Stream the response for better UX
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Write a long essay about artificial intelligence"
    }]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Access final message and usage
    message = stream.get_final_message()
    print(f"\n\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
```

**Benefits:**
- Time to first token: ~500ms vs ~5s for the full response
- Progressive rendering for better UX
- Lower perceived latency
- Real-time feedback to users

#### Streaming with Event Handlers
```python
from anthropic import Anthropic, AsyncAnthropic

client = Anthropic()

# Custom event handlers
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Explain quantum computing"
    }]
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            print(f"\n[Block {event.index} started]")
        elif event.type == "content_block_delta":
            print(event.delta.text, end="", flush=True)
        elif event.type == "message_stop":
            print("\n[Message complete]")

# Async streaming (requires the async client, not the sync Anthropic client)
import asyncio

async_client = AsyncAnthropic()

async def async_stream():
    async with async_client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Write a story"
        }]
    ) as stream:
        async for text in stream.text_stream:
            print(text, end="", flush=True)

asyncio.run(async_stream())
```

**Performance Impact:**
- Non-streaming: full response in ~5s, nothing shown until complete
- Streaming: first token in ~500ms, progressive display (10x better perceived latency)

### 4. Extended Thinking (Context7-Verified)

**Pattern from Context7 (/anthropic/anthropic-sdk-python):**

#### Extended Thinking Mode
```python
# Enable extended thinking for complex reasoning
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # Models with extended thinking
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Allocate tokens for internal reasoning
    },
    messages=[{
        "role": "user",
        "content": """Solve this complex problem:
A train leaves Station A at 10am going 60mph.
Another train leaves Station B (300 miles away) at 11am going 80mph toward Station A.
A bird flies at 100mph between the trains until they meet.
How far does the bird travel total?"""
    }]
)

# Access thinking content
for content_block in response.content:
    if content_block.type == "thinking":
        print(f"Internal reasoning:\n{content_block.thinking}")
    elif content_block.type == "text":
        print(f"Final answer:\n{content_block.text}")

# Example output:
# Internal reasoning:
# Let me work through this step by step...
# [detailed reasoning process]
#
# Final answer:
# The bird travels about 171 miles: at 11am the 240-mile gap closes at 140mph
# (12/7 hours), and the bird flies at 100mph the whole time.
```

**When to Use Extended Thinking:**
- Complex mathematical problems
- Multi-step reasoning tasks
- Code debugging and analysis
- Strategic planning
- Advanced problem-solving

**Benefits:**
- Significantly improved accuracy on hard problems
- Transparent reasoning process
- Better handling of edge cases
- Reduced hallucinations

**Cost Considerations:**
- Thinking tokens are billed as output tokens
- Budget 5,000-10,000 tokens for thinking
- Only use for tasks requiring deep reasoning
- Not needed for simple tasks

### 5. Batch Processing (Context7-Verified)

**Pattern from Context7:**

#### Async Concurrent Requests
```python
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def process_item(text: str) -> str:
    """Process a single item with Claude."""
    response = await client.messages.create(
        model="claude-3-5-haiku-20241022",  # Fast model for batch work
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Summarize in one sentence: {text}"
        }]
    )
    return response.content[0].text

async def process_batch(items: list[str]) -> list[str]:
    """Process a batch concurrently."""
    tasks = [process_item(item) for item in items]
    return await asyncio.gather(*tasks)

# Process 100 items concurrently
items = ["Article 1...", "Article 2..."]  # ... 100 article strings
results = asyncio.run(process_batch(items))

# Performance:
# Sequential: 100 items × 1s = 100s
# Concurrent (20 at a time): 100 × 1s / 20 = 5s (20x faster)
```

**Rate Limiting for Large Batches:**
```python
import asyncio
from asyncio import Semaphore

async def rate_limited_batch(items: list[str], max_concurrent: int = 50) -> list[str]:
    """Process a batch while capping the number of in-flight requests."""
    semaphore = Semaphore(max_concurrent)

    async def process_with_limit(item: str) -> str:
        async with semaphore:
            return await process_item(item)

    tasks = [process_with_limit(item) for item in items]
    return await asyncio.gather(*tasks)

# Process 1000 items with at most 50 concurrent requests
results = asyncio.run(rate_limited_batch(items, max_concurrent=50))
```

**Benefits:**
- 10-20x faster than sequential processing
- Automatic retries on transient errors (built into the SDK)
- Efficient use of rate limits
- Scales to thousands of requests

### 6. Token Optimization (Context7-Verified)

**Pattern from Context7:**

#### Token Counting
```python
# tiktoken's cl100k_base is an OpenAI encoding, so counts for Claude models are
# only approximate; for exact numbers use the SDK's client.messages.count_tokens().
import tiktoken

def count_tokens(text: str) -> int:
    """Approximate the token count using the cl100k_base encoding."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Optimize message length
def optimize_message(text: str, max_tokens: int = 4000) -> str:
    """Truncate a message to fit a token limit."""
    if count_tokens(text) <= max_tokens:
        return text

    # Truncate to fit
    encoding = tiktoken.get_encoding("cl100k_base")
    encoded = encoding.encode(text)
    return encoding.decode(encoded[:max_tokens])

# Usage
long_text = "..." * 10000
optimized = optimize_message(long_text, max_tokens=4000)
print(f"Original: {count_tokens(long_text)} tokens")
print(f"Optimized: {count_tokens(optimized)} tokens")
```

**Cost Impact:**
- Sonnet 3.5: $3 per 1M input tokens
- Trimming 10,000 requests/day from 8K → 4K input tokens saves 40M tokens/day
- Savings: $120 per day
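
The $120/day figure follows directly from the rate above; a one-line check:

```python
# Input-cost savings from trimming prompts 8K → 4K tokens at the Sonnet 3.5 rate
requests_per_day = 10_000
tokens_saved_per_request = 8_000 - 4_000
rate_per_mtok = 3.00  # USD per million input tokens
savings = requests_per_day * tokens_saved_per_request * rate_per_mtok / 1_000_000
print(f"${savings:.2f}/day")  # → $120.00/day
```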

#### max_tokens Optimization
```python
# Set appropriate output limits
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=150,  # Limit response length
    messages=[{
        "role": "user",
        "content": "Summarize in 2-3 sentences: " + article
    }]
)

# Benefits:
# - Prevents verbose responses
# - Reduces output costs (output is 5x more expensive than input)
# - Faster generation
# - More predictable costs
```

**Cost Savings:** Up to ~60% on output costs when uncapped responses would run well past 150 tokens

### 7. Rate Limiting and Retry Logic (Context7-Verified)

**Pattern from Context7:**

#### Exponential Backoff
```python
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)
from anthropic import Anthropic, RateLimitError, APIError

client = Anthropic()

@retry(
    retry=retry_if_exception_type((RateLimitError, APIError)),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(5)
)
def get_completion_with_retry(prompt: str) -> str:
    """
    Automatically retry on rate limit errors with exponential backoff.

    Waits between attempts grow exponentially, clamped to the 4-60s
    range, for up to 5 attempts in total.
    """
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": prompt
        }]
    )
    return response.content[0].text

# Usage
try:
    result = get_completion_with_retry("Explain machine learning")
    print(result)
except Exception as e:
    print(f"Failed after 5 attempts: {e}")
```

**Benefits:**
- Automatic retry on transient errors
- Exponential backoff avoids hammering the API
- ~95% success rate even under rate limits
- No manual error handling needed

## Optimization Output

```
🤖 Anthropic Claude API Optimization Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Project: AI Application
Current Usage: 2M tokens/day input, 500K tokens/day output
Monthly Cost: $600 (input) + $1,500 (output) = $2,100

📊 Current Performance Baseline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Model Usage:
- Model: claude-opus-4-20250514 (all tasks)
- Simple tasks: 80% (could use Haiku)
- Complex tasks: 20%

Caching:
- Cache usage: 0% (no caching)
- Repetitive context: 60% of requests

Request Pattern:
- Sequential requests: 1,000/day
- Average latency: 2s per request
- No streaming

⚡ Prompt Caching Optimization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Current: No caching, full price for every request
Recommended: Cache static context with cache_control

💡 Impact:
- First request: cache write at $3.75/MTok (25% markup)
- Subsequent requests: cache read at $0.30/MTok (90% discount)
- With 60% cache hit rate:
  * Before: $2,100/month
  * After: $450/month (cache writes) + $180/month (cache reads) = $630/month
  * Savings: $1,470/month (70% reduction)

Prompt caching configured ✓

🎯 Model Selection Optimization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️ Using Opus for all tasks (most expensive)
Current: 1,000 requests/day, all Opus

💡 Recommendations:
1. Route 80% simple tasks → Haiku (94% cost reduction)
2. Route 15% medium tasks → Sonnet (80% cost reduction)
3. Keep 5% complex tasks → Opus

Multi-model routing configured ✓

⚡ Impact (at ~2K input + ~500 output tokens per request):
- 800 Haiku requests: ≈ $2.88/day
- 150 Sonnet requests: ≈ $2.03/day
- 50 Opus requests: ≈ $3.38/day
- Total: ≈ $8.29/day ≈ $250/month
- Before optimization (all Opus): ≈ $2,100/month
- Savings: ≈ $1,850/month (88% reduction)

🌊 Streaming Optimization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️ Using non-streaming responses
Time to first token: 2s (full response wait)

💡 Recommendation: Enable streaming for long responses

⚡ Impact:
- Time to first token: 2s → 500ms (4x faster perceived)
- Better UX: progressive rendering
- Reduced user wait time: 75%

Streaming configured ✓

🧠 Extended Thinking Optimization
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Recommendation: Use extended thinking for the 5% of complex tasks

⚡ Impact:
- Accuracy improvement: 30-50% on hard problems
- Reduced hallucinations: 40%
- Cost: additional 5,000-10,000 thinking tokens per request
- Only for tasks requiring deep reasoning

🎯 Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Total Optimizations: 15

🔴 Critical: 4 (caching, model selection, streaming, batching)
🟡 High Impact: 7 (token optimization, retry logic, thinking)
🟢 Low Impact: 4 (monitoring, logging)

Performance Improvements:

Latency:
- Time to first token: 2s → 500ms (4x faster)
- Sequential processing: 33 min/day → 1.7 min/day (20x faster)
- Cache reads: near-instant (and 90% cheaper)

Cost Savings:
- Prompt caching: $1,470/month standalone (70% reduction)
- Model selection: ≈ $1,850/month standalone (88% reduction)
- Token optimization: $360/month standalone (30% reduction)
- Combined: $2,100 → ≈ $240/month (≈ 89% reduction; individual savings
  overlap, so they do not simply sum)

Quality:
- Extended thinking: 30-50% accuracy improvement on complex tasks
- Hallucinations: 40% reduction
- Success rate: 95%+ with retry logic

Run with --apply to implement optimizations
```

## Implementation

This command uses the **@anthropic-claude-expert** agent with optimization expertise:

1. Query Context7 for Anthropic optimization patterns
2. Analyze current API usage patterns
3. Identify caching opportunities
4. Configure model routing
5. Implement streaming
6. Set up rate limiting
7. Generate optimized code

## Best Practices Applied

Based on Context7 documentation from `/anthropic/anthropic-sdk-python`:

1. **Prompt Caching** - 70-90% cost reduction for repetitive context
2. **Model Selection** - Use Haiku for ~80% of tasks (94% cheaper than Opus)
3. **Streaming** - 4x faster perceived latency
4. **Extended Thinking** - 30-50% accuracy improvement on complex tasks
5. **Batch Processing** - Up to 20x faster with async concurrency
6. **Token Optimization** - ~30% cost reduction
7. **Exponential Backoff** - ~95% success rate under rate limits

## Related Commands

- `/openai:optimize` - OpenAI API optimization
- `/rag:optimize` - RAG system optimization
- `/llm:optimize` - General LLM optimization

## Troubleshooting

### High Costs
- Enable prompt caching (70-90% reduction)
- Route simple tasks to Haiku (94% cheaper)
- Use Sonnet instead of Opus (80% cheaper)
- Tighten max_tokens limits

### Slow Responses
- Enable streaming (4x faster perceived latency)
- Use Haiku for fast responses
- Implement async batch processing

### Low Accuracy
- Use Opus for complex tasks
- Enable extended thinking mode
- Increase max_tokens to leave room for reasoning
- Use prompt caching to maintain context

### Rate Limit Errors
- Implement exponential backoff with tenacity
- Cap batch concurrency with a semaphore
- Reduce concurrent requests

## Installation

```bash
# Install the Anthropic Python SDK (includes the AsyncAnthropic client)
pip install anthropic

# Install optimization dependencies (retries, approximate token counting)
pip install tenacity tiktoken
```

No extra install is needed for async support: `asyncio` is part of the Python standard library.

## Version History

- v2.0.0 - Initial Schema v2.0 release with Context7 integration
  - Prompt caching patterns for 70-90% cost reduction
  - Multi-model routing (Opus, Sonnet, Haiku)
  - Streaming response optimization
  - Extended thinking mode support
  - Async batch processing patterns
  - Token optimization utilities