ai-agent-rules 0.15.2__py3-none-any.whl

Files changed (52)
  1. ai_agent_rules-0.15.2.dist-info/METADATA +451 -0
  2. ai_agent_rules-0.15.2.dist-info/RECORD +52 -0
  3. ai_agent_rules-0.15.2.dist-info/WHEEL +5 -0
  4. ai_agent_rules-0.15.2.dist-info/entry_points.txt +3 -0
  5. ai_agent_rules-0.15.2.dist-info/licenses/LICENSE +22 -0
  6. ai_agent_rules-0.15.2.dist-info/top_level.txt +1 -0
  7. ai_rules/__init__.py +8 -0
  8. ai_rules/agents/__init__.py +1 -0
  9. ai_rules/agents/base.py +68 -0
  10. ai_rules/agents/claude.py +123 -0
  11. ai_rules/agents/cursor.py +70 -0
  12. ai_rules/agents/goose.py +47 -0
  13. ai_rules/agents/shared.py +35 -0
  14. ai_rules/bootstrap/__init__.py +75 -0
  15. ai_rules/bootstrap/config.py +261 -0
  16. ai_rules/bootstrap/installer.py +279 -0
  17. ai_rules/bootstrap/updater.py +344 -0
  18. ai_rules/bootstrap/version.py +52 -0
  19. ai_rules/cli.py +2434 -0
  20. ai_rules/completions.py +194 -0
  21. ai_rules/config/AGENTS.md +249 -0
  22. ai_rules/config/chat_agent_hints.md +1 -0
  23. ai_rules/config/claude/CLAUDE.md +1 -0
  24. ai_rules/config/claude/agents/code-reviewer.md +121 -0
  25. ai_rules/config/claude/commands/agents-md.md +422 -0
  26. ai_rules/config/claude/commands/annotate-changelog.md +191 -0
  27. ai_rules/config/claude/commands/comment-cleanup.md +161 -0
  28. ai_rules/config/claude/commands/continue-crash.md +38 -0
  29. ai_rules/config/claude/commands/dev-docs.md +169 -0
  30. ai_rules/config/claude/commands/pr-creator.md +247 -0
  31. ai_rules/config/claude/commands/test-cleanup.md +244 -0
  32. ai_rules/config/claude/commands/update-docs.md +324 -0
  33. ai_rules/config/claude/hooks/subagentStop.py +92 -0
  34. ai_rules/config/claude/mcps.json +1 -0
  35. ai_rules/config/claude/settings.json +119 -0
  36. ai_rules/config/claude/skills/doc-writer/SKILL.md +293 -0
  37. ai_rules/config/claude/skills/doc-writer/resources/templates.md +495 -0
  38. ai_rules/config/claude/skills/prompt-engineer/SKILL.md +272 -0
  39. ai_rules/config/claude/skills/prompt-engineer/resources/prompt_engineering_guide_2025.md +855 -0
  40. ai_rules/config/claude/skills/prompt-engineer/resources/templates.md +232 -0
  41. ai_rules/config/cursor/keybindings.json +14 -0
  42. ai_rules/config/cursor/settings.json +81 -0
  43. ai_rules/config/goose/.goosehints +1 -0
  44. ai_rules/config/goose/config.yaml +55 -0
  45. ai_rules/config/profiles/default.yaml +6 -0
  46. ai_rules/config/profiles/work.yaml +11 -0
  47. ai_rules/config.py +644 -0
  48. ai_rules/display.py +40 -0
  49. ai_rules/mcp.py +369 -0
  50. ai_rules/profiles.py +187 -0
  51. ai_rules/symlinks.py +207 -0
  52. ai_rules/utils.py +35 -0
ai_rules/config/claude/skills/prompt-engineer/resources/prompt_engineering_guide_2025.md

# LLM Prompt Engineering Reference Guide (November 2025)

Comprehensive guide to effective prompting techniques for modern large language models, based on current research and validated best practices.

---

## Table of Contents

1. [Introduction & Current Models](#1-introduction--current-models)
   - [Current Model Landscape](#current-model-landscape)
   - [Model Selection Framework](#model-selection-framework)

2. [Core Prompting Techniques](#2-core-prompting-techniques)
   - [Reasoning & Analysis](#reasoning--analysis)
   - [Structured Output Frameworks](#structured-output-frameworks)
   - [Learning Approaches](#learning-approaches)

3. [Software Engineering Prompts](#3-software-engineering-prompts)
   - [Development Workflows](#development-workflows)
   - [Code Generation Patterns](#code-generation-patterns)
   - [Debugging Strategies](#debugging-strategies)

4. [Advanced Techniques](#4-advanced-techniques)
   - [Agentic Patterns](#agentic-patterns)
   - [RAG & Knowledge Integration](#rag--knowledge-integration)
   - [Multi-Path Analysis](#multi-path-analysis)

5. [Model-Specific Optimizations](#5-model-specific-optimizations)
   - [Claude 4.5 Series](#claude-45-series)
   - [GPT-5 and GPT-4.1](#gpt-5-and-gpt-41)
   - [Reasoning Models (o3, DeepSeek R1)](#reasoning-models-o3-deepseek-r1)
   - [Gemini 2.5](#gemini-25)

6. [Quick Reference](#6-quick-reference)
   - [Validated Techniques](#validated-techniques)
   - [Debunked Myths](#debunked-myths)
   - [Common Pitfalls](#common-pitfalls)
   - [Performance Benchmarks](#performance-benchmarks)

---

## 1. Introduction & Current Models

### Current Model Landscape

| Model | Version | Best For | Context Window | Key Feature |
|-------|---------|----------|----------------|-------------|
| **Claude Sonnet 4.5** | Sept 2025 | Default choice, coding, agents | 200K (1M beta) | Extended thinking, 77.2% SWE-bench |
| **Claude Haiku 4.5** | Oct 2025 | Speed, cost optimization | 200K | 2-5x faster, frontier performance |
| **Claude Opus 4.1** | Aug 2025 | Maximum capability | 200K | Best for complex refactoring |
| **GPT-5** | Oct 2025 | General purpose | Large | Reportedly very large parameter count (undisclosed) |
| **GPT-4.1** | 2025 | Agentic tasks | Large | Optimized for tool use |
| **o3 / o3-mini** | 2025 | Deep reasoning | N/A | Specialized reasoning model |
| **DeepSeek R1** | Jan 2025 | Cost-effective reasoning | 128K | 27x cheaper than o3, open source |
| **Gemini 2.5 Pro** | 2025 | Multimodal | Large | Best pricing ($1.25/$10 per M tokens) |

### Model Selection Framework

**Claude Sonnet 4.5:** Default for most tasks, coding, autonomous agents (30+ hour focus), computer use. Best capability/cost balance.

**Claude Haiku 4.5:** Speed-critical work (2-5x faster), high-volume processing, sub-agents, cost optimization. Frontier performance at 1/3 the price.

**Claude Opus 4.1:** Maximum capability needs, complex multi-file refactoring, intricate multi-agent frameworks when cost isn't the primary concern.

**GPT-5:** Broad general knowledge and non-coding tasks that benefit from its scale.

**o3 or DeepSeek R1:** Deep reasoning, mathematical/logical proofs, scientific analysis. Choose DeepSeek for budget constraints (27x cheaper).

**Gemini 2.5 Pro:** Multimodal inputs, cost optimization at high volumes, competitive pricing with strong performance.

---

## 2. Core Prompting Techniques

### Reasoning & Analysis

#### Chain-of-Thought (CoT) Prompting

**What it is:** Instructs the model to show its reasoning step by step before providing a final answer.

**When to use:** Complex problems requiring logical reasoning, mathematical calculations, multi-step analysis, debugging.

**How to use:**

```
Solve this problem step by step:
A company's revenue was $500K in Q1. In Q2, it increased by 15%.
In Q3, it decreased by 10% from Q2. What was the Q3 revenue?

Show your work:
1. Calculate Q2 revenue
2. Calculate Q3 revenue
3. Provide the final answer
```

**Why it works:** Forces sequential processing, reducing errors. Research: 80.2% accuracy on GSM8K vs 34% baseline. "Take a deep breath and work step-by-step" is a validated trigger phrase.
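The arithmetic in the worked example above can be checked directly; using integer math avoids floating-point noise:

```python
# Verify the chain-of-thought example's arithmetic step by step.
q1 = 500_000             # Q1 revenue in dollars
q2 = q1 * 115 // 100     # Q2: 15% increase
q3 = q2 * 90 // 100      # Q3: 10% decrease from Q2
print(f"Q2=${q2:,}, Q3=${q3:,}")
```

A correct CoT response should arrive at Q2 = $575,000 and Q3 = $517,500.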
**Advanced variants:**
- **Chain-of-Table:** Tabular data analysis, 8.69% improvement over standard CoT
- **Plan-and-Solve:** Generates a plan first, then executes it; 7-27% gains

#### Tree of Thought (ToT)

**What it is:** Explores multiple reasoning paths simultaneously, then evaluates which leads to the best solution.

**When to use:** Decision-making with multiple options, strategic planning, trade-off analysis.

**How to use:**

```
Analyze this decision using Tree of Thought:

Decision: Should we migrate our database from PostgreSQL to MongoDB?

Explore three paths:
Path 1: Migrate fully to MongoDB
- List all advantages
- List all disadvantages
- Estimate effort and risk

Path 2: Keep PostgreSQL
- List reasons to stay
- List current pain points
- Estimate opportunity cost

Path 3: Hybrid approach (both databases)
- Describe the hybrid architecture
- List pros and cons
- Estimate complexity

After exploring all three paths, evaluate which approach is best and why.
```

**Why it works:** Prevents premature optimization and tunnel vision through systematic exploration.

#### Self-Consistency for Reliability

**What it is:** Generate multiple independent analyses, then identify themes appearing consistently across all of them.

**When to use:** High-stakes decisions, complex problems with uncertainty, validating outputs, reducing single-path errors.

**How to use:**

```
Generate 3 different analyses of employee retention challenges in tech companies:

Analysis 1: From an organizational psychology perspective
Analysis 2: From a labor economics perspective
Analysis 3: From an HR management perspective

After completing all three, identify convergent themes and recommendations.
```

**Why it works:** Multiple reasoning chains converging on the same conclusions catch model uncertainty and single-path errors.
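For answers with a short, comparable final form, self-consistency is often automated by sampling several completions and keeping the majority answer. A minimal sketch, assuming a hypothetical `ask(prompt)` callable that returns one sampled model answer per call:

```python
from collections import Counter

def self_consistent_answer(ask, prompt, n=5):
    """Sample n independent answers and return the most common one.

    `ask` is any callable that sends the prompt to a model (sampling
    temperature > 0, so runs differ) and returns a short final answer.
    """
    answers = [ask(prompt) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n  # majority answer plus agreement ratio

# Demonstration with a stubbed model that is right 4 times out of 5:
canned = iter(["517500", "517500", "515000", "517500", "517500"])
answer, agreement = self_consistent_answer(lambda p: next(canned), "Q3 revenue?")
```

A low agreement ratio is itself a useful signal that the problem deserves a closer look.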
#### Extended Thinking (Claude-Specific)

**What it is:** Claude 4.5 performs additional reasoning before response generation, visible to you.

**When to use:** Complex coding, deep analysis, multi-step logical reasoning, problems requiring careful consideration.

**How to invoke:**

Via API:
```json
{
  "model": "claude-sonnet-4-5-20250929",
  "max_tokens": 16000,
  "thinking": {
    "type": "enabled",
    "budget_tokens": 10000
  },
  "messages": [...]
}
```

Via Claude.ai: Toggle "Extended thinking" mode (requires Pro/Max/Team/Enterprise).

**Why it works:** A dedicated computational budget for reasoning achieves 5-7% performance gains. Thinking happens BEFORE the response.

**Key notes:** Minimum budget of 1,024 tokens; thinking tokens are charged at output rates; supports tool use (beta).

### Structured Output Frameworks

#### CO-STAR Framework for Writing

**Components:** **C**ontext, **O**bjective, **S**tyle, **T**one, **A**udience, **R**esponse format

**When to use:** Marketing copy, technical documentation, business communications, content creation.

**How to use:**

```
Context: Launching new API feature for real-time webhook notifications for payment events.

Objective: Write product announcement blog post generating excitement and driving adoption.

Style: Technical but accessible, conversational yet professional

Tone: Enthusiastic and developer-friendly

Audience: Software developers and engineering teams integrating with our payment API

Response format:
- Catchy headline
- Opening paragraph (2-3 sentences)
- "What's New" section with technical details
- "Why This Matters" section
- "Getting Started" section with code example
- Call-to-action
```

**Why it works:** Won prompt engineering competitions. Ensures all critical dimensions are specified.
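When CO-STAR prompts are generated programmatically, a small template helper keeps all six dimensions mandatory. A sketch (the six field names follow the framework; the sample values are illustrative):

```python
def costar_prompt(context, objective, style, tone, audience, response_format):
    """Assemble a CO-STAR prompt; every component is a required argument."""
    return (
        f"Context: {context}\n\n"
        f"Objective: {objective}\n\n"
        f"Style: {style}\n\n"
        f"Tone: {tone}\n\n"
        f"Audience: {audience}\n\n"
        f"Response format:\n{response_format}"
    )

prompt = costar_prompt(
    context="Launching real-time webhook notifications for payment events.",
    objective="Write a product announcement blog post driving adoption.",
    style="Technical but accessible, conversational yet professional",
    tone="Enthusiastic and developer-friendly",
    audience="Developers integrating with our payment API",
    response_format="- Headline\n- What's New\n- Getting Started\n- Call-to-action",
)
```

Because the arguments have no defaults, forgetting a dimension fails loudly instead of silently producing an under-specified prompt.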
#### ROSES Framework for Decisions

**Components:** **R**ole, **O**bjective, **S**cenario, **E**xpected Output, **S**tyle

**When to use:** Strategic business decisions, technical architecture choices, resource allocation, risk assessment.

**How to use:**

```
Role: Act as a CTO evaluating infrastructure decisions

Objective: Decide whether to adopt Kubernetes for our microservices architecture

Scenario:
- Currently running 12 microservices on EC2 with manual deployment
- Team of 8 engineers, 3 have container experience
- Growing 30% quarter-over-quarter
- Need to improve deployment speed and reliability

Expected Output:
- Recommendation (Yes/No/Phased)
- 3-5 key decision factors
- Risk assessment
- Implementation timeline if recommended

Style: Data-driven, practical, focused on team capabilities and business impact
```

**Why it works:** Structures complex decisions with clear boundaries, producing actionable recommendations instead of generic analysis.

### Learning Approaches

#### Few-Shot Learning

**What it is:** Providing 3-5 examples of the desired input-output pattern before asking the model to perform the task.

**When to use:** Custom output formats, specific writing styles, pattern matching, domain conventions. **Standard models ONLY** (NOT reasoning models).

**How to use:**

```
Extract key information from product reviews into structured format.

Example 1:
Input: "Great laptop, fast processor but battery life could be better. Worth the price."
Output: {sentiment: "positive", pros: ["fast processor", "good value"], cons: ["battery life"], rating_implied: 4}

Example 2:
Input: "Terrible build quality. Broke after 2 weeks. Don't waste your money."
Output: {sentiment: "negative", pros: [], cons: ["build quality", "durability"], rating_implied: 1}

Now extract information from this review:
"Amazing screen quality and super lightweight. The speakers are weak though. Great for travel."
```

**Why it works:** Shows the model the exact pattern, reducing ambiguity. Effective with standard models (GPT-4, Claude Sonnet, Gemini).

**CRITICAL WARNING:** Few-shot examples **harm** reasoning models (o1, o3, DeepSeek R1). Use zero-shot prompts for these models.
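Few-shot prompts built in code should keep a uniform Input/Output shape across every example, since inconsistent formatting is itself a pattern the model will copy. A minimal builder sketch (the helper and its example pairs are illustrative, echoing the review task above):

```python
def few_shot_prompt(task, examples, query):
    """Build a few-shot prompt from (input, output) example pairs."""
    parts = [task, ""]
    for i, (inp, out) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f"Input: {inp}", f"Output: {out}", ""]
    # End on "Output:" so the model completes the pattern directly.
    parts += ["Now handle this input:", f"Input: {query}", "Output:"]
    return "\n".join(parts)

prompt = few_shot_prompt(
    "Extract key information from product reviews into structured format.",
    [
        ('"Great laptop, but battery life could be better."',
         '{sentiment: "positive", cons: ["battery life"]}'),
        ('"Broke after 2 weeks."',
         '{sentiment: "negative", cons: ["durability"]}'),
    ],
    '"Amazing screen, weak speakers."',
)
```

Keeping it at 3-5 examples, as recommended above, usually captures the format without wasting context.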
#### Zero-Shot Prompting

**What it is:** Task instructions without examples, relying on the model's pre-training.

**When to use:** Reasoning models (o1, o3, DeepSeek R1) - **REQUIRED**; simple common tasks; when examples might bias the output.

**How to use:**

```
Analyze this code for potential security vulnerabilities:

[CODE HERE]

For each vulnerability found:
- Describe the issue
- Explain the potential impact
- Provide a secure alternative
```

**Why it works:** Adding examples to reasoning models interferes with their internal reasoning process.

**Best practice:** Start zero-shot. Add few-shot only if: (1) NOT using a reasoning model, (2) output quality is insufficient, (3) you need a very specific format.

---

## 3. Software Engineering Prompts

### Development Workflows

#### Architecture-First Prompting

**Pattern:** Context → Goal → Constraints → Technical Requirements

**When to use:** New features, refactoring, integrating with existing codebases, complex implementations.

**How to use:**

```
Context: Express.js API with PostgreSQL. JWT tokens stored in PostgreSQL sessions table.

Goal: Implement rate limiting to prevent API abuse on authentication endpoints.

Constraints:
- Must work with existing JWT auth system
- Should not add latency >10ms
- Scale across multiple API server instances
- Cannot require additional database queries per request

Technical Requirements:
- Use Redis for distributed rate limit tracking
- Implement sliding window algorithm
- Different limits for authenticated vs unauthenticated
- Configurable limits per endpoint

Provide architecture design first, then implement core rate limiting middleware.
```

**Why it works:** Prevents code that solves the wrong problem or violates constraints. Establishes clear boundaries upfront.
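To make the sliding-window requirement in the example concrete, here is an in-memory sketch of the algorithm. It is illustrative only: the prompt's real middleware would keep these timestamps in Redis (for example, a sorted set per client) so that all API instances share state.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client."""

    def __init__(self, limit=5, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client_id]
        while q and now - q[0] >= self.window:  # drop expired timestamps
            q.popleft()
        if len(q) >= self.limit:
            return False  # window is full: reject
        q.append(now)
        return True

limiter = SlidingWindowLimiter(limit=3, window=60.0)
results = [limiter.allow("ip-1", now=t) for t in (0, 1, 2, 3)]
# Three requests pass; the fourth inside the same window is rejected.
```

Unlike fixed-window counters, the window slides with each request, so a client cannot burst twice around a window boundary.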
#### Security-First Two-Stage Prompting

**What it is:** Generate functional code first, then explicitly harden it for security.

**When to use:** Code handling user input, authentication/authorization, payments, database queries, file operations, API integrations.

**How to use:**

Stage 1:
```
Implement user registration endpoint:
- Accept email, password, username
- Validate email format
- Hash password with bcrypt
- Store in PostgreSQL users table
- Return success/error response
```

Stage 2:
```
Review the registration endpoint for security vulnerabilities:

[PASTE CODE FROM STAGE 1]

Harden against: SQL injection, email injection, password policy (min 12 chars, complexity),
rate limiting, input sanitization, error message information disclosure.
```

**Why it works:** 40%+ of AI-generated code has vulnerabilities without security prompting; the two-stage approach reduces them by 50%+. It catches missing input validation (16-18% of code) and hardcoded credentials.

#### Test-Driven Development with AI

**What it is:** Write comprehensive test cases first, then have the AI implement code until the tests pass.

**When to use:** Critical business logic, complex algorithms, reducing hallucination, ensuring correctness, refactoring.

**How to use:**

```
Write comprehensive test cases for a function validating credit card numbers using Luhn algorithm.

Cover: Valid cards (Visa, MasterCard, Amex), invalid cards (wrong checksum), edge cases
(all zeros, all nines, single digit), invalid input (non-numeric, null, undefined, empty),
length validation (too short, too long).

First provide test suite, then implement function making all tests pass.
```

**Why it works:** Tests act as a formal specification, dramatically reducing hallucination with concrete pass/fail criteria.
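A sketch of what the test-first deliverable for the Luhn prompt above might look like: a handful of the requested cases, followed by an implementation that satisfies them ("79927398713" is the standard Luhn example number; card-brand and length checks from the prompt are left out for brevity):

```python
def is_valid_card_number(number):
    """Luhn checksum validation for a card number given as a string."""
    if not isinstance(number, str) or not number.isdigit():
        return False  # covers non-numeric, empty, and None-like input
    digits = [int(d) for d in number]
    # Double every second digit from the right; subtract 9 when it overflows.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] = digits[i] * 2 - 9 if digits[i] > 4 else digits[i] * 2
    return sum(digits) % 10 == 0

# A few of the test cases the prompt asks for:
assert is_valid_card_number("79927398713")       # valid checksum
assert not is_valid_card_number("79927398710")   # wrong checksum
assert not is_valid_card_number("")              # empty input
assert not is_valid_card_number("7992739871x")   # non-numeric input
```

Writing the asserts first gives the model unambiguous pass/fail targets instead of a prose description.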
### Code Generation Patterns

#### Explicit Instruction Following (Claude 4)

**Why this matters:** Claude 4 follows instructions precisely but won't infer unstated requirements or add features that weren't requested.

**How to adapt:**

❌ Too implicit: `Create a user profile component`

✅ Explicit:
```
Create a React user profile component with:
- Props: userId (string), onEdit (callback), readOnly (boolean)
- Display: avatar image, username, email, bio
- Edit button (only shown when readOnly=false)
- Loading state while fetching user data
- Error state with retry button if fetch fails
- Use Tailwind for styling
- TypeScript with proper type definitions
```

**Key principles:** State all requirements, specify error handling, define expected behaviors, list edge cases, provide context about WHY.

**Why it works:** Claude 4's training prioritizes instruction-following over inference.

#### Iterative Refinement Pattern

**What it is:** Generate initial code, then iteratively improve specific aspects in separate prompts.

**When to use:** Complex implementations, uncertain requirements, performance optimization, code review.

**Pattern:**
1. Basic implementation
2. Add features
3. Optimize
4. Production-ready (error handling, logging, types, docs)

**Why it works:** Breaks complexity into manageable pieces. Each iteration focuses on a specific aspect. You validate direction before adding complexity.

### Debugging Strategies

#### Structured Debugging Pattern

**Pattern:** Error Message → Stack Trace → Context → Expected Behavior

**How to use:**

```
Error Message:
TypeError: Cannot read property 'map' of undefined at UserList.render (UserList.jsx:23)

Stack Trace:
at UserList.render (UserList.jsx:23:15)
at renderComponent (react-dom.js:1847)

Context:
- React component displaying user list
- Data from API endpoint /api/users
- Renders correctly on initial load
- Error on "Refresh" button click
- API call succeeds (200 with valid JSON)

Code: [PASTE RELEVANT CODE]

Expected: Refresh button fetches fresh data and re-renders without errors.

What's causing this and how do I fix it?
```

**Why it works:** Complete context enables root cause analysis instead of guessing.

#### First Principles Debugging

**What it is:** Ask the model to explain the root cause from first principles rather than jumping to solutions.

**When to use:** Persistent bugs, unexpected behavior in complex systems, learning opportunities.

**How to use:**

```
PostgreSQL query extremely slow (5+ seconds) despite index on queried column.

Query: SELECT * FROM orders WHERE user_id = 123 AND created_at > '2025-01-01'
Index: CREATE INDEX idx_orders_user_id ON orders(user_id)
Table size: 10 million rows

Explain from first principles:
1. How PostgreSQL should be using the index
2. Why the index might not be helping
3. What's actually happening during query execution
4. The root cause of the slowness

Then suggest fix with explanation of why it works.
```

**Why it works:** Forces understanding rather than pattern matching. Produces learning and prevents recurrence.

---

## 4. Advanced Techniques

### Agentic Patterns

#### ReAct Pattern (Reason + Act)

**Pattern:** Thought → Action → Observation → Thought → Action → ...

**When to use:** Multi-step tasks with tools, research/information gathering, complex problem-solving, autonomous agents.

**How to use:**

```
You are an agent using tools to answer questions. Follow ReAct pattern:

Thought: [Reason about what you need to do next]
Action: [Tool to use and how]
Observation: [Result from tool]
[Repeat until final answer]

Available tools:
- search(query): Search web
- calculate(expression): Evaluate math
- fetch_url(url): Get content from URL

Question: What was the GDP of the country that hosted the 2020 Olympics in the year they hosted?
```

**Why it works:** 20-30% improvement over direct prompting. The explicit reasoning loop prevents acting without thinking.

**Tips:** Define tools clearly, require "Thought" before "Action", allow multiple iterations, parse observations back into the loop.
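The loop itself is driven from code: call the model, parse out an `Action:`, run the named tool, and feed the result back as an `Observation:`. A minimal controller sketch, assuming a hypothetical `ask(transcript)` model call and a dict of tool functions:

```python
import re

def react_loop(ask, tools, question, max_steps=5):
    """Drive a Thought/Action/Observation loop until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = ask(transcript)            # model emits Thought + Action lines
        transcript += step + "\n"
        final = re.search(r"Final Answer:\s*(.+)", step)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(\w+)\((.*)\)", step)
        if action:
            name, arg = action.groups()
            result = tools[name](arg.strip('"'))  # run the requested tool
            transcript += f"Observation: {result}\n"
    return None  # gave up after max_steps

# Stubbed model and tool showing one full Thought/Action/Observation cycle:
steps = iter([
    'Thought: I need to look this up.\nAction: search("2020 Olympics host")',
    "Thought: The host was Japan.\nFinal Answer: Japan",
])
answer = react_loop(lambda t: next(steps),
                    {"search": lambda q: "Tokyo, Japan"},
                    "Which country hosted the 2020 Olympics?")
```

The `max_steps` cap matters in practice: without it, a confused agent can loop on the same action indefinitely.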
#### Reflexion Pattern

**Pattern:** Attempt → Evaluate → Reflect → Retry

**When to use:** Optimization tasks, learning from errors, iterative improvement.

**How to use:**

```
Solve this using Reflexion pattern:

Problem: Implement function finding longest palindromic substring.

Attempt 1: [Implementation]
Evaluation: Test with "babad", "cbbd", "a", "ac"
Reflection: What worked? What failed? Why? What should change next time?
Attempt 2: [Improved implementation]
[Repeat until tests pass]
```

**Why it works:** 91% pass@1 on HumanEval. Self-reflection catches initial errors.
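For reference, one correct solution to the palindrome problem the Reflexion prompt targets (expand-around-center), checked against the prompt's own evaluation inputs; a Reflexion run should converge to something equivalent:

```python
def longest_palindrome(s):
    """Longest palindromic substring via expand-around-center, O(n^2)."""
    if not s:
        return ""
    best = s[0]
    for i in range(len(s)):
        for lo, hi in ((i, i), (i, i + 1)):   # odd- and even-length centers
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo, hi = lo - 1, hi + 1
            if hi - lo - 1 > len(best):       # expansion overshoots by one each side
                best = s[lo + 1:hi]
    return best

# The evaluation cases from the prompt above (some have two valid answers):
assert longest_palindrome("babad") in ("bab", "aba")
assert longest_palindrome("cbbd") == "bb"
assert longest_palindrome("a") == "a"
assert longest_palindrome("ac") in ("a", "c")
```

Note that the evaluation step has to accept ties ("babad" has two valid answers), a detail a first attempt often misses and reflection tends to surface.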
#### Multi-Agent Patterns

**Common patterns:** Hierarchical (manager delegates), Sequential (pipeline), Collaborative (multiple inputs synthesized).

**When to use:** Very complex tasks, quality validation, parallel exploration, large-scale projects.

**Example - Collaborative:**

```
Design new API endpoint for payment processing.

Agent 1 (API Designer): Design endpoint spec (method, path, schemas, status codes, auth)
Agent 2 (Security Reviewer): Review for auth vulnerabilities, authorization, input validation, info disclosure
Agent 3 (Performance Engineer): Analyze latency, query efficiency, caching, scalability
Final Synthesizer: Combine insights into final design balancing all concerns.
```

**Why it works:** Specialization enables deep focus. Multiple perspectives catch more issues. Higher quality, but higher token cost.

### RAG & Knowledge Integration

#### Preventing Hallucination with RAG

**What it is:** Provide specific documents and instruct the model to answer ONLY from that information.

**When to use:** Questions about specific documents, fact-based responses with citations, domain knowledge, preventing confabulation.

**How to use:**

```
Based ONLY on the following documents, answer the question.
If answer not in documents, say "I cannot answer based on provided documents."

Document 1: [CONTENT]
Document 2: [CONTENT]
Document 3: [CONTENT]

Question: [USER QUESTION]

Answer based only on documents above. Cite which document(s) used.
```

**Why it works:** "Based ONLY on" discourages reliance on parametric knowledge, dramatically reducing hallucination.

**Best practices:** Use "ONLY"/"exclusively", explicitly instruct the model to say when it doesn't know, request citations, keep documents focused (quality > quantity).
#### Context Window Optimization

**Key finding:** Models pay most attention to the beginning and end of the context (the "lost in the middle" problem).

**Best practices:**

1. **Critical information positioning:** Most important content at the start, supporting detail in the middle, restate key context with the query at the end.

2. **Structure with markers:**
```
<critical_information>[Most important]</critical_information>
<background>[Additional context]</background>
<task>[What you want done]</task>
```

3. **Relevance ordering:** Most relevant first, least relevant in the middle, second-most relevant at the end.

**Why it works:** Combats attention decay across long contexts.
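The positioning advice can be encoded in a small assembler that always puts critical material first and restates it after the task. A sketch using the marker tags shown above (the helper and field names are illustrative):

```python
def build_context(critical, background, task):
    """Order prompt sections to counter the "lost in the middle" effect."""
    return "\n".join([
        f"<critical_information>{critical}</critical_information>",
        f"<background>{background}</background>",  # middle: lowest attention
        f"<task>{task}</task>",
        f"Reminder: {critical}",                   # restate key context at the end
    ])

ctx = build_context(
    critical="API keys must never appear in logs.",
    background="The service writes structured JSON logs to stdout.",
    task="Review the logging module for violations.",
)
```

Restating the critical constraint at the end duplicates a few tokens, but places it in both high-attention regions of the context.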
### Multi-Path Analysis

#### Ensemble Methods

**What it is:** Generate multiple independent solutions and synthesize the best answer.

**When to use:** High-stakes decisions, complex problems with multiple approaches, quality assurance, creative tasks.

**How to use:**

```
Generate 5 different database schemas for social media app with users, posts, comments, likes.

Solution 1-5: [GENERATE SCHEMAS]

Analyze all 5:
- Strengths of each?
- Trade-offs?
- Which aspects should be combined?

Provide final synthesized solution combining best elements.
```

**Why it works:** Explores the solution space thoroughly. Synthesis compensates for individual weaknesses. Higher quality than single-shot generation.

#### Debate Pattern

**What it is:** The model argues different positions before reaching a conclusion.

**When to use:** Controversial decisions, trade-off analysis, surfacing hidden assumptions, avoiding bias.

**How to use:**

```
Debate: "Should we use microservices or monolith?"

Round 1: Both advocates make strongest cases
Round 2: Both rebut opposing arguments
Round 3: Identify common ground and key trade-offs
Final synthesis: Nuanced recommendation based on debate
```

**Why it works:** Forces multiple perspectives. Reveals assumptions and argument weaknesses. Produces more balanced conclusions.

---

## 5. Model-Specific Optimizations

### Claude 4.5 Series

**Key characteristics:** Requires explicit prompting. No "above and beyond" behavior: won't add unstated features, requires explicit error handling, needs context about WHY. Responds better to positive framing.

#### XML Tags for Structure

**When to use:** Complex prompts with multiple sections, clear separation of examples/instructions/context, nested hierarchies.

**How to use:**

```
<context>
REST API for e-commerce. JWT authentication. PostgreSQL storage. RESTful conventions.
</context>

<requirements>
Create endpoint for updating product inventory:
- PUT /products/{productId}/inventory
- Requires authentication
- Accepts quantity (integer) in body
- Validates quantity non-negative
- Returns updated product with new inventory
</requirements>

<constraints>
- Use existing auth middleware
- Validate user has "inventory_manager" role
- Use database transactions
- Log inventory changes for audit
</constraints>

<examples>
<example>
<request>PUT /products/123/inventory with quantity: 50</request>
<response>{id: 123, name: "Widget", inventory: 50, updated_at: "2025-11-24T10:00:00Z"}</response>
</example>
</examples>

Implement this endpoint.
```

**Why it works:** Claude's training specifically recognizes XML tags, improving parsing of prompt structure.

#### Best Practices

1. **Be extremely explicit:** List all requirements including error handling, loading states, edge cases, styling, types.
2. **Provide WHY context:** Explain the reason for requirements (e.g., HIPAA compliance for strict passwords).
3. **Positive framing:** "Return descriptive error messages", not "Don't return error codes".
4. **Match format:** JSON prompts for JSON output, code structure for code output, markdown for markdown.

### GPT-5 and GPT-4.1

**Key characteristics:** Latest general-purpose models (GPT-5's parameter count is undisclosed, though reported to be very large). GPT-4.1 is optimized for agentic workflows and tool use. Both are highly literal.

#### Literal Instruction Following

GPT models execute exactly what you ask without creative interpretation.

**Best practices:**

1. **Specify format precisely:** Show the exact output structure with an example.
2. **Define boundaries explicitly:** "Exactly 3 paragraphs, 3-4 sentences each".
3. **Use JSON mode:** Provide a schema for structured output.

#### Tool Use with GPT-4.1

**Best practices:**

1. **Define tools with precise schemas:** Include descriptions for all parameters.
2. **Provide tool use examples:** Show correct calling in the system message or few-shot examples.
3. **Handle tool errors gracefully:** Clear error messages help the model adjust.
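Point 1 amounts to writing a JSON-schema description for every tool. A sketch of one definition in the function-calling shape that OpenAI-compatible chat APIs accept (the structural field names follow that convention; the tool itself is a made-up example):

```python
# One tool definition in OpenAI-style function-calling format.
search_orders_tool = {
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Look up a customer's recent orders by email address.",
        "parameters": {
            "type": "object",
            "properties": {
                "email": {
                    "type": "string",
                    "description": "Customer email, e.g. jane@example.com",
                },
                "limit": {
                    "type": "integer",
                    "description": "Maximum number of orders to return (default 10)",
                },
            },
            "required": ["email"],
        },
    },
}
```

The per-parameter `description` fields are what the model actually reads when deciding how to fill arguments, so they deserve as much care as the prompt itself.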

### Reasoning Models (o3, DeepSeek R1)

**Critical differences:** Zero-shot beats few-shot (examples hurt), minimal prompting works better, reasoning is built in (no CoT prompting needed), latency is higher, and the use cases differ (deep reasoning, not rapid generation).

#### When to Use Reasoning Models

**Use for:** Math problems and proofs, complex logical reasoning, scientific analysis, deep-thought problems, multi-step solving.

**Don't use for:** Simple retrieval, content generation, code formatting, quick facts, high-throughput workloads (use standard models).

#### Best Practices

1. **Keep prompts simple:** ✅ "Prove square root of 2 is irrational." ❌ "Think step by step. First consider X..."
2. **Don't provide examples:** ✅ "Solve: [PROBLEM]" ❌ "Example 1: [...] Example 2: [...] Now solve..."
3. **Let the model show its reasoning:** "Solve this problem and show your reasoning: [PROBLEM]"
4. **Trust thinking time:** 30+ seconds is normal and beneficial.
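
The contrast in the list above can be made concrete (prompt wording is illustrative): for a reasoning model the entire prompt is the problem statement, while the few-shot scaffolding that helps standard models is deliberately absent.

```python
problem = "Prove that the square root of 2 is irrational."

# Standard-model style: examples plus a step-by-step trigger
# (helps GPT/Claude/Gemini, but hurts o3/DeepSeek R1).
standard_prompt = "\n".join([
    "Example 1: Prove sqrt(3) is irrational. Answer: ...",
    "Think step by step.",
    f"Now solve: {problem}",
])

# Reasoning-model style: just the task, optionally asking the
# model to surface its reasoning.
reasoning_prompt = f"Solve this problem and show your reasoning: {problem}"
```

The same task, two prompts: if you swap models without swapping prompt styles, the scaffolding that improved one model actively degrades the other.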

#### DeepSeek R1 vs o3

**Choose DeepSeek R1:** Budget constraints, self-hosting needs, research projects, high-volume reasoning (roughly 27x cheaper, MIT license, 128K context).

**Choose o3:** Maximum accuracy required, enterprise SLAs needed, proprietary data handling when self-hosting isn't an option.

### Gemini 2.5

**Key characteristics:** Excellent multimodal support, competitive pricing ($1.25/$10 per million input/output tokens), temperature-sensitive, strong benchmark results.

#### Temperature Settings

**Critical:** Gemini is more temperature-sensitive than other models. **Keep it at 1.0** unless you have a specific reason to change it.

**Effects:** 0.0-0.3 (very deterministic, repetitive), 0.4-0.7 (balanced), 0.8-1.0 (creative, recommended), above 1.0 (increasingly random; use with caution).

#### Best Practices

1. **Default temperature 1.0**
2. **Leverage multimodal:** Combine images + text for richer context
3. **Cost optimization:** Excellent price/performance for high-volume use
4. **Structured output:** Use JSON mode for consistent formatting
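
Practices 1 and 4 can be sketched as a generation config. The dict mirrors the shape of Gemini's `generationConfig` in the public REST API (`temperature`, `responseMimeType`, `maxOutputTokens`); treat the exact client call in your SDK as an assumption and check its docs:

```python
# Keep temperature at the recommended 1.0 default, and request JSON output
# for consistent formatting.
generation_config = {
    "temperature": 1.0,                      # practice 1: don't lower without a reason
    "responseMimeType": "application/json",  # practice 4: JSON mode
    "maxOutputTokens": 1024,
}

def validate_temperature(config: dict) -> float:
    """Clamp temperature into Gemini's accepted 0.0-2.0 range."""
    t = config.get("temperature", 1.0)
    return min(max(t, 0.0), 2.0)

print(validate_temperature(generation_config))
```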

---

## 6. Quick Reference

### Validated Techniques

**Core Techniques**
- ✅ Chain-of-Thought: 80.2% vs 34% baseline
- ✅ "Take a deep breath and work step-by-step": Simple, effective trigger
- ✅ Few-shot (standard models): 3-5 examples is optimal
- ✅ Zero-shot (reasoning models): Required for o3, DeepSeek R1
- ✅ Self-consistency: Multiple analyses → convergent conclusions
- ✅ Tree of Thought: Multi-path exploration

**Frameworks**
- ✅ CO-STAR: Competition-winning framework for writing tasks
- ✅ ROSES: Structured decision support
- ✅ ReAct: 20-30% improvement on complex tasks
- ✅ Reflexion: 91% pass@1 on HumanEval

**Software Engineering**
- ✅ Security-first two-stage prompting: 50%+ reduction in vulnerabilities
- ✅ Test-driven development: Significantly reduces hallucination
- ✅ Architecture-first: Context → Goal → Constraints → Requirements
- ✅ Explicit instructions (Claude 4): Required for best results

**Model-Specific**
- ✅ XML tags (Claude): Improved structure parsing
- ✅ Extended thinking (Claude 4.5): 5-7% reasoning gains
- ✅ Literal formatting (GPT): Precise output control
- ✅ Temperature 1.0 (Gemini): Optimal default

### Debunked Myths

**Don't Work**
- ❌ $200 tip prompting: No consistent effect
- ❌ "Act as an expert": Zero accuracy improvement
- ❌ Politeness ("please", "thank you"): No performance benefit
- ❌ Emotional appeals: Generally ineffective
- ❌ Few-shot for reasoning models: Actively harms performance
- ❌ Vague instructions: Claude 4 won't fill the gaps
- ❌ Negative framing: Less effective than positive framing

**Outdated**
- ❌ GPT-3-era techniques: Modern models are fundamentally different
- ❌ Excessive prompt engineering: Many 2022-2023 "tricks" don't help
- ❌ One-size-fits-all prompts: Model-specific optimization is critical

### Common Pitfalls

**Pitfall 1: Few-Shot with Reasoning Models**
Problem: Examples reduce o3/DeepSeek R1 accuracy. Solution: Zero-shot only.

**Pitfall 2: Implicit Requirements with Claude 4**
Problem: The model won't infer unstated needs. Solution: Be extremely explicit about all requirements.

**Pitfall 3: Ignoring Security**
Problem: 40%+ of AI-generated code has vulnerabilities without security prompting. Solution: Two-stage prompting.
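
A minimal sketch of the two-stage pattern (prompt wording and the example task are illustrative): stage 1 generates the implementation, stage 2 re-submits that output for a dedicated security review pass instead of trusting a single combined prompt.

```python
# Stage 1: generate with validation explicitly requested.
def stage_one_prompt(task: str) -> str:
    return f"Implement the following. Include input validation.\n\nTask: {task}"

# Stage 2: review the generated code strictly for security issues.
def stage_two_prompt(code: str) -> str:
    return (
        "Review this code strictly for security issues "
        "(injection, validation gaps, secrets, unsafe defaults). "
        "Return the corrected code.\n\n" + code
    )

first = stage_one_prompt("HTTP endpoint that stores a user-supplied filename")
second = stage_two_prompt("<code returned by stage 1>")
```

Keeping the review as its own request gives the model a single job per call, which is what drives the vulnerability reduction cited above.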

**Pitfall 4: Critical Info in the Middle**
Problem: Attention drops in the middle of long prompts ("lost in the middle"). Solution: Place critical information at the beginning or end.
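
A small sketch of the mitigation (the instruction and document names are invented): state the critical instruction first, park bulky reference material in the middle, and repeat the instruction at the end where attention recovers.

```python
# Assemble a long prompt so the critical instruction bookends the
# bulky middle section instead of being buried inside it.
def assemble_prompt(critical: str, background_docs: list) -> str:
    middle = "\n\n".join(background_docs)
    return (
        f"{critical}\n\n"
        f"--- Reference material ---\n{middle}\n\n"
        f"--- Reminder ---\n{critical}"
    )

prompt = assemble_prompt(
    "Answer ONLY from the reference material; say 'unknown' otherwise.",
    ["Doc A: ...", "Doc B: ..."],
)
```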

**Pitfall 5: Format Mismatch**
Problem: Messy prompt → messy output. Solution: Structure the prompt like the desired output.

**Pitfall 6: Wrong Model for the Task**
Problem: Using a reasoning model for simple generation, or a standard model for deep reasoning. Solution: See the Model Selection Framework.

**Pitfall 7: Insufficient Context**
Problem: The model lacks the information needed to complete the task correctly. Solution: Provide code, constraints, requirements, and why it matters.

### Performance Benchmarks

**SWE-bench Verified (Code Generation)**
- Claude Sonnet 4.5: **77.2%** (82.0% with high compute)
- Claude Opus 4.1: 74.5%
- Claude Haiku 4.5: 73.3%
- Claude Sonnet 4: 72.7%

**HumanEval (Code Correctness)**
- Reflexion pattern: **91% pass@1**
- Standard prompting: ~65% pass@1

**Reasoning Tasks**
- Chain-of-Thought: 80.2%
- Baseline (no CoT): 34%

**Computer Use (OSWorld)**
- Claude Sonnet 4.5: **61.4%** (45% improvement over Sonnet 4)
- Claude Sonnet 4: 42.2%

**Security Improvements**
- Security-first two-stage prompting: **50%+ reduction** in vulnerabilities
- Input validation coverage: 82-84% (vs 16-18% without security prompting)