ai-agent-rules 0.15.2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- ai_agent_rules-0.15.2.dist-info/METADATA +451 -0
- ai_agent_rules-0.15.2.dist-info/RECORD +52 -0
- ai_agent_rules-0.15.2.dist-info/WHEEL +5 -0
- ai_agent_rules-0.15.2.dist-info/entry_points.txt +3 -0
- ai_agent_rules-0.15.2.dist-info/licenses/LICENSE +22 -0
- ai_agent_rules-0.15.2.dist-info/top_level.txt +1 -0
- ai_rules/__init__.py +8 -0
- ai_rules/agents/__init__.py +1 -0
- ai_rules/agents/base.py +68 -0
- ai_rules/agents/claude.py +123 -0
- ai_rules/agents/cursor.py +70 -0
- ai_rules/agents/goose.py +47 -0
- ai_rules/agents/shared.py +35 -0
- ai_rules/bootstrap/__init__.py +75 -0
- ai_rules/bootstrap/config.py +261 -0
- ai_rules/bootstrap/installer.py +279 -0
- ai_rules/bootstrap/updater.py +344 -0
- ai_rules/bootstrap/version.py +52 -0
- ai_rules/cli.py +2434 -0
- ai_rules/completions.py +194 -0
- ai_rules/config/AGENTS.md +249 -0
- ai_rules/config/chat_agent_hints.md +1 -0
- ai_rules/config/claude/CLAUDE.md +1 -0
- ai_rules/config/claude/agents/code-reviewer.md +121 -0
- ai_rules/config/claude/commands/agents-md.md +422 -0
- ai_rules/config/claude/commands/annotate-changelog.md +191 -0
- ai_rules/config/claude/commands/comment-cleanup.md +161 -0
- ai_rules/config/claude/commands/continue-crash.md +38 -0
- ai_rules/config/claude/commands/dev-docs.md +169 -0
- ai_rules/config/claude/commands/pr-creator.md +247 -0
- ai_rules/config/claude/commands/test-cleanup.md +244 -0
- ai_rules/config/claude/commands/update-docs.md +324 -0
- ai_rules/config/claude/hooks/subagentStop.py +92 -0
- ai_rules/config/claude/mcps.json +1 -0
- ai_rules/config/claude/settings.json +119 -0
- ai_rules/config/claude/skills/doc-writer/SKILL.md +293 -0
- ai_rules/config/claude/skills/doc-writer/resources/templates.md +495 -0
- ai_rules/config/claude/skills/prompt-engineer/SKILL.md +272 -0
- ai_rules/config/claude/skills/prompt-engineer/resources/prompt_engineering_guide_2025.md +855 -0
- ai_rules/config/claude/skills/prompt-engineer/resources/templates.md +232 -0
- ai_rules/config/cursor/keybindings.json +14 -0
- ai_rules/config/cursor/settings.json +81 -0
- ai_rules/config/goose/.goosehints +1 -0
- ai_rules/config/goose/config.yaml +55 -0
- ai_rules/config/profiles/default.yaml +6 -0
- ai_rules/config/profiles/work.yaml +11 -0
- ai_rules/config.py +644 -0
- ai_rules/display.py +40 -0
- ai_rules/mcp.py +369 -0
- ai_rules/profiles.py +187 -0
- ai_rules/symlinks.py +207 -0
- ai_rules/utils.py +35 -0
ai_rules/config/claude/skills/prompt-engineer/resources/prompt_engineering_guide_2025.md (+855 lines):

# LLM Prompt Engineering Reference Guide (November 2025)

Comprehensive guide to effective prompting techniques for modern large language models based on current research and validated best practices.

---

## Table of Contents

1. [Introduction & Current Models](#1-introduction--current-models)
   - [Current Model Landscape](#current-model-landscape)
   - [Model Selection Framework](#model-selection-framework)

2. [Core Prompting Techniques](#2-core-prompting-techniques)
   - [Reasoning & Analysis](#reasoning--analysis)
   - [Structured Output Frameworks](#structured-output-frameworks)
   - [Learning Approaches](#learning-approaches)

3. [Software Engineering Prompts](#3-software-engineering-prompts)
   - [Development Workflows](#development-workflows)
   - [Code Generation Patterns](#code-generation-patterns)
   - [Debugging Strategies](#debugging-strategies)

4. [Advanced Techniques](#4-advanced-techniques)
   - [Agentic Patterns](#agentic-patterns)
   - [RAG & Knowledge Integration](#rag--knowledge-integration)
   - [Multi-Path Analysis](#multi-path-analysis)

5. [Model-Specific Optimizations](#5-model-specific-optimizations)
   - [Claude 4.5 Series](#claude-45-series)
   - [GPT-5 and GPT-4.1](#gpt-5-and-gpt-41)
   - [Reasoning Models (o3, DeepSeek R1)](#reasoning-models-o3-deepseek-r1)
   - [Gemini 2.5](#gemini-25)

6. [Quick Reference](#6-quick-reference)
   - [Validated Techniques](#validated-techniques)
   - [Debunked Myths](#debunked-myths)
   - [Common Pitfalls](#common-pitfalls)
   - [Performance Benchmarks](#performance-benchmarks)

---

## 1. Introduction & Current Models

### Current Model Landscape

| Model | Version | Best For | Context Window | Key Feature |
|-------|---------|----------|----------------|-------------|
| **Claude Sonnet 4.5** | Sept 2025 | Default choice, coding, agents | 200K (1M beta) | Extended thinking, 77.2% SWE-bench |
| **Claude Haiku 4.5** | Oct 2025 | Speed, cost optimization | 200K | 2-5x faster, frontier performance |
| **Claude Opus 4.1** | Aug 2025 | Maximum capability | 200K | Best for complex refactoring |
| **GPT-5** | Oct 2025 | General purpose | Large | 2 trillion parameters |
| **GPT-4.1** | 2025 | Agentic tasks | Large | Optimized for tool use |
| **o3 / o3-mini** | 2025 | Deep reasoning | N/A | Specialized reasoning model |
| **DeepSeek R1** | Jan 2025 | Cost-effective reasoning | 128K | 27x cheaper than o3, open source |
| **Gemini 2.5 Pro** | 2025 | Multimodal | Large | Best pricing ($1.25/$10 per M tokens) |

### Model Selection Framework

**Claude Sonnet 4.5:** Default for most tasks, coding, autonomous agents (30+ hour focus), computer use. Best capability/cost balance.

**Claude Haiku 4.5:** Speed-critical (2-5x faster), high-volume processing, sub-agents, cost optimization. Frontier performance at 1/3 price.

**Claude Opus 4.1:** Maximum capability needs, complex multi-file refactoring, intricate multi-agent frameworks when cost isn't the primary concern.

**GPT-5:** Broad general knowledge, non-coding tasks, massive parameter count benefits.

**o3 or DeepSeek R1:** Deep reasoning, mathematical/logical proofs, scientific analysis. DeepSeek for budget constraints (27x cheaper).

**Gemini 2.5 Pro:** Multimodal inputs, cost optimization at high volumes, competitive pricing with strong performance.

---

## 2. Core Prompting Techniques

### Reasoning & Analysis

#### Chain-of-Thought (CoT) Prompting

**What it is:** Instructs the model to show reasoning step-by-step before providing a final answer.

**When to use:** Complex problems requiring logical reasoning, mathematical calculations, multi-step analysis, debugging.

**How to use:**

```
Solve this problem step by step:
A company's revenue was $500K in Q1. In Q2, it increased by 15%.
In Q3, it decreased by 10% from Q2. What was the Q3 revenue?

Show your work:
1. Calculate Q2 revenue
2. Calculate Q3 revenue
3. Provide the final answer
```

**Why it works:** Forces sequential processing, reducing errors. Research: 80.2% accuracy vs 34% baseline. "Take a deep breath and work step-by-step" is validated to improve performance.
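
For reference, the arithmetic the prompt asks the model to show works out as follows (a quick check of the expected answer, not part of the original example):

```python
q1 = 500_000
q2 = q1 * 1.15   # Q2: 15% increase over Q1 -> 575,000
q3 = q2 * 0.90   # Q3: 10% decrease from Q2 -> 517,500
print(f"Q3 revenue: ${q3:,.0f}")  # Q3 revenue: $517,500
```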

**Advanced variants:**
- **Chain-of-Table:** Tabular data analysis, 8.69% improvement over standard CoT
- **Plan-and-Solve:** Generates plan first, then executes, 7-27% gains

#### Tree of Thought (ToT)

**What it is:** Explores multiple reasoning paths simultaneously, then evaluates which leads to the best solution.

**When to use:** Decision-making with multiple options, strategic planning, trade-off analysis.

**How to use:**

```
Analyze this decision using Tree of Thought:

Decision: Should we migrate our database from PostgreSQL to MongoDB?

Explore three paths:
Path 1: Migrate fully to MongoDB
- List all advantages
- List all disadvantages
- Estimate effort and risk

Path 2: Keep PostgreSQL
- List reasons to stay
- List current pain points
- Estimate opportunity cost

Path 3: Hybrid approach (both databases)
- Describe the hybrid architecture
- List pros and cons
- Estimate complexity

After exploring all three paths, evaluate which approach is best and why.
```

**Why it works:** Prevents premature optimization and tunnel vision through systematic exploration.

#### Self-Consistency for Reliability

**What it is:** Generate multiple independent analyses, then identify themes appearing consistently across all.

**When to use:** High-stakes decisions, complex problems with uncertainty, validating outputs, reducing single-path errors.

**How to use:**

```
Generate 3 different analyses of employee retention challenges in tech companies:

Analysis 1: From an organizational psychology perspective
Analysis 2: From a labor economics perspective
Analysis 3: From an HR management perspective

After completing all three, identify convergent themes and recommendations.
```

**Why it works:** Multiple reasoning chains converging on the same conclusions catch model uncertainty and single-path errors.
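
The same idea can be automated as majority voting over independently sampled answers (the classic self-consistency setup). The sketch below assumes a `generate(prompt) -> str` callable wrapping whatever model client you use; it is an illustration, not part of the guide's example:

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(generate: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n independent answers (temperature > 0) and return the most common one."""
    answers = [generate(prompt).strip() for _ in range(n)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Usage: self_consistent_answer(lambda p: my_model_call(p), "What was the Q3 revenue? ...", n=5)
# my_model_call is a placeholder for your own client wrapper.
```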

#### Extended Thinking (Claude-Specific)

**What it is:** Claude 4.5 performs additional reasoning before generating its response, and that reasoning is visible to you.

**When to use:** Complex coding, deep analysis, multi-step logical reasoning, problems requiring careful consideration.

**How to invoke:**

Via API:
```json
{
  "model": "claude-sonnet-4-5-20250929",
  "max_tokens": 16000,
  "thinking": {
    "type": "enabled",
    "budget_tokens": 10000
  },
  "messages": [...]
}
```

Via Claude.ai: Toggle "Extended thinking" mode (requires Pro/Max/Team/Enterprise)

**Why it works:** Dedicated computational budget for reasoning achieves 5-7% performance gains. Thinking happens BEFORE the response.

**Key notes:** Minimum 1,024 tokens, charged at output rates, supports tool use (beta).
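
A minimal Python sketch of the same request using the `anthropic` SDK; the model ID and token budgets mirror the JSON above, while the prompt text and the response-block handling are illustrative assumptions about the current SDK shape:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},  # extended thinking budget
    messages=[{"role": "user", "content": "Plan the refactor of our billing module step by step."}],
)

for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")  # the visible reasoning
    elif block.type == "text":
        print(block.text)  # the final answer
```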

### Structured Output Frameworks

#### CO-STAR Framework for Writing

**Components:** **C**ontext, **O**bjective, **S**tyle, **T**one, **A**udience, **R**esponse format

**When to use:** Marketing copy, technical documentation, business communications, content creation.

**How to use:**

```
Context: Launching new API feature for real-time webhook notifications for payment events.

Objective: Write product announcement blog post generating excitement and driving adoption.

Style: Technical but accessible, conversational yet professional

Tone: Enthusiastic and developer-friendly

Audience: Software developers and engineering teams integrating with our payment API

Response format:
- Catchy headline
- Opening paragraph (2-3 sentences)
- "What's New" section with technical details
- "Why This Matters" section
- "Getting Started" section with code example
- Call-to-action
```

**Why it works:** Won prompt engineering competitions. Ensures all critical dimensions are specified.

#### ROSES Framework for Decisions

**Components:** **R**ole, **O**bjective, **S**cenario, **E**xpected Output, **S**tyle

**When to use:** Strategic business decisions, technical architecture choices, resource allocation, risk assessment.

**How to use:**

```
Role: Act as a CTO evaluating infrastructure decisions

Objective: Decide whether to adopt Kubernetes for our microservices architecture

Scenario:
- Currently running 12 microservices on EC2 with manual deployment
- Team of 8 engineers, 3 have container experience
- Growing 30% quarter-over-quarter
- Need to improve deployment speed and reliability

Expected Output:
- Recommendation (Yes/No/Phased)
- 3-5 key decision factors
- Risk assessment
- Implementation timeline if recommended

Style: Data-driven, practical, focused on team capabilities and business impact
```

**Why it works:** Structures complex decisions with clear boundaries, producing actionable recommendations vs generic analysis.

### Learning Approaches

#### Few-Shot Learning

**What it is:** Providing 3-5 examples of the desired input-output pattern before asking the model to perform the task.

**When to use:** Custom output formats, specific writing styles, pattern matching, domain conventions. **Standard models ONLY** (NOT reasoning models).

**How to use:**

```
Extract key information from product reviews into structured format.

Example 1:
Input: "Great laptop, fast processor but battery life could be better. Worth the price."
Output: {sentiment: "positive", pros: ["fast processor", "good value"], cons: ["battery life"], rating_implied: 4}

Example 2:
Input: "Terrible build quality. Broke after 2 weeks. Don't waste your money."
Output: {sentiment: "negative", pros: [], cons: ["build quality", "durability"], rating_implied: 1}

Now extract information from this review:
"Amazing screen quality and super lightweight. The speakers are weak though. Great for travel."
```

**Why it works:** Shows the model the exact pattern, reducing ambiguity. Effective with standard models (GPT-4, Claude Sonnet, Gemini).

**CRITICAL WARNING:** Few-shot **harms** reasoning models (o1, o3, DeepSeek R1). Use zero-shot for these models.
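
For standard models, a small helper that assembles example pairs into a single few-shot prompt keeps the pattern consistent across calls (a sketch reusing the review task above; the helper name is arbitrary):

```python
def build_few_shot_prompt(task: str, examples: list[tuple[str, str]], new_input: str) -> str:
    """Assemble a few-shot prompt: task description, 3-5 input/output examples, then the new input."""
    parts = [task, ""]
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f"Input: {example_input}", f"Output: {example_output}", ""]
    parts += ["Now extract information from this review:", new_input]
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Extract key information from product reviews into structured format.",
    [('"Great laptop, fast processor but battery life could be better."',
      '{sentiment: "positive", pros: ["fast processor"], cons: ["battery life"]}')],
    '"Amazing screen quality and super lightweight. The speakers are weak though."',
)
```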

#### Zero-Shot Prompting

**What it is:** Task instructions without examples, relying on the model's pre-training.

**When to use:** Reasoning models (o1, o3, DeepSeek R1) - **REQUIRED**, simple common tasks, when examples might bias output.

**How to use:**

```
Analyze this code for potential security vulnerabilities:

[CODE HERE]

For each vulnerability found:
- Describe the issue
- Explain the potential impact
- Provide a secure alternative
```

**Why it works:** Adding examples to reasoning models interferes with their internal reasoning process.

**Best practice:** Start zero-shot. Add few-shot only if: (1) NOT using a reasoning model, (2) output quality is insufficient, (3) you need a very specific format.

---

## 3. Software Engineering Prompts

### Development Workflows

#### Architecture-First Prompting

**Pattern:** Context → Goal → Constraints → Technical Requirements

**When to use:** New features, refactoring, integrating with existing codebases, complex implementations.

**How to use:**

```
Context: Express.js API with PostgreSQL. JWT tokens stored in PostgreSQL sessions table.

Goal: Implement rate limiting to prevent API abuse on authentication endpoints.

Constraints:
- Must work with existing JWT auth system
- Should not add latency >10ms
- Scale across multiple API server instances
- Cannot require additional database queries per request

Technical Requirements:
- Use Redis for distributed rate limit tracking
- Implement sliding window algorithm
- Different limits for authenticated vs unauthenticated
- Configurable limits per endpoint

Provide architecture design first, then implement core rate limiting middleware.
```

**Why it works:** Prevents code that solves the wrong problem or violates constraints. Establishes clear boundaries upfront.
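
To make the "sliding window in Redis" requirement concrete, here is a sliding-window-log sketch in Python using redis-py sorted sets (illustrative only; the key naming scheme and the shared Redis instance are assumptions, and the prompt above asks for an Express.js middleware rather than this helper):

```python
import time
import uuid
import redis  # redis-py; assumes a Redis instance shared by all API server instances

def allow_request(r: redis.Redis, key: str, limit: int, window_seconds: int) -> bool:
    """Sliding-window log: one sorted-set member per request, scored by its timestamp."""
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # evict requests that left the window
    pipe.zadd(key, {f"{now}:{uuid.uuid4()}": now})       # record this request
    pipe.zcard(key)                                      # count requests still in the window
    pipe.expire(key, window_seconds)                     # let idle keys expire
    _removed, _added, current_count, _ttl_set = pipe.execute()
    return current_count <= limit

# Usage: allow_request(redis.Redis(), "ratelimit:auth:203.0.113.7", limit=10, window_seconds=60)
```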

#### Security-First Two-Stage Prompting

**What it is:** Generate functional code first, then explicitly harden it for security.

**When to use:** Code handling user input, authentication/authorization, payments, database queries, file operations, API integrations.

**How to use:**

Stage 1:
```
Implement user registration endpoint:
- Accept email, password, username
- Validate email format
- Hash password with bcrypt
- Store in PostgreSQL users table
- Return success/error response
```

Stage 2:
```
Review the registration endpoint for security vulnerabilities:

[PASTE CODE FROM STAGE 1]

Harden against SQL injection, email injection, and error message information disclosure;
enforce a password policy (min 12 chars, complexity), rate limiting, and input sanitization.
```

**Why it works:** 40%+ of AI code has vulnerabilities without security prompting. Two-stage prompting reduces this by 50%+. It catches missing input validation (16-18% of code) and hardcoded credentials.

#### Test-Driven Development with AI

**What it is:** Write comprehensive test cases first, then the AI implements code until the tests pass.

**When to use:** Critical business logic, complex algorithms, reducing hallucination, ensuring correctness, refactoring.

**How to use:**

```
Write comprehensive test cases for a function validating credit card numbers using the Luhn algorithm.

Cover: Valid cards (Visa, MasterCard, Amex), invalid cards (wrong checksum), edge cases
(all zeros, all nines, single digit), invalid input (non-numeric, null, undefined, empty),
length validation (too short, too long).

First provide the test suite, then implement the function making all tests pass.
```

**Why it works:** Tests act as a formal specification, dramatically reducing hallucination with concrete pass/fail criteria.
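
As a concrete illustration of the workflow, the assertions below act as the specification and the Luhn implementation is written to satisfy them (a sketch; the valid numbers are well-known public test card numbers, not real accounts):

```python
def is_valid_card_number(value: str) -> bool:
    """Digits only (spaces allowed), 13-19 digits, and a passing Luhn checksum."""
    digits = value.replace(" ", "") if isinstance(value, str) else ""
    if not digits.isdigit() or not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):  # walk from the rightmost digit
        d = int(ch)
        if i % 2 == 1:                         # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# The "tests first" half of the workflow: these pin down the behaviour before implementing.
assert is_valid_card_number("4111 1111 1111 1111")   # Visa test number
assert is_valid_card_number("5555555555554444")      # MasterCard test number
assert is_valid_card_number("378282246310005")       # Amex test number
assert not is_valid_card_number("4111111111111112")  # wrong checksum
assert not is_valid_card_number("1234567")           # too short
assert not is_valid_card_number("not-a-number")      # non-numeric
assert not is_valid_card_number("")                  # empty input
```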

### Code Generation Patterns

#### Explicit Instruction Following (Claude 4)

**Why this matters:** Claude 4 follows instructions precisely but won't infer unstated requirements or add features not requested.

**How to adapt:**

❌ Too implicit: `Create a user profile component`

✅ Explicit:
```
Create a React user profile component with:
- Props: userId (string), onEdit (callback), readOnly (boolean)
- Display: avatar image, username, email, bio
- Edit button (only shown when readOnly=false)
- Loading state while fetching user data
- Error state with retry button if fetch fails
- Use Tailwind for styling
- TypeScript with proper type definitions
```

**Key principles:** State all requirements, specify error handling, define expected behaviors, list edge cases, provide context about WHY.

**Why it works:** Claude 4's architecture prioritizes instruction-following over inference.

#### Iterative Refinement Pattern

**What it is:** Generate initial code, then iteratively improve specific aspects in separate prompts.

**When to use:** Complex implementations, uncertain requirements, performance optimization, code review.

**Pattern:**
1. Basic implementation
2. Add features
3. Optimize
4. Production-ready (error handling, logging, types, docs)

**Why it works:** Breaks complexity into manageable pieces. Each iteration focuses on specific aspects. Validate direction before adding complexity.

### Debugging Strategies

#### Structured Debugging Pattern

**Pattern:** Error Message → Stack Trace → Context → Expected Behavior

**How to use:**

```
Error Message:
TypeError: Cannot read property 'map' of undefined at UserList.render (UserList.jsx:23)

Stack Trace:
at UserList.render (UserList.jsx:23:15)
at renderComponent (react-dom.js:1847)

Context:
- React component displaying user list
- Data from API endpoint /api/users
- Renders correctly on initial load
- Error on "Refresh" button click
- API call succeeds (200 with valid JSON)

Code: [PASTE RELEVANT CODE]

Expected: Refresh button fetches fresh data and re-renders without errors.

What's causing this and how do I fix it?
```

**Why it works:** Complete context enables root cause analysis vs guessing.

#### First Principles Debugging

**What it is:** Ask model to explain root cause from first principles rather than jumping to solutions.

**When to use:** Persistent bugs, unexpected behavior in complex systems, learning opportunities.

**How to use:**

```
PostgreSQL query extremely slow (5+ seconds) despite index on queried column.

Query: SELECT * FROM orders WHERE user_id = 123 AND created_at > '2025-01-01'
Index: CREATE INDEX idx_orders_user_id ON orders(user_id)
Table size: 10 million rows

Explain from first principles:
1. How PostgreSQL should be using the index
2. Why the index might not be helping
3. What's actually happening during query execution
4. The root cause of the slowness

Then suggest fix with explanation of why it works.
```

**Why it works:** Forces understanding vs pattern matching. Produces learning and prevents recurrence.

---

## 4. Advanced Techniques

### Agentic Patterns

#### ReAct Pattern (Reason + Act)

**Pattern:** Thought → Action → Observation → Thought → Action → ...

**When to use:** Multi-step tasks with tools, research/information gathering, complex problem-solving, autonomous agents.

**How to use:**

```
You are an agent using tools to answer questions. Follow ReAct pattern:

Thought: [Reason about what you need to do next]
Action: [Tool to use and how]
Observation: [Result from tool]
[Repeat until final answer]

Available tools:
- search(query): Search web
- calculate(expression): Evaluate math
- fetch_url(url): Get content from URL

Question: What was the GDP of the country that hosted the 2020 Olympics in the year they hosted?
```

**Why it works:** 20-30% improvement over direct prompting. Explicit reasoning loop prevents actions without thinking.

**Tips:** Define tools clearly, require "Thought" before "Action", allow multiple iterations, parse observations into the loop.
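
A minimal driver for this loop can be sketched as below; `complete` stands in for your model call and the tool set is a toy example, both assumptions rather than part of the guide:

```python
import re
from typing import Callable

TOOLS = {
    "calculate": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only: never eval untrusted input
    "search": lambda query: f"(stub) top result for: {query}",        # replace with a real search tool
}

def react_loop(complete: Callable[[str], str], question: str, max_steps: int = 5) -> str:
    """Run Thought -> Action -> Observation turns until the model emits 'Final Answer:'."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = complete(transcript)            # model returns Thought + Action (or a Final Answer)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+)\((.*)\)", step)
        if not match:
            continue                           # no tool call this turn; ask the model again
        tool_name, argument = match.group(1), match.group(2).strip().strip('"')
        tool = TOOLS.get(tool_name, lambda _a: f"unknown tool: {tool_name}")
        transcript += f"Observation: {tool(argument)}\n"  # feed the result back into the loop
    return "No final answer within the step budget."
```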

#### Reflexion Pattern

**Pattern:** Attempt → Evaluate → Reflect → Retry

**When to use:** Optimization tasks, learning from errors, iterative improvement.

**How to use:**

```
Solve this using Reflexion pattern:

Problem: Implement function finding longest palindromic substring.

Attempt 1: [Implementation]
Evaluation: Test with "babad", "cbbd", "a", "ac"
Reflection: What worked? Failed? Why? Try differently?
Attempt 2: [Improved implementation]
[Repeat until tests pass]
```

**Why it works:** 91% pass@1 on HumanEval. Self-reflection catches initial errors.

#### Multi-Agent Patterns

**Common patterns:** Hierarchical (manager delegates), Sequential (pipeline), Collaborative (multiple inputs synthesized).

**When to use:** Very complex tasks, quality validation, parallel exploration, large-scale projects.

**Example - Collaborative:**

```
Design new API endpoint for payment processing.

Agent 1 (API Designer): Design endpoint spec (method, path, schemas, status codes, auth)
Agent 2 (Security Reviewer): Review for auth vulnerabilities, authorization, input validation, info disclosure
Agent 3 (Performance Engineer): Analyze latency, query efficiency, caching, scalability
Final Synthesizer: Combine insights into final design balancing all concerns.
```

**Why it works:** Specialization enables deep focus. Multiple perspectives catch more issues. Higher quality but higher token cost.

### RAG & Knowledge Integration

#### Preventing Hallucination with RAG

**What it is:** Provide specific documents and instruct model to answer ONLY from that information.

**When to use:** Questions about specific documents, fact-based responses with citations, domain knowledge, preventing confabulation.

**How to use:**

```
Based ONLY on the following documents, answer the question.
If answer not in documents, say "I cannot answer based on provided documents."

Document 1: [CONTENT]
Document 2: [CONTENT]
Document 3: [CONTENT]

Question: [USER QUESTION]

Answer based only on documents above. Cite which document(s) used.
```

**Why it works:** "Based ONLY on" prevents parametric knowledge use, dramatically reducing hallucination.

**Best practices:** Use "ONLY"/"exclusively", explicitly instruct to say when doesn't know, request citations, keep documents focused (quality > quantity).
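
A small helper that assembles this grounded prompt from retrieved chunks (a sketch; how you retrieve and rank the documents is left to your RAG stack):

```python
def build_grounded_prompt(documents: list[str], question: str) -> str:
    """Build a RAG prompt that restricts the model to the supplied documents and asks for citations."""
    doc_block = "\n".join(f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Based ONLY on the following documents, answer the question.\n"
        'If the answer is not in the documents, say "I cannot answer based on provided documents."\n\n'
        f"{doc_block}\n\n"
        f"Question: {question}\n\n"
        "Answer based only on the documents above. Cite which document(s) you used."
    )
```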

#### Context Window Optimization

**Key finding:** Models pay most attention to beginning and end ("lost in the middle" problem).

**Best practices:**

1. **Critical information positioning:** Most important at start, supporting in middle, restate key context with query at end.

2. **Structured with markers:**

   ```
   <critical_information>[Most important]</critical_information>
   <background>[Additional context]</background>
   <task>[What you want done]</task>
   ```

3. **Relevance ordering:** Most relevant first, least relevant middle, second-most relevant end.

**Why it works:** Combats attention decay across long contexts.

### Multi-Path Analysis

#### Ensemble Methods

**What it is:** Generate multiple independent solutions and synthesize best answer.

**When to use:** High-stakes decisions, complex problems with multiple approaches, quality assurance, creative tasks.

**How to use:**

```
Generate 5 different database schemas for social media app with users, posts, comments, likes.

Solution 1-5: [GENERATE SCHEMAS]

Analyze all 5:
- Strengths of each?
- Trade-offs?
- Which aspects should be combined?

Provide final synthesized solution combining best elements.
```

**Why it works:** Explores solution space thoroughly. Synthesis catches individual weaknesses. Higher quality than single-shot.

#### Debate Pattern

**What it is:** Model argues different positions before reaching conclusion.

**When to use:** Controversial decisions, trade-off analysis, surfacing hidden assumptions, avoiding bias.

**How to use:**

```
Debate: "Should we use microservices or monolith?"

Round 1: Both advocates make strongest cases
Round 2: Both rebut opposing arguments
Round 3: Identify common ground and key trade-offs
Final synthesis: Nuanced recommendation based on debate
```

**Why it works:** Forces multiple perspectives. Reveals assumptions and argument weaknesses. More balanced conclusions.

---

## 5. Model-Specific Optimizations

### Claude 4.5 Series

**Key characteristics:** Requires explicit prompting. No "above and beyond" behavior. Won't add unstated features. Requires explicit error handling. Needs context about WHY. Responds better to positive framing.

#### XML Tags for Structure

**When to use:** Complex prompts with multiple sections, clear separation of examples/instructions/context, nested hierarchies.

**How to use:**

```
<context>
REST API for e-commerce. JWT authentication. PostgreSQL storage. RESTful conventions.
</context>

<requirements>
Create endpoint for updating product inventory:
- PUT /products/{productId}/inventory
- Requires authentication
- Accepts quantity (integer) in body
- Validates quantity non-negative
- Returns updated product with new inventory
</requirements>

<constraints>
- Use existing auth middleware
- Validate user has "inventory_manager" role
- Use database transactions
- Log inventory changes for audit
</constraints>

<examples>
<example>
<request>PUT /products/123/inventory with quantity: 50</request>
<response>{id: 123, name: "Widget", inventory: 50, updated_at: "2025-11-24T10:00:00Z"}</response>
</example>
</examples>

Implement this endpoint.
```

**Why it works:** Claude's training specifically recognizes XML tags for better parsing.
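
If you assemble Claude prompts programmatically, a tiny helper keeps the tag structure consistent (a sketch, not an official SDK utility; the section contents below are abbreviated from the example above):

```python
def xml_section(tag: str, body: str) -> str:
    """Wrap one prompt section in the XML tags Claude is trained to recognize."""
    return f"<{tag}>\n{body.strip()}\n</{tag}>"

prompt = "\n\n".join([
    xml_section("context", "REST API for e-commerce. JWT authentication. PostgreSQL storage."),
    xml_section("requirements", "Create endpoint for updating product inventory: PUT /products/{productId}/inventory"),
    xml_section("constraints", "Use existing auth middleware. Use database transactions."),
    "Implement this endpoint.",
])
```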

#### Best Practices

1. **Be extremely explicit:** List all requirements including error handling, loading states, edge cases, styling, types.
2. **Provide WHY context:** Explain reason for requirements (e.g., HIPAA compliance for strict passwords).
3. **Positive framing:** "Return descriptive error messages" not "Don't return error codes".
4. **Match format:** JSON prompts for JSON output, code structure for code output, markdown for markdown.

### GPT-5 and GPT-4.1

**Key characteristics:** Latest general-purpose (GPT-5: 2T parameters). GPT-4.1 optimized for agentic workflows and tool use. Both highly literal.

#### Literal Instruction Following

GPT models execute exactly what you ask without creative interpretation.

**Best practices:**

1. **Specify format precisely:** Show exact output structure with example.
2. **Define boundaries explicitly:** "Exactly 3 paragraphs, 3-4 sentences each".
3. **Use JSON mode:** Provide schema for structured output.

#### Tool Use with GPT-4.1

**Best practices** (a schema sketch follows this list):

1. **Define tools with precise schemas:** Include descriptions for all parameters.
2. **Provide tool use examples:** Show correct calling in system message or few-shot.
3. **Handle tool errors gracefully:** Clear error messages help model adjust.
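
As referenced above, a tool definition for GPT-4.1 typically looks like the sketch below, following the openai SDK's function-calling schema (the weather tool itself is a made-up example for illustration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, defined only for this example
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name, e.g. 'Berlin'"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # the model's tool call, if it chose to use one
```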

### Reasoning Models (o3, DeepSeek R1)

**Critical differences:** Zero-shot > few-shot (examples hurt), minimal prompting better, built-in reasoning (no CoT needed), higher latency, different use cases (deep reasoning not rapid generation).

#### When to Use Reasoning Models

**Use for:** Math problems/proofs, complex logical reasoning, scientific analysis, deep thought problems, multi-step solving.

**Don't use for:** Simple retrieval, content generation, code formatting, quick facts, high-throughput (use standard models).

#### Best Practices

1. **Keep prompts simple:** ✅ "Prove square root of 2 is irrational." ❌ "Think step by step. First consider X..."
2. **Don't provide examples:** ✅ "Solve: [PROBLEM]" ❌ "Example 1: [...] Example 2: [...] Now solve..."
3. **Let model show reasoning:** "Solve this problem and show your reasoning: [PROBLEM]"
4. **Trust thinking time:** 30+ seconds is normal and beneficial.

#### DeepSeek R1 vs o3

**Choose DeepSeek R1:** Budget constraints, self-hosting needs, research projects, high-volume reasoning (27x cheaper, MIT license, 128K context).

**Choose o3:** Maximum accuracy required, enterprise SLAs needed, proprietary data (can't self-host).

### Gemini 2.5

**Key characteristics:** Excellent multimodal, competitive pricing ($1.25/$10 per M tokens), temperature-sensitive, strong benchmarks.

#### Temperature Settings

**Critical:** More sensitive than other models. **Keep at 1.0** unless specific reason to change.

**Effects:** 0.0-0.3 (very deterministic, repetitive), 0.4-0.7 (balanced), 0.8-1.0 (creative, recommended), 1.0+ (increasingly random, caution).

#### Best Practices

1. **Default temperature 1.0**
2. **Leverage multimodal:** Combine images + text for richer context
3. **Cost optimization:** Excellent price/performance for high-volume
4. **Structured output:** Use JSON mode for consistent formatting

---

## 6. Quick Reference

### Validated Techniques

**Core Techniques**
- ✅ Chain-of-Thought: 80.2% vs 34% baseline
- ✅ "Take a deep breath and work step-by-step": Simple effective trigger
- ✅ Few-shot (standard models): 3-5 examples optimal
- ✅ Zero-shot (reasoning models): Required for o3, DeepSeek R1
- ✅ Self-consistency: Multiple analyses → convergent conclusions
- ✅ Tree of Thought: Multi-path exploration

**Frameworks**
- ✅ CO-STAR: Wins competitions for writing
- ✅ ROSES: Structured decision support
- ✅ ReAct: 20-30% improvement for complex tasks
- ✅ Reflexion: 91% pass@1 on HumanEval

**Software Engineering**
- ✅ Security-first two-stage: 50%+ reduction in vulnerabilities
- ✅ Test-driven development: Reduces hallucination significantly
- ✅ Architecture-first: Context → Goal → Constraints → Requirements
- ✅ Explicit instructions (Claude 4): Required for best results

**Model-Specific**
- ✅ XML tags (Claude): Improved structure parsing
- ✅ Extended thinking (Claude 4.5): 5-7% reasoning gains
- ✅ Literal formatting (GPT): Precise output control
- ✅ Temperature 1.0 (Gemini): Optimal default

### Debunked Myths

**Don't Work**
- ❌ $200 tip prompting: No consistent effect
- ❌ "Act as an expert": Zero accuracy improvement
- ❌ Politeness ("please", "thank you"): No performance benefit
- ❌ Emotional appeals: Generally ineffective
- ❌ Few-shot for reasoning models: Actively harms performance
- ❌ Vague instructions: Claude 4 won't fill gaps
- ❌ Negative framing: Less effective than positive

**Outdated**
- ❌ GPT-3 era techniques: Modern models fundamentally different
- ❌ Excessive prompt engineering: Many 2022-2023 "tricks" don't help
- ❌ One-size-fits-all: Model-specific optimization critical

### Common Pitfalls

**Pitfall 1: Few-Shot with Reasoning Models**
Problem: Examples reduce o3/DeepSeek R1 accuracy. Solution: Zero-shot only.

**Pitfall 2: Implicit Requirements with Claude 4**
Problem: Won't infer unstated needs. Solution: Extremely explicit about all requirements.

**Pitfall 3: Ignoring Security**
Problem: 40%+ of AI code has vulnerabilities without security prompting. Solution: Two-stage prompting.

**Pitfall 4: Critical Info in Middle**
Problem: Reduced attention in middle ("lost in the middle"). Solution: Place at beginning or end.

**Pitfall 5: Format Mismatch**
Problem: Messy prompt → messy output. Solution: Structure prompt like desired output.

**Pitfall 6: Wrong Model for Task**
Problem: Reasoning model for simple generation, or standard for deep reasoning. Solution: See Model Selection Framework.

**Pitfall 7: Insufficient Context**
Problem: Model lacks info to complete correctly. Solution: Provide code, constraints, requirements, why it matters.

### Performance Benchmarks

**SWE-bench Verified (Code Generation)**
- Claude Sonnet 4.5: **77.2%** (82.0% high compute)
- Claude Opus 4.1: 74.5%
- Claude Haiku 4.5: 73.3%
- Claude Sonnet 4: 72.7%

**HumanEval (Code Correctness)**
- Reflexion pattern: **91% pass@1**
- Standard prompting: ~65% pass@1

**Reasoning Tasks**
- Chain-of-Thought: 80.2%
- Baseline (no CoT): 34%

**Computer Use (OSWorld)**
- Claude Sonnet 4.5: **61.4%** (45% improvement over Sonnet 4)
- Claude Sonnet 4: 42.2%

**Security Improvements**
- Security-first two-stage: **50%+ reduction** in vulnerabilities
- Input validation coverage: 82-84% (vs 16-18% without security prompting)