loki-mode 4.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +691 -0
- package/SKILL.md +191 -0
- package/VERSION +1 -0
- package/autonomy/.loki/dashboard/index.html +2634 -0
- package/autonomy/CONSTITUTION.md +508 -0
- package/autonomy/README.md +201 -0
- package/autonomy/config.example.yaml +152 -0
- package/autonomy/loki +526 -0
- package/autonomy/run.sh +3636 -0
- package/bin/loki-mode.js +26 -0
- package/bin/postinstall.js +60 -0
- package/docs/ACKNOWLEDGEMENTS.md +234 -0
- package/docs/COMPARISON.md +325 -0
- package/docs/COMPETITIVE-ANALYSIS.md +333 -0
- package/docs/INSTALLATION.md +547 -0
- package/docs/auto-claude-comparison.md +276 -0
- package/docs/cursor-comparison.md +225 -0
- package/docs/dashboard-guide.md +355 -0
- package/docs/screenshots/README.md +149 -0
- package/docs/screenshots/dashboard-agents.png +0 -0
- package/docs/screenshots/dashboard-tasks.png +0 -0
- package/docs/thick2thin.md +173 -0
- package/package.json +48 -0
- package/references/advanced-patterns.md +453 -0
- package/references/agent-types.md +243 -0
- package/references/agents.md +1043 -0
- package/references/business-ops.md +550 -0
- package/references/competitive-analysis.md +216 -0
- package/references/confidence-routing.md +371 -0
- package/references/core-workflow.md +275 -0
- package/references/cursor-learnings.md +207 -0
- package/references/deployment.md +604 -0
- package/references/lab-research-patterns.md +534 -0
- package/references/mcp-integration.md +186 -0
- package/references/memory-system.md +467 -0
- package/references/openai-patterns.md +647 -0
- package/references/production-patterns.md +568 -0
- package/references/prompt-repetition.md +192 -0
- package/references/quality-control.md +437 -0
- package/references/sdlc-phases.md +410 -0
- package/references/task-queue.md +361 -0
- package/references/tool-orchestration.md +691 -0
- package/skills/00-index.md +120 -0
- package/skills/agents.md +249 -0
- package/skills/artifacts.md +174 -0
- package/skills/github-integration.md +218 -0
- package/skills/model-selection.md +125 -0
- package/skills/parallel-workflows.md +526 -0
- package/skills/patterns-advanced.md +188 -0
- package/skills/production.md +292 -0
- package/skills/quality-gates.md +180 -0
- package/skills/testing.md +149 -0
- package/skills/troubleshooting.md +109 -0
@@ -0,0 +1,568 @@
# Production Patterns Reference

Practitioner-tested patterns from Hacker News discussions and real-world deployments. These patterns represent what actually works in production, not theoretical frameworks.

---

## Overview

This reference consolidates battle-tested insights from:
- HN discussions on autonomous agents in production (2025)
- Practitioner experiences of coding with LLMs
- Simon Willison's Superpowers coding agent patterns
- Real-world multi-agent orchestration deployments

---

## What Actually Works in Production

### Human-in-the-Loop (HITL) is Non-Negotiable

**Key Insight:** "Zero companies don't have a human in the loop" for customer-facing applications.

```yaml
hitl_patterns:
  always_human:
    - Customer-facing responses
    - Financial transactions
    - Security-critical operations
    - Legal/compliance decisions

  automation_candidates:
    - Internal tooling
    - Developer assistance
    - Data preprocessing
    - Code generation (with review)

  implementation:
    - Classification layer routes to human vs automated
    - Confidence thresholds trigger escalation
    - Audit trails for all automated decisions
```
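
To make the implementation list concrete, here is a minimal sketch of such a gate; the category names and the `audit_log` sink are hypothetical placeholders, not part of any particular framework:

```python
# Hypothetical sketch of an HITL gate; the category sets mirror the YAML above.
ALWAYS_HUMAN = {"customer_response", "financial_txn", "security_op", "compliance"}
AUTOMATION_OK = {"internal_tooling", "dev_assist", "data_prep", "codegen"}

def audit_log(**fields) -> None:
    # Stand-in for a real append-only audit sink.
    print(fields)

def route(category: str, confidence: float, threshold: float = 0.9) -> str:
    """Return 'human' or 'automated'; every decision is audit-logged."""
    if category in ALWAYS_HUMAN or category not in AUTOMATION_OK:
        decision = "human"          # default to the human path
    elif confidence >= threshold:
        decision = "automated"
    else:
        decision = "human"          # below the confidence threshold: escalate
    audit_log(category=category, confidence=confidence, decision=decision)
    return decision
```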

### Narrow Scope Wins

**Key Insight:** Successful agents operate within tightly constrained domains.

```yaml
scope_constraints:
  max_steps_before_review: 3-5
  task_characteristics:
    - Specific, well-defined objectives
    - Pre-classified inputs
    - Deterministic success criteria
    - Verifiable outputs

successful_domains:
  - Email scanning and classification
  - Invoice processing
  - Code refactoring (bounded)
  - Documentation generation
  - Test writing

failure_prone_domains:
  - Open-ended feature implementation
  - Novel algorithm design
  - Security-critical code
  - Cross-system integrations
```
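
The `max_steps_before_review` constraint can be enforced by a thin driver loop; a sketch, under the assumption that the agent step and the human checkpoint are injected as callables:

```python
from typing import Callable

MAX_STEPS_BEFORE_REVIEW = 4   # within the 3-5 range suggested above

def run_with_checkpoints(step: Callable[[], bool],
                         review: Callable[[], bool],
                         max_steps: int = MAX_STEPS_BEFORE_REVIEW) -> bool:
    """Drive an agent loop, pausing for a human checkpoint every few steps.

    `step` performs one agent action and returns True when the task is done;
    `review` returns False to abort the trajectory.
    """
    count = 0
    while True:
        if step():
            return True               # task finished
        count += 1
        if count % max_steps == 0 and not review():
            return False              # reviewer stopped the run
```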

### Confidence-Based Routing

**Key Insight:** Treat agents as preprocessors, not decision-makers.

```python
def confidence_based_routing(agent_output):
    """
    Route based on confidence, not capability.
    Based on production practitioner patterns.
    """
    confidence = agent_output.confidence_score

    if confidence >= 0.95:
        # High confidence: auto-approve with logging
        return AutoApprove(audit_log=True)

    elif confidence >= 0.70:
        # Medium confidence: quick human review
        return HumanReview(priority="normal", timeout="1h")

    elif confidence >= 0.40:
        # Low confidence: detailed human review
        return HumanReview(priority="high", context="full")

    else:
        # Very low confidence: escalate immediately
        return Escalate(reason="low_confidence", require_senior=True)
```

### Classification Before Automation

**Key Insight:** Separate inputs before processing.

```yaml
classification_first:
  step_1_classify:
    workable:
      - Clear requirements
      - Existing patterns
      - Test coverage available
    non_workable:
      - Ambiguous requirements
      - Novel architecture
      - Missing dependencies
    escalate_immediately:
      - Security concerns
      - Compliance requirements
      - Customer-facing changes

  step_2_route:
    workable: "Automated pipeline"
    non_workable: "Human clarification"
    escalate: "Senior review"
```
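
A minimal sketch of the classify-then-route flow; the keyword rules stand in for a real classifier (trained model or LLM call), and all names are illustrative:

```python
# Hypothetical sketch: classify first, then route deterministically.
ROUTES = {
    "workable": "automated_pipeline",
    "non_workable": "human_clarification",
    "escalate": "senior_review",
}

ESCALATION_TERMS = ("security", "compliance", "customer-facing")
AMBIGUITY_TERMS = ("maybe", "somehow", "tbd")

def classify(task_description: str) -> str:
    text = task_description.lower()
    if any(term in text for term in ESCALATION_TERMS):
        return "escalate"
    if any(term in text for term in AMBIGUITY_TERMS):
        return "non_workable"
    return "workable"

def route(task_description: str) -> str:
    return ROUTES[classify(task_description)]

print(route("Refactor the invoice parser; tests exist"))  # automated_pipeline
print(route("Update the customer-facing checkout flow"))  # senior_review
```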

### Deterministic Outer Loops

**Key Insight:** Wrap agent outputs with rule-based validation.

```python
def deterministic_validation_loop(task, max_attempts=3):
    """
    Use LLMs only where genuine ambiguity exists.
    Wrap with deterministic rules.
    """
    for attempt in range(max_attempts):
        # LLM handles the ambiguous part
        output = agent.execute(task)

        # Deterministic validation (NOT LLM)
        validation_errors = []

        # Rule: Must have tests
        if not output.has_tests:
            validation_errors.append("Missing tests")

        # Rule: Must pass linting
        lint_result = run_linter(output.code)
        if lint_result.errors:
            validation_errors.append(f"Lint errors: {lint_result.errors}")

        # Rule: Must compile
        compile_result = compile_code(output.code)
        if not compile_result.success:
            validation_errors.append(f"Compile error: {compile_result.error}")

        # Rule: Tests must pass
        if output.has_tests:
            test_result = run_tests(output.code)
            if not test_result.all_passed:
                validation_errors.append(f"Test failures: {test_result.failures}")

        if not validation_errors:
            return output

        # Feed errors back for retry
        task = task.with_feedback(validation_errors)

    return FailedResult(reason="Max attempts exceeded")
```

---

## Context Engineering Patterns

### Context Curation Over Automatic Selection

**Key Insight:** Manually choose which files and information to provide.

```yaml
context_curation:
  principles:
    - "Less is more" - focused context beats comprehensive context
    - Manual selection outperforms automatic RAG
    - Remove outdated information aggressively

  anti_patterns:
    - Dumping entire codebase into context
    - Relying on automatic context selection
    - Accumulating conversation history indefinitely

  implementation:
    per_task_context:
      - 2-5 most relevant files
      - Specific functions, not entire modules
      - Recent changes only (last 1-2 days)
      - Clear success criteria

    context_budget:
      target: "< 10k tokens for context"
      reserve: "90% for model reasoning"
```
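
Enforcing the context budget can be as simple as a greedy packer over hand-picked snippets; a sketch, assuming a rough 4-characters-per-token heuristic rather than a real tokenizer:

```python
# Hypothetical sketch: assemble curated context under a hard token budget.
CHARS_PER_TOKEN = 4      # rough heuristic; use a real tokenizer in practice
BUDGET_TOKENS = 10_000   # matches the "< 10k tokens" target above

def build_context(snippets: list[str], budget_tokens: int = BUDGET_TOKENS) -> str:
    """Add hand-picked snippets in priority order until the budget is hit."""
    picked, used = [], 0
    for snippet in snippets:                # caller orders by relevance
        cost = len(snippet) // CHARS_PER_TOKEN + 1
        if used + cost > budget_tokens:
            break                           # stop rather than truncate mid-file
        picked.append(snippet)
        used += cost
    return "\n\n".join(picked)
```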

### Information Abstraction

**Key Insight:** Summarize rather than feeding full data.

```python
def abstract_for_agent(raw_data, task_context):
    """
    Design abstractions that preserve decision-relevant information.
    Based on practitioner insights.
    """
    # BAD: Feed 10,000 database rows
    # raw_data = db.query("SELECT * FROM users")

    # GOOD: Summarize to decision-relevant info
    summary = {
        "query_status": "success",
        "total_results": len(raw_data),
        "sample": raw_data[:5],
        "schema": extract_schema(raw_data),
        "statistics": {
            "null_count": count_nulls(raw_data),
            "unique_values": count_uniques(raw_data),
            "date_range": get_date_range(raw_data)
        }
    }

    return summary
```

### Separate Conversations Per Task

**Key Insight:** Fresh contexts yield better results than accumulated sessions.

```yaml
conversation_management:
  new_conversation_triggers:
    - Different domain (backend -> frontend)
    - New feature vs bug fix
    - After completing a major task
    - When errors accumulate (3+ in a row)

  preserve_across_sessions:
    - CLAUDE.md / CONTINUITY.md
    - Architectural decisions
    - Key constraints

  discard_between_sessions:
    - Debugging attempts
    - Abandoned approaches
    - Intermediate drafts
```
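
The error-accumulation trigger is easy to track mechanically; a sketch, with the threshold of three mirroring the list above:

```python
# Hypothetical sketch: decide when to start a fresh conversation.
ERROR_RESET_THRESHOLD = 3

class SessionTracker:
    def __init__(self) -> None:
        self.consecutive_errors = 0

    def record(self, succeeded: bool) -> bool:
        """Record a step result; return True if the session should reset."""
        self.consecutive_errors = 0 if succeeded else self.consecutive_errors + 1
        return self.consecutive_errors >= ERROR_RESET_THRESHOLD

tracker = SessionTracker()
for ok in (True, False, False, False):
    if tracker.record(ok):
        print("3 consecutive errors -> start a fresh conversation")
```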

---

## Skills System Pattern

### On-Demand Skill Loading

**Key Insight:** Skills remain dormant until the model actively seeks them out.

```yaml
skills_architecture:
  core_interaction: "< 2k tokens"
  skill_loading: "On-demand via search"

  implementation:
    skill_discovery:
      - Shell script searches skill files
      - Model requests specific skills by name
      - Skills loaded only when needed

    skill_structure:
      name: "unique-skill-name"
      trigger: "Pattern that activates skill"
      content: "Detailed instructions"
      dependencies: ["other-skills"]

  benefits:
    - Minimal base context
    - Extensible without bloat
    - Skills can be updated independently
```
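
A minimal sketch of on-demand loading over a directory of markdown skill files; the `skills/` path and plain-text matching are assumptions for illustration:

```python
# Hypothetical sketch: load skill files only when the model asks for them.
from pathlib import Path

SKILLS_DIR = Path("skills")   # e.g. this package's skills/*.md files

def find_skills(keyword: str) -> list[str]:
    """Cheap discovery pass: names of skill files mentioning the keyword."""
    return [
        p.stem for p in sorted(SKILLS_DIR.glob("*.md"))
        if keyword.lower() in p.read_text(encoding="utf-8").lower()
    ]

def load_skill(name: str) -> str:
    """The full skill body enters context only after an explicit request."""
    return (SKILLS_DIR / f"{name}.md").read_text(encoding="utf-8")
```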

### Sub-Agents for Context Isolation

**Key Insight:** Prevent massive token waste by isolating context-noisy subtasks.

```python
async def context_isolated_search(query, codebase_path):
    """
    Use a sub-agent for grep/search to prevent context pollution.
    Based on Simon Willison's patterns.
    """
    # Main agent stays focused
    # Sub-agent handles noisy file searching

    search_agent = spawn_subagent(
        role="codebase-searcher",
        context_limit="10k tokens",
        permissions=["read-only"]
    )

    results = await search_agent.execute(
        task=f"Find files related to: {query}",
        codebase=codebase_path
    )

    # Return only relevant paths, not full content
    return FilteredResults(
        paths=results.relevant_files[:10],
        summaries=results.file_summaries,
        confidence=results.relevance_scores
    )
```

---

## Planning Before Execution

### Explicit Plan-Then-Code Workflow

**Key Insight:** Have models articulate detailed plans without immediately writing code.

```yaml
plan_then_code:
  phase_1_planning:
    outputs:
      - spec.md: "Detailed requirements"
      - todo.md: "Tagged tasks [BUG], [FEAT], [REFACTOR]"
      - approach.md: "Implementation strategy"
    constraints:
      - NO CODE in this phase
      - Human review before proceeding
      - Clear success criteria

  phase_2_review:
    checks:
      - Plan addresses all requirements
      - Approach is feasible
      - No missing dependencies
      - Tests are specified

  phase_3_implementation:
    constraints:
      - Follow the plan exactly
      - One task at a time
      - Test after each change
      - Report deviations immediately
```
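
The phase gate can be made mechanical by refusing to start implementation until the planning artifacts exist and review has happened; a sketch using the file names from the YAML above:

```python
# Hypothetical sketch: gate implementation on the planning artifacts.
from pathlib import Path

REQUIRED_PLAN_FILES = ("spec.md", "todo.md", "approach.md")

def ready_to_implement(plan_dir: str, human_approved: bool) -> bool:
    """Phase 3 may start only if every plan file exists and a human signed off."""
    missing = [f for f in REQUIRED_PLAN_FILES if not Path(plan_dir, f).exists()]
    if missing:
        print(f"Blocked: missing planning artifacts {missing}")
        return False
    if not human_approved:
        print("Blocked: plan awaiting human review")
        return False
    return True
```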

---

## Multi-Agent Orchestration Patterns

### Event-Driven Coordination

**Key Insight:** Move beyond synchronous prompt chaining to asynchronous, decoupled systems.

```yaml
event_driven_orchestration:
  problems_with_synchronous:
    - Doesn't scale
    - Mixes orchestration with prompt logic
    - Single failure breaks entire chain
    - No retry/recovery mechanism

  async_architecture:
    message_queue:
      - Agents communicate via events
      - Decoupled execution
      - Natural retry/dead-letter handling

    state_management:
      - Persistent task state
      - Checkpoint/resume capability
      - Clear ownership of data

    error_handling:
      - Per-agent retry policies
      - Circuit breakers
      - Graceful degradation
```
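
A minimal sketch of the queue-based shape using `asyncio`; the failure flag stands in for a real agent error, and the retry/dead-letter limits are illustrative:

```python
# Hypothetical sketch: agents as decoupled queue consumers with retries.
import asyncio

MAX_RETRIES = 2

async def worker(name: str, queue: asyncio.Queue, dead_letter: list) -> None:
    while True:
        event = await queue.get()
        try:
            if event["fail"]:              # stand-in for an agent error
                raise RuntimeError("agent step failed")
            print(f"{name} handled event {event['id']}")
        except RuntimeError:
            event["retries"] = event.get("retries", 0) + 1
            if event["retries"] <= MAX_RETRIES:
                event["fail"] = False      # pretend the retry succeeds
                await queue.put(event)     # requeue instead of crashing the chain
            else:
                dead_letter.append(event)  # park it for human inspection
        finally:
            queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    dead_letter: list = []
    for i, fail in enumerate([False, True, False]):
        await queue.put({"id": i, "fail": fail})
    consumer = asyncio.create_task(worker("agent-a", queue, dead_letter))
    await queue.join()                     # wait until all events are processed
    consumer.cancel()
    await asyncio.gather(consumer, return_exceptions=True)

asyncio.run(main())
```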

### Policy-First Enforcement

**Key Insight:** Govern agent behavior at runtime, not just training time.

```python
class PolicyEngine:
    """
    Runtime governance for agent behavior.
    Based on autonomous control plane patterns.
    """

    def __init__(self, policies):
        self.policies = policies

    async def enforce(self, agent_action, context):
        for policy in self.policies:
            result = await policy.evaluate(agent_action, context)

            if result.blocked:
                return BlockedAction(
                    reason=result.reason,
                    policy=policy.name,
                    remediation=result.suggested_action
                )

            if result.modified:
                agent_action = result.modified_action

        return AllowedAction(agent_action)

# Example policies
policies = [
    NoProductionDataDeletion(),
    NoSecretsInCode(),
    MaxTokenBudget(limit=100000),
    RequireTestsForCode(),
    BlockExternalNetworkCalls(in_sandbox=True)
]
```

### Simulation Layer

**Key Insight:** Evaluate changes before deploying them to the real environment.

```yaml
simulation_layer:
  purpose: "Test agent behavior in a safe environment"

  implementation:
    sandbox_environment:
      - Isolated container
      - Mocked external services
      - Synthetic data
      - Full audit logging

    validation_checks:
      - Run tests in sandbox first
      - Compare outputs to expected
      - Check for policy violations
      - Measure resource consumption

    promotion_criteria:
      - All tests pass
      - No policy violations
      - Resource usage within limits
      - Human approval (for sensitive changes)
```
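
The promotion criteria reduce to a conjunction that is straightforward to encode; a sketch in which the `SandboxReport` fields and the CPU limit are hypothetical:

```python
# Hypothetical sketch: promotion gate over a sandbox run's results.
from dataclasses import dataclass

@dataclass
class SandboxReport:
    tests_passed: bool
    policy_violations: int
    cpu_seconds: float
    sensitive_change: bool

def may_promote(report: SandboxReport, human_approved: bool,
                cpu_limit: float = 300.0) -> bool:
    """All criteria must hold; sensitive changes additionally need sign-off."""
    return (
        report.tests_passed
        and report.policy_violations == 0
        and report.cpu_seconds <= cpu_limit
        and (not report.sensitive_change or human_approved)
    )

print(may_promote(SandboxReport(True, 0, 120.0, False), human_approved=False))  # True
```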

---

## Evaluation and Benchmarking

### Problems with Current Benchmarks

**Key Insight:** LLM-as-judge creates shared blind spots.

```yaml
benchmark_problems:
  llm_judge_issues:
    - Same architecture = same failure modes
    - Math errors accepted as correct
    - "Do-nothing" baseline passes 38% of the time

  contamination:
    - Published benchmarks become training targets
    - Overfitting to specific datasets
    - Inflated scores don't reflect real performance

  solutions:
    held_back_sets: "90% public, 10% private"
    human_evaluation: "Final published results require humans"
    production_testing: "A/B tests measure actual value"
    objective_outcomes: "Simulated environments with verifiable results"
```

### Practical Evaluation Approach

```python
import random

def evaluate_agent_change(before_agent, after_agent, task_set):
    """
    Production-oriented evaluation.
    Based on HN practitioner recommendations.
    """
    results = {
        "before": [],
        "after": [],
        "human_preference": []
    }

    for task in task_set:
        # Run both agents
        before_result = before_agent.execute(task)
        after_result = after_agent.execute(task)

        # Objective metrics (NOT LLM-judged)
        results["before"].append({
            "tests_pass": run_tests(before_result),
            "lint_clean": run_linter(before_result),
            "time_taken": before_result.duration,
            "tokens_used": before_result.tokens
        })

        results["after"].append({
            "tests_pass": run_tests(after_result),
            "lint_clean": run_linter(after_result),
            "time_taken": after_result.duration,
            "tokens_used": after_result.tokens
        })

        # Sample for human review
        if random.random() < 0.1:  # 10% sample
            results["human_preference"].append({
                "task": task,
                "before": before_result,
                "after": after_result,
                "pending_review": True
            })

    return EvaluationReport(results)
```

---

## Cost and Token Economics

### Real-World Cost Patterns

```yaml
cost_patterns:
  claude_code:
    heavy_use: "$25 per 1-2 hours on large codebases"
    api_range: "$1-5/hour depending on efficiency"
    max_tier: "$200/month tier often needs 2-3 subscriptions"

  token_economics:
    sub_agents_multiply_cost: "Each sub-agent duplicates the context"
    example: "5-task parallel job = 50,000+ tokens per subtask"

  optimization:
    context_isolation: "Use sub-agents for noisy tasks"
    information_abstraction: "Summarize, don't dump"
    fresh_conversations: "Reset after major tasks"
    skill_on_demand: "Load skills only when needed"
```
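
Back-of-envelope arithmetic shows why sub-agents multiply cost; in this sketch the per-token price is a placeholder, not a current rate:

```python
# Hypothetical sketch: rough cost of fanning a job out to N sub-agents.
PRICE_PER_1K_TOKENS = 0.01   # illustrative placeholder, not a real rate

def parallel_job_cost(shared_context_tokens: int, per_task_tokens: int,
                      num_subtasks: int) -> float:
    """Each sub-agent re-reads the shared context, so it is paid for N times."""
    total = num_subtasks * (shared_context_tokens + per_task_tokens)
    return total / 1000 * PRICE_PER_1K_TOKENS

# 5 subtasks, 40k shared context, 10k of work each -> 250k tokens total.
print(parallel_job_cost(40_000, 10_000, 5))  # 2.5 (dollars at the placeholder rate)
```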

---

## Sources

**Hacker News Discussions:**
- [What Actually Works in Production for Autonomous Agents](https://news.ycombinator.com/item?id=44623207)
- [Coding with LLMs in Summer 2025](https://news.ycombinator.com/item?id=44623953)
- [Superpowers: How I'm Using Coding Agents](https://news.ycombinator.com/item?id=45547344)
- [Claude Code Experience After Two Weeks](https://news.ycombinator.com/item?id=44596472)
- [AI Agent Benchmarks Are Broken](https://news.ycombinator.com/item?id=44531697)
- [How to Orchestrate Multi-Agent Workflows](https://news.ycombinator.com/item?id=45955997)
- [Context Engineering vs Prompt Engineering](https://news.ycombinator.com/item?id=44427757)

**Show HN Projects:**
- [Self-Evolving Agents Repository](https://news.ycombinator.com/item?id=45099226)
- [Package Manager for Agent Skills](https://news.ycombinator.com/item?id=46422264)
- [Wispbit - AI Code Review Agent](https://news.ycombinator.com/item?id=44722603)
- [Agtrace - Monitoring for AI Coding Agents](https://news.ycombinator.com/item?id=46425670)