opencode-skills-collection 1.0.186 → 1.0.187

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71) hide show
  1. package/bundled-skills/.antigravity-install-manifest.json +5 -1
  2. package/bundled-skills/3d-web-experience/SKILL.md +152 -37
  3. package/bundled-skills/agent-evaluation/SKILL.md +1088 -26
  4. package/bundled-skills/agent-memory-systems/SKILL.md +1037 -25
  5. package/bundled-skills/agent-tool-builder/SKILL.md +668 -16
  6. package/bundled-skills/ai-agents-architect/SKILL.md +271 -31
  7. package/bundled-skills/ai-product/SKILL.md +716 -26
  8. package/bundled-skills/ai-wrapper-product/SKILL.md +450 -44
  9. package/bundled-skills/algolia-search/SKILL.md +867 -15
  10. package/bundled-skills/autonomous-agents/SKILL.md +1033 -26
  11. package/bundled-skills/aws-serverless/SKILL.md +1046 -35
  12. package/bundled-skills/azure-functions/SKILL.md +1318 -19
  13. package/bundled-skills/browser-automation/SKILL.md +1065 -28
  14. package/bundled-skills/browser-extension-builder/SKILL.md +159 -32
  15. package/bundled-skills/bullmq-specialist/SKILL.md +347 -16
  16. package/bundled-skills/clerk-auth/SKILL.md +796 -15
  17. package/bundled-skills/computer-use-agents/SKILL.md +1870 -28
  18. package/bundled-skills/context-window-management/SKILL.md +271 -18
  19. package/bundled-skills/conversation-memory/SKILL.md +453 -24
  20. package/bundled-skills/crewai/SKILL.md +252 -46
  21. package/bundled-skills/discord-bot-architect/SKILL.md +1207 -34
  22. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  23. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  24. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  25. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  26. package/bundled-skills/docs/users/bundles.md +1 -1
  27. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  28. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  29. package/bundled-skills/docs/users/getting-started.md +1 -1
  30. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  31. package/bundled-skills/docs/users/usage.md +4 -4
  32. package/bundled-skills/docs/users/visual-guide.md +4 -4
  33. package/bundled-skills/email-systems/SKILL.md +646 -26
  34. package/bundled-skills/faf-expert/SKILL.md +221 -0
  35. package/bundled-skills/faf-wizard/SKILL.md +252 -0
  36. package/bundled-skills/file-uploads/SKILL.md +212 -11
  37. package/bundled-skills/firebase/SKILL.md +646 -16
  38. package/bundled-skills/gcp-cloud-run/SKILL.md +1117 -32
  39. package/bundled-skills/graphql/SKILL.md +1026 -27
  40. package/bundled-skills/hubspot-integration/SKILL.md +804 -19
  41. package/bundled-skills/idea-darwin/SKILL.md +120 -0
  42. package/bundled-skills/inngest/SKILL.md +431 -16
  43. package/bundled-skills/interactive-portfolio/SKILL.md +342 -44
  44. package/bundled-skills/langfuse/SKILL.md +296 -41
  45. package/bundled-skills/langgraph/SKILL.md +259 -50
  46. package/bundled-skills/micro-saas-launcher/SKILL.md +343 -44
  47. package/bundled-skills/neon-postgres/SKILL.md +572 -15
  48. package/bundled-skills/nextjs-supabase-auth/SKILL.md +269 -21
  49. package/bundled-skills/notion-template-business/SKILL.md +371 -44
  50. package/bundled-skills/personal-tool-builder/SKILL.md +537 -44
  51. package/bundled-skills/plaid-fintech/SKILL.md +825 -19
  52. package/bundled-skills/prompt-caching/SKILL.md +438 -25
  53. package/bundled-skills/rag-engineer/SKILL.md +271 -29
  54. package/bundled-skills/salesforce-development/SKILL.md +912 -19
  55. package/bundled-skills/satori/SKILL.md +54 -0
  56. package/bundled-skills/scroll-experience/SKILL.md +381 -44
  57. package/bundled-skills/segment-cdp/SKILL.md +817 -19
  58. package/bundled-skills/shopify-apps/SKILL.md +1475 -19
  59. package/bundled-skills/slack-bot-builder/SKILL.md +1162 -28
  60. package/bundled-skills/telegram-bot-builder/SKILL.md +152 -37
  61. package/bundled-skills/telegram-mini-app/SKILL.md +445 -44
  62. package/bundled-skills/trigger-dev/SKILL.md +916 -27
  63. package/bundled-skills/twilio-communications/SKILL.md +1310 -28
  64. package/bundled-skills/upstash-qstash/SKILL.md +898 -27
  65. package/bundled-skills/vercel-deployment/SKILL.md +637 -39
  66. package/bundled-skills/viral-generator-builder/SKILL.md +132 -37
  67. package/bundled-skills/voice-agents/SKILL.md +937 -27
  68. package/bundled-skills/voice-ai-development/SKILL.md +375 -46
  69. package/bundled-skills/workflow-automation/SKILL.md +982 -29
  70. package/bundled-skills/zapier-make-patterns/SKILL.md +772 -27
  71. package/package.json +1 -1
@@ -1,22 +1,39 @@
1
1
  ---
2
2
  name: autonomous-agents
3
- description: "You are an agent architect who has learned the hard lessons of autonomous AI. You've seen the gap between impressive demos and production disasters. You know that a 95% success rate per step means only 60% by step 10."
3
+ description: Autonomous agents are AI systems that can independently decompose
4
+ goals, plan actions, execute tools, and self-correct without constant human
5
+ guidance. The challenge isn't making them capable - it's making them reliable.
6
+ Every extra decision multiplies failure probability.
4
7
  risk: unknown
5
- source: "vibeship-spawner-skills (Apache 2.0)"
6
- date_added: "2026-02-27"
8
+ source: vibeship-spawner-skills (Apache 2.0)
9
+ date_added: 2026-02-27
7
10
  ---
8
11
 
9
12
  # Autonomous Agents
10
13
 
11
- You are an agent architect who has learned the hard lessons of autonomous AI.
12
- You've seen the gap between impressive demos and production disasters. You know
13
- that a 95% success rate per step means only 60% by step 10.
14
+ Autonomous agents are AI systems that can independently decompose goals,
15
+ plan actions, execute tools, and self-correct without constant human guidance.
16
+ The challenge isn't making them capable - it's making them reliable. Every
17
+ extra decision multiplies failure probability.
14
18
 
15
- Your core insight: Autonomy is earned, not granted. Start with heavily
16
- constrained agents that do one thing reliably. Add autonomy only as you prove
17
- reliability. The best agents look less impressive but work consistently.
19
+ This skill covers agent loops (ReAct, Plan-Execute), goal decomposition,
20
+ reflection patterns, and production reliability. Key insight: compounding
21
+ error rates kill autonomous agents. A 95% success rate per step drops to
22
+ 60% by step 10. Build for reliability first, autonomy second.
18
23
 
19
- You push for guardrails before capabilities, logging befor
24
+ 2025 lesson: The winners are constrained, domain-specific agents with clear
25
+ boundaries, not "autonomous everything." Treat AI outputs as proposals,
26
+ not truth.
27
+
28
+ ## Principles
29
+
30
+ - Reliability over autonomy - every step compounds error probability
31
+ - Constrain scope - domain-specific beats general-purpose
32
+ - Treat outputs as proposals, not truth
33
+ - Build guardrails before expanding capabilities
34
+ - Human-in-the-loop for critical decisions is non-negotiable
35
+ - Log everything - every action must be auditable
36
+ - Fail safely with rollback, not silently with corruption
20
37
 
21
38
  ## Capabilities
22
39
 
@@ -30,44 +47,1034 @@ You push for guardrails before capabilities, logging befor
30
47
  - agent-reliability
31
48
  - agent-guardrails
32
49
 
50
+ ## Scope
51
+
52
+ - multi-agent-systems → multi-agent-orchestration
53
+ - tool-building → agent-tool-builder
54
+ - memory-systems → agent-memory-systems
55
+ - workflow-orchestration → workflow-automation
56
+
57
+ ## Tooling
58
+
59
+ ### Frameworks
60
+
61
+ - LangGraph - When: Production agents with state management Note: 1.0 released Oct 2025, checkpointing, human-in-loop
62
+ - AutoGPT - When: Research/experimentation, open-ended exploration Note: Needs external guardrails for production
63
+ - CrewAI - When: Role-based agent teams Note: Good for specialized agent collaboration
64
+ - Claude Agent SDK - When: Anthropic ecosystem agents Note: Computer use, tool execution
65
+
66
+ ### Patterns
67
+
68
+ - ReAct - When: Reasoning + Acting in alternating steps Note: Foundation for most modern agents
69
+ - Plan-Execute - When: Separate planning from execution Note: Better for complex multi-step tasks
70
+ - Reflection - When: Self-evaluation and correction Note: Evaluator-optimizer loop
71
+
33
72
  ## Patterns
34
73
 
35
74
  ### ReAct Agent Loop
36
75
 
37
76
  Alternating reasoning and action steps
38
77
 
78
+ **When to use**: Interactive problem-solving, tool use, exploration
79
+
80
+ # REACT PATTERN:
81
+
82
+ """
83
+ The ReAct loop:
84
+ 1. Thought: Reason about what to do next
85
+ 2. Action: Choose and execute a tool
86
+ 3. Observation: Receive result
87
+ 4. Repeat until goal achieved
88
+
89
+ Key: Explicit reasoning traces make debugging possible
90
+ """
91
+
92
+ ## Basic ReAct Implementation
93
+ """
94
+ from langchain.agents import create_react_agent
95
+ from langchain_openai import ChatOpenAI
96
+
97
+ # Define the ReAct prompt template
98
+ react_prompt = '''
99
+ Answer the question using the following format:
100
+
101
+ Question: the input question
102
+ Thought: reason about what to do
103
+ Action: tool_name
104
+ Action Input: input to the tool
105
+ Observation: result of the action
106
+ ... (repeat Thought/Action/Observation as needed)
107
+ Thought: I now know the final answer
108
+ Final Answer: the answer
109
+ '''
110
+
111
+ # Create the agent
112
+ agent = create_react_agent(
113
+ llm=ChatOpenAI(model="gpt-4o"),
114
+ tools=tools,
115
+ prompt=react_prompt,
116
+ )
117
+
118
+ # Execute with step limit
119
+ result = agent.invoke(
120
+ {"input": query},
121
+ config={"max_iterations": 10} # Prevent runaway loops
122
+ )
123
+ """
124
+
125
+ ## LangGraph ReAct (Production)
126
+ """
127
+ from langgraph.prebuilt import create_react_agent
128
+ from langgraph.checkpoint.postgres import PostgresSaver
129
+
130
+ # Production checkpointer
131
+ checkpointer = PostgresSaver.from_conn_string(
132
+ os.environ["POSTGRES_URL"]
133
+ )
134
+
135
+ agent = create_react_agent(
136
+ model=llm,
137
+ tools=tools,
138
+ checkpointer=checkpointer, # Durable state
139
+ )
140
+
141
+ # Invoke with thread for state persistence
142
+ config = {"configurable": {"thread_id": "user-123"}}
143
+ result = agent.invoke({"messages": [query]}, config)
144
+ """
145
+
39
146
  ### Plan-Execute Pattern
40
147
 
41
148
  Separate planning phase from execution
42
149
 
150
+ **When to use**: Complex multi-step tasks, when full plan visibility matters
151
+
152
+ # PLAN-EXECUTE PATTERN:
153
+
154
+ """
155
+ Two-phase approach:
156
+ 1. Planning: Decompose goal into subtasks
157
+ 2. Execution: Execute subtasks, potentially re-plan
158
+
159
+ Advantages:
160
+ - Full visibility into plan before execution
161
+ - Can validate/modify plan with human
162
+ - Cleaner separation of concerns
163
+
164
+ Disadvantages:
165
+ - Less adaptive to mid-task discoveries
166
+ - Plan may become stale
167
+ """
168
+
169
+ ## LangGraph Plan-Execute
170
+ """
171
+ from langgraph.prebuilt import create_plan_and_execute_agent
172
+
173
+ # Planner creates the task list
174
+ planner_prompt = '''
175
+ For the given objective, create a step-by-step plan.
176
+ Each step should be atomic and actionable.
177
+ Format: numbered list of steps.
178
+ '''
179
+
180
+ # Executor handles individual steps
181
+ executor_prompt = '''
182
+ You are executing step {step_number} of the plan.
183
+ Previous results: {previous_results}
184
+ Current step: {current_step}
185
+ Execute this step using available tools.
186
+ '''
187
+
188
+ agent = create_plan_and_execute_agent(
189
+ planner=planner_llm,
190
+ executor=executor_llm,
191
+ tools=tools,
192
+ replan_on_error=True, # Re-plan if step fails
193
+ )
194
+
195
+ # Human approval of plan
196
+ config = {
197
+ "configurable": {
198
+ "thread_id": "task-456",
199
+ },
200
+ "interrupt_before": ["execute"], # Pause before execution
201
+ }
202
+
203
+ # First call creates plan
204
+ plan = agent.invoke({"objective": goal}, config)
205
+
206
+ # Review plan, then continue
207
+ if human_approves(plan):
208
+ result = agent.invoke(None, config) # Continue from checkpoint
209
+ """
210
+
211
+ ## Decomposition Strategies
212
+ """
213
+ # Decomposition-First: Plan everything, then execute
214
+ # Best for: Stable tasks, need full plan approval
215
+
216
+ # Interleaved: Plan one step, execute, repeat
217
+ # Best for: Dynamic tasks, learning as you go
218
+
219
+ def interleaved_execute(goal, max_steps=10):
220
+ state = {"goal": goal, "completed": [], "remaining": [goal]}
221
+
222
+ for step in range(max_steps):
223
+ # Plan next action based on current state
224
+ next_action = planner.plan_next(state)
225
+
226
+ if next_action == "DONE":
227
+ break
228
+
229
+ # Execute and update state
230
+ result = executor.execute(next_action)
231
+ state["completed"].append((next_action, result))
232
+
233
+ # Re-evaluate remaining work
234
+ state["remaining"] = planner.reassess(state)
235
+
236
+ return state
237
+ """
238
+
43
239
  ### Reflection Pattern
44
240
 
45
241
  Self-evaluation and iterative improvement
46
242
 
47
- ## Anti-Patterns
243
+ **When to use**: Quality matters, complex outputs, creative tasks
244
+
245
+ # REFLECTION PATTERN:
246
+
247
+ """
248
+ Self-correction loop:
249
+ 1. Generate initial output
250
+ 2. Evaluate against criteria
251
+ 3. Critique and identify issues
252
+ 4. Refine based on critique
253
+ 5. Repeat until satisfactory
254
+
255
+ Also called: Evaluator-Optimizer, Self-Critique
256
+ """
257
+
258
+ ## Basic Reflection
259
+ """
260
+ def reflect_and_improve(task, max_iterations=3):
261
+ # Initial generation
262
+ output = generator.generate(task)
263
+
264
+ for i in range(max_iterations):
265
+ # Evaluate output
266
+ critique = evaluator.critique(
267
+ task=task,
268
+ output=output,
269
+ criteria=[
270
+ "Correctness",
271
+ "Completeness",
272
+ "Clarity",
273
+ ]
274
+ )
275
+
276
+ if critique["passes_all"]:
277
+ return output
278
+
279
+ # Refine based on critique
280
+ output = generator.refine(
281
+ task=task,
282
+ previous_output=output,
283
+ critique=critique["feedback"],
284
+ )
285
+
286
+ return output # Best effort after max iterations
287
+ """
288
+
289
+ ## LangGraph Reflection
290
+ """
291
+ from langgraph.graph import StateGraph
292
+
293
+ def build_reflection_graph():
294
+ graph = StateGraph(ReflectionState)
295
+
296
+ # Nodes
297
+ graph.add_node("generate", generate_node)
298
+ graph.add_node("reflect", reflect_node)
299
+ graph.add_node("output", output_node)
300
+
301
+ # Edges
302
+ graph.add_edge("generate", "reflect")
303
+ graph.add_conditional_edges(
304
+ "reflect",
305
+ should_continue,
306
+ {
307
+ "continue": "generate", # Loop back
308
+ "end": "output",
309
+ }
310
+ )
311
+
312
+ return graph.compile()
313
+
314
+ def should_continue(state):
315
+ if state["iteration"] >= 3:
316
+ return "end"
317
+ if state["score"] >= 0.9:
318
+ return "end"
319
+ return "continue"
320
+ """
321
+
322
+ ## Separate Evaluator (More Robust)
323
+ """
324
+ # Use different model for evaluation to avoid self-bias
325
+ generator = ChatOpenAI(model="gpt-4o")
326
+ evaluator = ChatOpenAI(model="gpt-4o-mini") # Different perspective
327
+
328
+ # Or use specialized evaluators
329
+ from langchain.evaluation import load_evaluator
330
+ evaluator = load_evaluator("criteria", criteria="correctness")
331
+ """
332
+
333
+ ### Guardrailed Autonomy
334
+
335
+ Constrained agents with safety boundaries
336
+
337
+ **When to use**: Production systems, critical operations
338
+
339
+ # GUARDRAILED AUTONOMY:
340
+
341
+ """
342
+ Production agents need multiple safety layers:
343
+ 1. Input validation
344
+ 2. Action constraints
345
+ 3. Output validation
346
+ 4. Cost limits
347
+ 5. Human escalation
348
+ 6. Rollback capability
349
+ """
350
+
351
+ ## Multi-Layer Guardrails
352
+ """
353
+ class GuardedAgent:
354
+ def __init__(self, agent, config):
355
+ self.agent = agent
356
+ self.max_cost = config.get("max_cost_usd", 1.0)
357
+ self.max_steps = config.get("max_steps", 10)
358
+ self.allowed_actions = config.get("allowed_actions", [])
359
+ self.require_approval = config.get("require_approval", [])
360
+
361
+ async def execute(self, goal):
362
+ total_cost = 0
363
+ steps = 0
364
+
365
+ while steps < self.max_steps:
366
+ # Get next action
367
+ action = await self.agent.plan_next(goal)
368
+
369
+ # Validate action is allowed
370
+ if action.name not in self.allowed_actions:
371
+ raise ActionNotAllowedError(action.name)
372
+
373
+ # Check if approval needed
374
+ if action.name in self.require_approval:
375
+ approved = await self.request_human_approval(action)
376
+ if not approved:
377
+ return {"status": "rejected", "action": action}
378
+
379
+ # Estimate cost
380
+ estimated_cost = self.estimate_cost(action)
381
+ if total_cost + estimated_cost > self.max_cost:
382
+ raise CostLimitExceededError(total_cost)
383
+
384
+ # Execute with rollback capability
385
+ checkpoint = await self.save_checkpoint()
386
+ try:
387
+ result = await self.agent.execute(action)
388
+ total_cost += self.actual_cost(action)
389
+ steps += 1
390
+ except Exception as e:
391
+ await self.rollback_to(checkpoint)
392
+ raise
393
+
394
+ if result.is_complete:
395
+ break
396
+
397
+ return {"status": "complete", "total_cost": total_cost}
398
+ """
399
+
400
+ ## Least Privilege Principle
401
+ """
402
+ # Define minimal permissions per task type
403
+ TASK_PERMISSIONS = {
404
+ "research": ["web_search", "read_file"],
405
+ "coding": ["read_file", "write_file", "run_tests"],
406
+ "admin": ["all"], # Rarely grant this
407
+ }
408
+
409
+ def create_scoped_agent(task_type):
410
+ allowed = TASK_PERMISSIONS.get(task_type, [])
411
+ tools = [t for t in ALL_TOOLS if t.name in allowed]
412
+ return Agent(tools=tools)
413
+ """
414
+
415
+ ## Cost Control
416
+ """
417
+ # Context length grows quadratically in cost
418
+ # Double context = 4x cost
419
+
420
+ def trim_context(messages, max_tokens=4000):
421
+ # Keep system message and recent messages
422
+ system = messages[0]
423
+ recent = messages[-10:]
424
+
425
+ # Summarize middle if needed
426
+ if len(messages) > 11:
427
+ middle = messages[1:-10]
428
+ summary = summarize(middle)
429
+ return [system, summary] + recent
430
+
431
+ return messages
432
+ """
433
+
434
+ ### Durable Execution Pattern
435
+
436
+ Agents that survive failures and resume
437
+
438
+ **When to use**: Long-running tasks, production systems, multi-day processes
439
+
440
+ # DURABLE EXECUTION:
441
+
442
+ """
443
+ Production agents must:
444
+ - Survive server restarts
445
+ - Resume from exact point of failure
446
+ - Handle hours/days of runtime
447
+ - Allow human intervention mid-process
448
+
449
+ LangGraph 1.0 provides this natively.
450
+ """
451
+
452
+ ## LangGraph Checkpointing
453
+ """
454
+ from langgraph.checkpoint.postgres import PostgresSaver
455
+ from langgraph.graph import StateGraph
456
+
457
+ # Production checkpointer (not MemorySaver!)
458
+ checkpointer = PostgresSaver.from_conn_string(
459
+ os.environ["POSTGRES_URL"]
460
+ )
461
+
462
+ # Build graph with checkpointing
463
+ graph = StateGraph(AgentState)
464
+ # ... add nodes and edges ...
465
+
466
+ agent = graph.compile(checkpointer=checkpointer)
467
+
468
+ # Each invocation saves state
469
+ config = {"configurable": {"thread_id": "long-task-789"}}
470
+
471
+ # Start task
472
+ agent.invoke({"goal": complex_goal}, config)
473
+
474
+ # If server dies, resume later:
475
+ state = agent.get_state(config)
476
+ if not state.is_complete:
477
+ agent.invoke(None, config) # Continues from checkpoint
478
+ """
479
+
480
+ ## Human-in-the-Loop Interrupts
481
+ """
482
+ # Pause at specific nodes
483
+ agent = graph.compile(
484
+ checkpointer=checkpointer,
485
+ interrupt_before=["critical_action"], # Pause before
486
+ interrupt_after=["validation"], # Pause after
487
+ )
488
+
489
+ # First invocation pauses at interrupt
490
+ result = agent.invoke({"goal": goal}, config)
491
+
492
+ # Human reviews state
493
+ state = agent.get_state(config)
494
+ if human_approves(state):
495
+ # Continue from pause point
496
+ agent.invoke(None, config)
497
+ else:
498
+ # Modify state and continue
499
+ agent.update_state(config, {"approved": False})
500
+ agent.invoke(None, config)
501
+ """
502
+
503
+ ## Time-Travel Debugging
504
+ """
505
+ # LangGraph stores full history
506
+ history = list(agent.get_state_history(config))
507
+
508
+ # Go back to any previous state
509
+ past_state = history[5]
510
+ agent.update_state(config, past_state.values)
511
+
512
+ # Replay from that point with modifications
513
+ agent.invoke(None, config)
514
+ """
515
+
516
+ ## Sharp Edges
517
+
518
+ ### Error Probability Compounds Exponentially
519
+
520
+ Severity: CRITICAL
521
+
522
+ Situation: Building multi-step autonomous agents
523
+
524
+ Symptoms:
525
+ Agent works in demos but fails in production. Simple tasks succeed,
526
+ complex tasks fail mysteriously. Success rate drops dramatically
527
+ as task complexity increases. Users lose trust.
528
+
529
+ Why this breaks:
530
+ Each step has independent failure probability. A 95% success rate
531
+ per step sounds great until you realize:
532
+ - 5 steps: 77% success (0.95^5)
533
+ - 10 steps: 60% success (0.95^10)
534
+ - 20 steps: 36% success (0.95^20)
535
+
536
+ This is the fundamental limit of autonomous agents. Every additional
537
+ step multiplies failure probability.
538
+
539
+ Recommended fix:
540
+
541
+ ## Reduce step count
542
+ # Combine steps where possible
543
+ # Prefer fewer, more capable steps over many small ones
544
+
545
+ ## Increase per-step reliability
546
+ # Use structured outputs (JSON schemas)
547
+ # Add validation at each step
548
+ # Use better models for critical steps
549
+
550
+ ## Design for failure
551
+ class RobustAgent:
552
+ def execute_with_retry(self, step, max_retries=3):
553
+ for attempt in range(max_retries):
554
+ try:
555
+ result = step.execute()
556
+ if self.validate(result):
557
+ return result
558
+ except Exception as e:
559
+ if attempt == max_retries - 1:
560
+ raise
561
+ self.log_retry(step, attempt, e)
562
+
563
+ ## Break into checkpointed segments
564
+ # Human review at each segment
565
+ # Resume from last good checkpoint
566
+
567
+ ### API Costs Explode with Context Growth
568
+
569
+ Severity: CRITICAL
570
+
571
+ Situation: Running agents with growing conversation context
572
+
573
+ Symptoms:
574
+ $47 to close a single support ticket. Thousands in surprise API bills.
575
+ Agents getting slower as they run longer. Token counts exceeding
576
+ model limits.
577
+
578
+ Why this breaks:
579
+ Transformer costs scale quadratically with context length. Double
580
+ the context, quadruple the compute. A long-running agent that
581
+ re-sends its full conversation each turn can burn money exponentially.
582
+
583
+ Most agents append to context without trimming. Context grows:
584
+ - Turn 1: 500 tokens → $0.01
585
+ - Turn 10: 5000 tokens → $0.10
586
+ - Turn 50: 25000 tokens → $0.50
587
+ - Turn 100: 50000 tokens → $1.00+ per message
588
+
589
+ Recommended fix:
590
+
591
+ ## Set hard cost limits
592
+ class CostLimitedAgent:
593
+ MAX_COST_PER_TASK = 1.00 # USD
594
+
595
+ def __init__(self):
596
+ self.total_cost = 0
597
+
598
+ def before_call(self, estimated_tokens):
599
+ estimated_cost = self.estimate_cost(estimated_tokens)
600
+ if self.total_cost + estimated_cost > self.MAX_COST_PER_TASK:
601
+ raise CostLimitExceeded(
602
+ f"Would exceed ${self.MAX_COST_PER_TASK} limit"
603
+ )
604
+
605
+ def after_call(self, response):
606
+ self.total_cost += self.calculate_actual_cost(response)
607
+
608
+ ## Trim context aggressively
609
+ def trim_context(messages, max_tokens=4000):
610
+ # Keep: system prompt + last N messages
611
+ # Summarize: everything in between
612
+ if count_tokens(messages) <= max_tokens:
613
+ return messages
614
+
615
+ system = messages[0]
616
+ recent = messages[-5:]
617
+ middle = messages[1:-5]
48
618
 
49
- ### ❌ Unbounded Autonomy
619
+ if middle:
620
+ summary = summarize(middle) # Compress history
621
+ return [system, summary] + recent
50
622
 
51
- ### Trusting Agent Outputs
623
+ return [system] + recent
52
624
 
53
- ### General-Purpose Autonomy
625
+ ## Use streaming to track costs in real-time
626
+ ## Alert at 50% of budget, halt at 90%
54
627
 
55
- ## ⚠️ Sharp Edges
628
+ ### Demo Works But Production Fails
56
629
 
57
- | Issue | Severity | Solution |
58
- |-------|----------|----------|
59
- | Issue | critical | ## Reduce step count |
60
- | Issue | critical | ## Set hard cost limits |
61
- | Issue | critical | ## Test at scale before production |
62
- | Issue | high | ## Validate against ground truth |
63
- | Issue | high | ## Build robust API clients |
64
- | Issue | high | ## Least privilege principle |
65
- | Issue | medium | ## Track context usage |
66
- | Issue | medium | ## Structured logging |
630
+ Severity: CRITICAL
631
+
632
+ Situation: Moving from prototype to production
633
+
634
+ Symptoms:
635
+ Impressive demo to stakeholders. Months of failure in production.
636
+ Works for the founder's use case, fails for real users. Edge cases
637
+ overwhelm the system.
638
+
639
+ Why this breaks:
640
+ Demos show the happy path with curated inputs. Production means:
641
+ - Unexpected inputs (typos, ambiguity, adversarial)
642
+ - Scale (1000 users, not 3)
643
+ - Reliability (99.9% uptime, not "usually works")
644
+ - Edge cases (the 1% that breaks everything)
645
+
646
+ The methodology is questionable, but the core problem is real.
647
+ The gap between a working demo and a reliable production system
648
+ is where projects die.
649
+
650
+ Recommended fix:
651
+
652
+ ## Test at scale before production
653
+ # Run 1000+ test cases, not 10
654
+ # Measure P95/P99 success rate, not average
655
+ # Include adversarial inputs
656
+
657
+ ## Build observability first
658
+ import structlog
659
+ logger = structlog.get_logger()
660
+
661
+ class ObservableAgent:
662
+ def execute(self, task):
663
+ with logger.bind(task_id=task.id):
664
+ logger.info("task_started")
665
+ try:
666
+ result = self._execute(task)
667
+ logger.info("task_completed", result=result)
668
+ return result
669
+ except Exception as e:
670
+ logger.error("task_failed", error=str(e))
671
+ raise
672
+
673
+ ## Have escape hatches
674
+ # Human takeover when confidence < threshold
675
+ # Graceful degradation to simpler behavior
676
+ # "I don't know" is a valid response
677
+
678
+ ## Deploy incrementally
679
+ # 1% of traffic, then 10%, then 50%
680
+ # Monitor error rates at each stage
681
+
682
+ ### Agent Fabricates Data When Stuck
683
+
684
+ Severity: HIGH
685
+
686
+ Situation: Agent can't complete task with available information
687
+
688
+ Symptoms:
689
+ Agent invents plausible-looking data. Fake restaurant names on expense
690
+ reports. Made-up statistics in reports. Confident answers that are
691
+ completely wrong.
692
+
693
+ Why this breaks:
694
+ LLMs are trained to be helpful and produce plausible outputs. When
695
+ stuck, they don't say "I can't do this" - they fabricate. Autonomous
696
+ agents compound this by acting on fabricated data without human review.
697
+
698
+ The agent that fabricated expense entries was trying to meet its goal
699
+ (complete the expense report). It "solved" the problem by inventing data.
700
+
701
+ Recommended fix:
702
+
703
+ ## Validate against ground truth
704
+ def validate_expense(expense):
705
+ # Cross-check with external sources
706
+ if expense.restaurant:
707
+ if not verify_restaurant_exists(expense.restaurant):
708
+ raise ValidationError("Restaurant not found")
709
+
710
+ # Check for suspicious patterns
711
+ if expense.amount == round(expense.amount, -1):
712
+ flag_for_review("Suspiciously round amount")
713
+
714
+ ## Require evidence
715
+ system_prompt = '''
716
+ For every factual claim, cite the specific tool output that
717
+ supports it. If you cannot find supporting evidence, say
718
+ "I could not verify this" rather than guessing.
719
+ '''
720
+
721
+ ## Use structured outputs
722
+ from pydantic import BaseModel
723
+
724
+ class VerifiedClaim(BaseModel):
725
+ claim: str
726
+ source: str # Must reference tool output
727
+ confidence: float
728
+
729
+ ## Detect uncertainty
730
+ # Train to output confidence scores
731
+ # Flag low-confidence outputs for human review
732
+ # Never auto-execute on uncertain data
733
+
734
+ ### Integration Is Where Agents Die
735
+
736
+ Severity: HIGH
737
+
738
+ Situation: Connecting agent to external systems
739
+
740
+ Symptoms:
741
+ Works with mock APIs, fails with real ones. Rate limits cause crashes.
742
+ Auth tokens expire mid-task. Data format mismatches. Partial failures
743
+ leave systems in inconsistent state.
744
+
745
+ Why this breaks:
746
+ The companies promising "autonomous agents that integrate with your
747
+ entire tech stack" haven't built production systems at scale.
748
+ Real integrations have:
749
+ - Rate limits (429 errors mid-task)
750
+ - Auth complexity (OAuth refresh, token expiry)
751
+ - Data format variations (API v1 vs v2)
752
+ - Partial failures (webhook received, processing failed)
753
+ - Eventual consistency (data not immediately available)
754
+
755
+ Recommended fix:
756
+
757
+ ## Build robust API clients
758
+ from tenacity import retry, stop_after_attempt, wait_exponential
759
+
760
+ class RobustAPIClient:
761
+ @retry(
762
+ stop=stop_after_attempt(3),
763
+ wait=wait_exponential(multiplier=1, min=4, max=60)
764
+ )
765
+ async def call(self, endpoint, data):
766
+ response = await self.client.post(endpoint, json=data)
767
+ if response.status_code == 429:
768
+ retry_after = response.headers.get("Retry-After", 60)
769
+ await asyncio.sleep(int(retry_after))
770
+ raise RateLimitError()
771
+ return response
772
+
773
+ ## Handle auth lifecycle
774
+ class TokenManager:
775
+ def __init__(self):
776
+ self.token = None
777
+ self.expires_at = None
778
+
779
+ async def get_token(self):
780
+ if self.is_expired():
781
+ self.token = await self.refresh_token()
782
+ return self.token
783
+
784
+ def is_expired(self):
785
+ buffer = timedelta(minutes=5) # Refresh early
786
+ return datetime.now() > (self.expires_at - buffer)
787
+
788
+ ## Use idempotency keys
789
+ # Every external action should be idempotent
790
+ # If agent retries, external system handles duplicate
791
+
792
+ ## Design for partial failure
793
+ # Each step is independently recoverable
794
+ # Checkpoint before external calls
795
+ # Rollback capability for each integration
796
+
797
+ ### Agent Takes Dangerous Actions
798
+
799
+ Severity: HIGH
800
+
801
+ Situation: Agent with broad permissions
802
+
803
+ Symptoms:
804
+ Agent deletes production data. Sends emails to wrong recipients.
805
+ Makes purchases without approval. Modifies settings it shouldn't.
806
+ Actions that can't be undone.
807
+
808
+ Why this breaks:
809
+ Agents optimize for their goal. Without guardrails, they'll take the
810
+ shortest path - even if that path is destructive. An agent told to
811
+ "clean up the database" might interpret that as "delete everything."
812
+
813
+ Broad permissions + autonomy + goal optimization = danger.
814
+
815
+ Recommended fix:
816
+
817
+ ## Least privilege principle
818
+ PERMISSIONS = {
819
+ "research_agent": ["read_web", "read_docs"],
820
+ "code_agent": ["read_file", "write_file", "run_tests"],
821
+ "email_agent": ["read_email", "draft_email"], # NOT send
822
+ "admin_agent": ["all"], # Rarely used
823
+ }
824
+
825
+ ## Separate read/write permissions
826
+ # Agent can read anything
827
+ # Write requires explicit approval
828
+
829
+ ## Dangerous actions require confirmation
830
+ DANGEROUS_ACTIONS = [
831
+ "delete_*",
832
+ "send_email",
833
+ "transfer_money",
834
+ "modify_production",
835
+ "revoke_access",
836
+ ]
837
+
838
+ async def execute_action(action):
839
+ if matches_dangerous_pattern(action):
840
+ approval = await request_human_approval(action)
841
+ if not approval:
842
+ return ActionRejected(action)
843
+ return await actually_execute(action)
844
+
845
+ ## Dry-run mode for testing
846
+ # Agent describes what it would do
847
+ # Human approves the plan
848
+ # Then agent executes
849
+
850
+ ## Audit logging for everything
851
+ # Every action logged with context
852
+ # Who authorized it
853
+ # What changed
854
+ # How to reverse it
855
+
856
+ ### Agent Runs Out of Context Window
857
+
858
+ Severity: MEDIUM
859
+
860
+ Situation: Long-running agent tasks
861
+
862
+ Symptoms:
863
+ Agent forgets earlier instructions. Contradicts itself. Loses track
864
+ of the goal. Starts repeating itself. Model errors about token limits.
865
+
866
+ Why this breaks:
867
+ Every message, observation, and thought consumes context. Long tasks
868
+ exhaust the window. When context is truncated:
869
+ - System prompt gets dropped
870
+ - Early important context lost
871
+ - Agent loses coherence
872
+
873
+ Recommended fix:
874
+
875
+ ## Track context usage
876
+ class ContextManager:
877
+ def __init__(self, max_tokens=100000):
878
+ self.max_tokens = max_tokens
879
+ self.messages = []
880
+
881
+ def add(self, message):
882
+ self.messages.append(message)
883
+ self.maybe_compact()
884
+
885
+ def maybe_compact(self):
886
+ if self.token_count() > self.max_tokens * 0.8:
887
+ self.compact()
888
+
889
+ def compact(self):
890
+ # Always keep: system prompt
891
+ system = self.messages[0]
892
+
893
+ # Always keep: last N messages
894
+ recent = self.messages[-10:]
895
+
896
+ # Summarize: everything else
897
+ middle = self.messages[1:-10]
898
+ if middle:
899
+ summary = summarize_messages(middle)
900
+ self.messages = [system, summary] + recent
901
+
902
+ ## Use external memory
903
+ # Don't keep everything in context
904
+ # Store in vector DB, retrieve when needed
905
+ # See agent-memory-systems skill
906
+
907
+ ## Hierarchical summarization
908
+ # Recent: full detail
909
+ # Medium: key points
910
+ # Old: compressed summary
911
+
912
+ ### Can't Debug What You Can't See
913
+
914
+ Severity: MEDIUM
915
+
916
+ Situation: Agent fails mysteriously
917
+
918
+ Symptoms:
919
+ "It just didn't work." No idea why agent failed. Can't reproduce
920
+ issues. Users report problems you can't explain. Debugging is
921
+ guesswork.
922
+
923
+ Why this breaks:
924
+ Agents make dozens of internal decisions. Without visibility into
925
+ each step, you're blind to failure modes. Production debugging
926
+ without traces is impossible.
927
+
928
+ Recommended fix:
929
+
930
+ ## Structured logging
931
+ import structlog
932
+
933
+ logger = structlog.get_logger()
934
+
935
+ class TracedAgent:
936
+ def think(self, context):
937
+ with logger.bind(step="think"):
938
+ thought = self.llm.generate(context)
939
+ logger.info("thought_generated",
940
+ thought=thought,
941
+ tokens=count_tokens(thought)
942
+ )
943
+ return thought
944
+
945
+ def act(self, action):
946
+ with logger.bind(step="act", action=action.name):
947
+ logger.info("action_started")
948
+ try:
949
+ result = action.execute()
950
+ logger.info("action_completed", result=result)
951
+ return result
952
+ except Exception as e:
953
+ logger.error("action_failed", error=str(e))
954
+ raise
955
+
956
+ ## Use LangSmith or similar
957
+ from langsmith import trace
958
+
959
+ @trace
960
+ def agent_step(state):
961
+ # Automatically traced with inputs/outputs
962
+ return next_state
963
+
964
+ ## Save full traces
965
+ # Every step, every decision
966
+ # Inputs and outputs
967
+ # Latency at each step
968
+ # Token usage
969
+
970
+ ## Validation Checks
971
+
972
+ ### Agent Loop Without Step Limit
973
+
974
+ Severity: ERROR
975
+
976
+ Autonomous agents must have maximum step limits
977
+
978
+ Message: Agent loop without step limit. Add max_steps to prevent infinite loops.
979
+
980
+ ### No Cost Tracking or Limits
981
+
982
+ Severity: ERROR
983
+
984
+ Agents should track and limit API costs
985
+
986
+ Message: Agent uses LLM without cost tracking. Add cost limits to prevent runaway spending.
987
+
988
+ ### Agent Without Timeout
989
+
990
+ Severity: WARNING
991
+
992
+ Long-running agents need timeouts
993
+
994
+ Message: Agent invocation without timeout. Add timeout to prevent hung tasks.
995
+
996
+ ### MemorySaver Used in Production
997
+
998
+ Severity: ERROR
999
+
1000
+ MemorySaver is for development only
1001
+
1002
+ Message: MemorySaver is not persistent. Use PostgresSaver or SqliteSaver for production.
1003
+
1004
+ ### Long-Running Agent Without Checkpointing
1005
+
1006
+ Severity: WARNING
1007
+
1008
+ Agents that run multiple steps need checkpointing
1009
+
1010
+ Message: Multi-step agent without checkpointing. Add checkpointer for durability.
1011
+
1012
+ ### Agent Without Thread ID
1013
+
1014
+ Severity: WARNING
1015
+
1016
+ Checkpointed agents need unique thread IDs
1017
+
1018
+ Message: Agent invocation without thread_id. State won't persist correctly.
1019
+
1020
+ ### Using Agent Output Without Validation
1021
+
1022
+ Severity: WARNING
1023
+
1024
+ Agent outputs should be validated before use
1025
+
1026
+ Message: Agent output used without validation. Validate before acting on results.
1027
+
1028
+ ### Agent Without Structured Output
1029
+
1030
+ Severity: INFO
1031
+
1032
+ Structured outputs are more reliable
1033
+
1034
+ Message: Consider using structured outputs (Pydantic) for more reliable parsing.
1035
+
1036
+ ### Agent Without Error Recovery
1037
+
1038
+ Severity: WARNING
1039
+
1040
+ Agents should handle and recover from errors
1041
+
1042
+ Message: Agent call without error handling. Add try/catch or error handler.
1043
+
1044
+ ### Destructive Actions Without Rollback
1045
+
1046
+ Severity: WARNING
1047
+
1048
+ Actions that modify state should be reversible
1049
+
1050
+ Message: Destructive action without rollback capability. Save state before modification.
1051
+
1052
+ ## Collaboration
1053
+
1054
+ ### Delegation Triggers
1055
+
1056
+ - user needs multi-agent coordination -> multi-agent-orchestration (Multiple agents working together)
1057
+ - user needs to test/evaluate agent -> agent-evaluation (Benchmarking and testing)
1058
+ - user needs tools for agent -> agent-tool-builder (Tool design and implementation)
1059
+ - user needs persistent memory -> agent-memory-systems (Long-term memory architecture)
1060
+ - user needs workflow automation -> workflow-automation (When agent is overkill for the task)
1061
+ - user needs computer control -> computer-use-agents (GUI automation, screen interaction)
67
1062
 
68
1063
  ## Related Skills
69
1064
 
70
1065
  Works well with: `agent-tool-builder`, `agent-memory-systems`, `multi-agent-orchestration`, `agent-evaluation`
71
1066
 
72
1067
  ## When to Use
73
- This skill is applicable to execute the workflow or actions described in the overview.
1068
+
1069
+ - User mentions or implies: autonomous agent
1070
+ - User mentions or implies: autogpt
1071
+ - User mentions or implies: babyagi
1072
+ - User mentions or implies: self-prompting
1073
+ - User mentions or implies: goal decomposition
1074
+ - User mentions or implies: react pattern
1075
+ - User mentions or implies: agent loop
1076
+ - User mentions or implies: self-correcting agent
1077
+ - User mentions or implies: reflection agent
1078
+ - User mentions or implies: langgraph
1079
+ - User mentions or implies: agentic ai
1080
+ - User mentions or implies: agent planning