@bgicli/bgicli 2.2.7 → 2.2.9

Files changed (117)
  1. package/data/skills/anthropic-algorithmic-art/SKILL.md +405 -0
  2. package/data/skills/anthropic-canvas-design/SKILL.md +130 -0
  3. package/data/skills/anthropic-claude-api/SKILL.md +243 -0
  4. package/data/skills/anthropic-doc-coauthoring/SKILL.md +375 -0
  5. package/data/skills/anthropic-docx/SKILL.md +590 -0
  6. package/data/skills/anthropic-frontend-design/SKILL.md +42 -0
  7. package/data/skills/anthropic-internal-comms/SKILL.md +32 -0
  8. package/data/skills/anthropic-mcp-builder/SKILL.md +236 -0
  9. package/data/skills/anthropic-pdf/SKILL.md +314 -0
  10. package/data/skills/anthropic-pptx/SKILL.md +232 -0
  11. package/data/skills/anthropic-skill-creator/SKILL.md +485 -0
  12. package/data/skills/anthropic-webapp-testing/SKILL.md +96 -0
  13. package/data/skills/anthropic-xlsx/SKILL.md +292 -0
  14. package/data/skills/arxiv-database/SKILL.md +362 -0
  15. package/data/skills/astropy/SKILL.md +329 -0
  16. package/data/skills/ctx-advanced-evaluation/SKILL.md +402 -0
  17. package/data/skills/ctx-bdi-mental-states/SKILL.md +311 -0
  18. package/data/skills/ctx-context-compression/SKILL.md +272 -0
  19. package/data/skills/ctx-context-degradation/SKILL.md +206 -0
  20. package/data/skills/ctx-context-fundamentals/SKILL.md +201 -0
  21. package/data/skills/ctx-context-optimization/SKILL.md +195 -0
  22. package/data/skills/ctx-evaluation/SKILL.md +251 -0
  23. package/data/skills/ctx-filesystem-context/SKILL.md +287 -0
  24. package/data/skills/ctx-hosted-agents/SKILL.md +260 -0
  25. package/data/skills/ctx-memory-systems/SKILL.md +225 -0
  26. package/data/skills/ctx-multi-agent-patterns/SKILL.md +257 -0
  27. package/data/skills/ctx-project-development/SKILL.md +291 -0
  28. package/data/skills/ctx-tool-design/SKILL.md +271 -0
  29. package/data/skills/dhdna-profiler/SKILL.md +162 -0
  30. package/data/skills/generate-image/SKILL.md +183 -0
  31. package/data/skills/geomaster/SKILL.md +365 -0
  32. package/data/skills/get-available-resources/SKILL.md +275 -0
  33. package/data/skills/hamelsmu-build-review-interface/SKILL.md +96 -0
  34. package/data/skills/hamelsmu-error-analysis/SKILL.md +164 -0
  35. package/data/skills/hamelsmu-eval-audit/SKILL.md +183 -0
  36. package/data/skills/hamelsmu-evaluate-rag/SKILL.md +177 -0
  37. package/data/skills/hamelsmu-generate-synthetic-data/SKILL.md +131 -0
  38. package/data/skills/hamelsmu-validate-evaluator/SKILL.md +212 -0
  39. package/data/skills/hamelsmu-write-judge-prompt/SKILL.md +144 -0
  40. package/data/skills/hf-cli/SKILL.md +174 -0
  41. package/data/skills/hf-mcp/SKILL.md +178 -0
  42. package/data/skills/hugging-face-dataset-viewer/SKILL.md +121 -0
  43. package/data/skills/hugging-face-datasets/SKILL.md +542 -0
  44. package/data/skills/hugging-face-evaluation/SKILL.md +651 -0
  45. package/data/skills/hugging-face-jobs/SKILL.md +1042 -0
  46. package/data/skills/hugging-face-model-trainer/SKILL.md +717 -0
  47. package/data/skills/hugging-face-paper-pages/SKILL.md +239 -0
  48. package/data/skills/hugging-face-paper-publisher/SKILL.md +624 -0
  49. package/data/skills/hugging-face-tool-builder/SKILL.md +110 -0
  50. package/data/skills/hugging-face-trackio/SKILL.md +115 -0
  51. package/data/skills/hugging-face-vision-trainer/SKILL.md +593 -0
  52. package/data/skills/huggingface-gradio/SKILL.md +245 -0
  53. package/data/skills/matlab/SKILL.md +376 -0
  54. package/data/skills/modal/SKILL.md +381 -0
  55. package/data/skills/openai-cloudflare-deploy/SKILL.md +224 -0
  56. package/data/skills/openai-develop-web-game/SKILL.md +149 -0
  57. package/data/skills/openai-doc/SKILL.md +80 -0
  58. package/data/skills/openai-figma/SKILL.md +42 -0
  59. package/data/skills/openai-figma-implement-design/SKILL.md +264 -0
  60. package/data/skills/openai-gh-address-comments/SKILL.md +25 -0
  61. package/data/skills/openai-gh-fix-ci/SKILL.md +69 -0
  62. package/data/skills/openai-imagegen/SKILL.md +174 -0
  63. package/data/skills/openai-jupyter-notebook/SKILL.md +107 -0
  64. package/data/skills/openai-linear/SKILL.md +87 -0
  65. package/data/skills/openai-netlify-deploy/SKILL.md +247 -0
  66. package/data/skills/openai-notion-knowledge-capture/SKILL.md +56 -0
  67. package/data/skills/openai-notion-meeting-intelligence/SKILL.md +60 -0
  68. package/data/skills/openai-notion-research-documentation/SKILL.md +59 -0
  69. package/data/skills/openai-notion-spec-to-implementation/SKILL.md +58 -0
  70. package/data/skills/openai-openai-docs/SKILL.md +69 -0
  71. package/data/skills/openai-pdf/SKILL.md +67 -0
  72. package/data/skills/openai-playwright/SKILL.md +147 -0
  73. package/data/skills/openai-render-deploy/SKILL.md +479 -0
  74. package/data/skills/openai-screenshot/SKILL.md +267 -0
  75. package/data/skills/openai-security-best-practices/SKILL.md +86 -0
  76. package/data/skills/openai-security-ownership-map/SKILL.md +206 -0
  77. package/data/skills/openai-security-threat-model/SKILL.md +81 -0
  78. package/data/skills/openai-sentry/SKILL.md +123 -0
  79. package/data/skills/openai-sora/SKILL.md +178 -0
  80. package/data/skills/openai-speech/SKILL.md +144 -0
  81. package/data/skills/openai-spreadsheet/SKILL.md +145 -0
  82. package/data/skills/openai-transcribe/SKILL.md +81 -0
  83. package/data/skills/openai-vercel-deploy/SKILL.md +77 -0
  84. package/data/skills/openai-yeet/SKILL.md +28 -0
  85. package/data/skills/pennylane/SKILL.md +224 -0
  86. package/data/skills/polars-bio/SKILL.md +374 -0
  87. package/data/skills/primekg/SKILL.md +97 -0
  88. package/data/skills/pymatgen/SKILL.md +689 -0
  89. package/data/skills/qiskit/SKILL.md +273 -0
  90. package/data/skills/qutip/SKILL.md +316 -0
  91. package/data/skills/recursive-decomposition/SKILL.md +185 -0
  92. package/data/skills/rowan/SKILL.md +427 -0
  93. package/data/skills/scholar-evaluation/SKILL.md +298 -0
  94. package/data/skills/sentry-create-alert/SKILL.md +210 -0
  95. package/data/skills/sentry-fix-issues/SKILL.md +126 -0
  96. package/data/skills/sentry-pr-code-review/SKILL.md +105 -0
  97. package/data/skills/sentry-python-sdk/SKILL.md +317 -0
  98. package/data/skills/sentry-setup-ai-monitoring/SKILL.md +217 -0
  99. package/data/skills/stable-baselines3/SKILL.md +297 -0
  100. package/data/skills/sympy/SKILL.md +498 -0
  101. package/data/skills/trailofbits-ask-questions-if-underspecified/SKILL.md +85 -0
  102. package/data/skills/trailofbits-audit-context-building/SKILL.md +302 -0
  103. package/data/skills/trailofbits-differential-review/SKILL.md +220 -0
  104. package/data/skills/trailofbits-insecure-defaults/SKILL.md +117 -0
  105. package/data/skills/trailofbits-modern-python/SKILL.md +333 -0
  106. package/data/skills/trailofbits-property-based-testing/SKILL.md +123 -0
  107. package/data/skills/trailofbits-semgrep-rule-creator/SKILL.md +172 -0
  108. package/data/skills/trailofbits-sharp-edges/SKILL.md +292 -0
  109. package/data/skills/trailofbits-variant-analysis/SKILL.md +142 -0
  110. package/data/skills/transformers.js/SKILL.md +637 -0
  111. package/data/skills/writing/SKILL.md +419 -0
  112. package/data/workflows/survival-analysis-clinical/SKILL.md +348 -0
  113. package/data/workflows/survival-analysis-clinical/scripts/full_workflow.R +95 -0
  114. package/data/workflows/survival-analysis-clinical/scripts/load_example_data.R +65 -0
  115. package/data/workflows/survival-analysis-clinical/scripts/plot_forest.R +46 -0
  116. package/dist/bgi.js +1608 -233
  117. package/package.json +45 -45
@@ -0,0 +1,257 @@
1
+ ---
2
+ name: multi-agent-patterns
3
+ description: This skill should be used when the user asks to "design multi-agent system", "implement supervisor pattern", "create swarm architecture", "coordinate multiple agents", or mentions multi-agent patterns, context isolation, agent handoffs, sub-agents, or parallel agent execution.
4
+ ---
5
+
6
+ # Multi-Agent Architecture Patterns
7
+
8
+ Multi-agent architectures distribute work across multiple language model instances, each with its own context window. When designed well, this distribution enables capabilities beyond single-agent limits. When designed poorly, it introduces coordination overhead that negates benefits. The critical insight is that sub-agents exist primarily to isolate context, not to anthropomorphize role division.
9
+
10
+ ## When to Activate
11
+
12
+ Activate this skill when:
13
+ - Single-agent context limits constrain task complexity
14
+ - Tasks decompose naturally into parallel subtasks
15
+ - Different subtasks require different tool sets or system prompts
16
+ - Building systems that must handle multiple domains simultaneously
17
+ - Scaling agent capabilities beyond single-context limits
18
+ - Designing production agent systems with multiple specialized components
19
+
20
+ ## Core Concepts
21
+
22
+ Use multi-agent patterns when a single agent's context window cannot hold all task-relevant information. Context isolation is the primary benefit — each agent operates in a clean context without accumulated noise from other subtasks, preventing the telephone game problem where information degrades through repeated summarization.
23
+
24
+ Choose among three dominant patterns based on coordination needs, not organizational metaphor:
25
+
26
+ - **Supervisor/orchestrator** — Use for centralized control when tasks have clear decomposition and human oversight matters. A single coordinator delegates to specialists and synthesizes results.
27
+ - **Peer-to-peer/swarm** — Use for flexible exploration when rigid planning is counterproductive. Any agent can transfer control to any other through explicit handoff mechanisms.
28
+ - **Hierarchical** — Use for large-scale projects with layered abstraction (strategy, planning, execution). Each layer operates at a different level of detail with its own context structure.
29
+
30
+ Design every multi-agent system around explicit coordination protocols, consensus mechanisms that resist sycophancy, and failure handling that prevents error propagation cascades.
31
+
32
+ ## Detailed Topics
33
+
34
+ ### Why Multi-Agent Architectures
35
+
36
+ **The Context Bottleneck**
37
+ Reach for multi-agent architectures when a single agent's context fills with accumulated history, retrieved documents, and tool outputs to the point where performance degrades. Recognize three degradation signals: the lost-in-middle effect (attention weakens for mid-context content), attention scarcity (too many competing items), and context poisoning (irrelevant content displaces useful content).
38
+
39
+ Partition work across multiple context windows so each agent operates in a clean context focused on its subtask. Aggregate results at a coordination layer without any single context bearing the full burden.
40
+
41
+ **The Token Economics Reality**
42
+ Budget for substantially higher token costs. Production data shows multi-agent systems run at approximately 15x the token cost of a single-agent chat:
43
+
44
+ | Architecture | Token Multiplier | Use Case |
45
+ |--------------|------------------|----------|
46
+ | Single agent chat | 1x baseline | Simple queries |
47
+ | Single agent with tools | ~4x baseline | Tool-using tasks |
48
+ | Multi-agent system | ~15x baseline | Complex research/coordination |
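To make the budgeting arithmetic concrete, here is a minimal sketch that scales a single-agent chat baseline by the approximate multipliers in the table. The function names and the per-million-token price parameter are hypothetical, and the multipliers are rough averages, not guarantees.

```python
# Approximate multipliers copied from the table above (rough averages).
TOKEN_MULTIPLIERS = {
    "single_agent_chat": 1,
    "single_agent_tools": 4,
    "multi_agent": 15,
}


def estimate_tokens(baseline_tokens: int, architecture: str) -> int:
    """Scale a single-agent chat baseline by the architecture's multiplier."""
    return baseline_tokens * TOKEN_MULTIPLIERS[architecture]


def estimate_cost_usd(
    baseline_tokens: int, architecture: str, usd_per_million_tokens: float
) -> float:
    """Convert the token estimate to dollars at an assumed flat price."""
    tokens = estimate_tokens(baseline_tokens, architecture)
    return tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    for arch in TOKEN_MULTIPLIERS:
        print(arch, estimate_tokens(2_000, arch))
```

A flat price is a simplification; real pricing usually differs for input and output tokens.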
49
+
50
+ Research on the BrowseComp evaluation found that three factors explain 95% of performance variance: token usage (80% of variance), number of tool calls, and model choice. This validates distributing work across agents with separate context windows to add capacity for parallel reasoning.
51
+
52
+ Prioritize model selection alongside architecture design — upgrading to better models often provides larger performance gains than doubling token budgets. BrowseComp data shows that model quality improvements frequently outperform raw token increases. Treat model selection and multi-agent architecture as complementary strategies.
53
+
54
+ **The Parallelization Argument**
55
+ Assign parallelizable subtasks to dedicated agents with fresh contexts rather than processing them sequentially in a single agent. A research task requiring searches across multiple independent sources, analysis of different documents, or comparison of competing approaches benefits from parallel execution. Total real-world time approaches the duration of the longest subtask rather than the sum of all subtasks.
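The wall-clock claim can be illustrated with a small sketch using `asyncio.gather`. Here `run_subtask` is a hypothetical stand-in for a sub-agent call (it only sleeps), so the timings show the scheduling effect, not real agent latency.

```python
import asyncio
import time


async def run_subtask(name: str, seconds: float) -> str:
    """Hypothetical sub-agent call; the sleep stands in for model latency."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def run_parallel(subtasks: dict[str, float]) -> tuple[list[str], float]:
    """Run all subtasks concurrently and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(run_subtask(name, secs) for name, secs in subtasks.items())
    )
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    durations = {"search": 0.05, "analyze": 0.10, "compare": 0.15}
    results, elapsed = asyncio.run(run_parallel(durations))
    # Elapsed time tracks the longest subtask, not the sum of all three.
    print(results, f"elapsed={elapsed:.2f}s", f"sum={sum(durations.values()):.2f}s")
```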
56
+
57
+ **The Specialization Argument**
58
+ Configure each agent with only the system prompt, tools, and context it needs for its specific subtask. A general-purpose agent must carry all possible configurations in context, diluting attention. Specialized agents carry only what they need, operating with lean context optimized for their domain. Route from a coordinator to specialized agents to achieve specialization without combinatorial explosion.
59
+
60
+ ### Architectural Patterns
61
+
62
+ **Pattern 1: Supervisor/Orchestrator**
63
+ Deploy a central agent that maintains global state and trajectory, decomposes user objectives into subtasks, and routes to appropriate workers.
64
+
65
+ ```
66
+ User Query -> Supervisor -> [Specialist, Specialist, Specialist] -> Aggregation -> Final Output
67
+ ```
68
+
69
+ Choose this pattern when: tasks have clear decomposition, coordination across domains is needed, or human oversight is important.
70
+
71
+ Expect these trade-offs: strict workflow control and easier human-in-the-loop interventions, but the supervisor context becomes a bottleneck, supervisor failures cascade to all workers, and the "telephone game" problem emerges where supervisors paraphrase sub-agent responses incorrectly.
72
+
73
+ **The Telephone Game Problem and Solution**
74
+ Anticipate that supervisor architectures initially perform approximately 50% worse than optimized versions due to the telephone game problem (LangGraph benchmarks). Supervisors paraphrase sub-agent responses, losing fidelity with each pass.
75
+
76
+ Fix this by implementing a `forward_message` tool that allows sub-agents to pass responses directly to users:
77
+
78
+ ```python
79
+ def forward_message(message: str, to_user: bool = True):
80
+     """
81
+     Forward sub-agent response directly to user without supervisor synthesis.
82
+
83
+     Use when:
84
+     - Sub-agent response is final and complete
85
+     - Supervisor synthesis would lose important details
86
+     - Response format must be preserved exactly
87
+     """
88
+     if to_user:
89
+         return {"type": "direct_response", "content": message}
90
+     return {"type": "supervisor_input", "content": message}
91
+ ```
92
+
93
+ Prefer swarm architectures over supervisors when sub-agents can respond directly to users, as this eliminates translation errors entirely.
94
+
95
+ **Pattern 2: Peer-to-Peer/Swarm**
96
+ Remove central control and allow agents to communicate directly based on predefined protocols. Any agent transfers control to any other through explicit handoff mechanisms.
97
+
98
+ ```python
99
+ def transfer_to_agent_b():
100
+     return agent_b # Handoff via function return
101
+
102
+ agent_a = Agent(
103
+     name="Agent A",
104
+     functions=[transfer_to_agent_b]
105
+ )
106
+ ```
107
+
108
+ Choose this pattern when: tasks require flexible exploration, rigid planning is counterproductive, or requirements emerge dynamically and defy upfront decomposition.
109
+
110
+ Expect these trade-offs: no single point of failure and effective breadth-first scaling, but coordination complexity increases with agent count, divergence risk rises without a central state keeper, and robust convergence constraints become essential.
111
+
112
+ Define explicit handoff protocols with state passing. Ensure agents communicate their context needs to receiving agents.
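One way to make "explicit handoff with state passing" concrete is a small record that names the target, the task, the state carried over, and the keys the receiving agent says it needs. This is a sketch under assumptions: `Handoff` and `make_handoff` are hypothetical names, not part of any framework mentioned here.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Explicit handoff record passed from one agent to the next."""
    target: str
    task: str
    state: dict = field(default_factory=dict)
    context_needs: list = field(default_factory=list)


def make_handoff(target: str, task: str, state: dict, context_needs: list) -> Handoff:
    """Build a handoff, refusing to transfer if required state is missing."""
    missing = [key for key in context_needs if key not in state]
    if missing:
        raise ValueError(f"handoff to {target} is missing state keys: {missing}")
    # Pass only the keys the receiver asked for, keeping its context lean.
    carried = {key: state[key] for key in context_needs}
    return Handoff(target=target, task=task, state=carried, context_needs=list(context_needs))


if __name__ == "__main__":
    full_state = {"order_id": "A-17", "chat_history": ["..."], "plan": "pro"}
    print(make_handoff("billing_agent", "issue refund", full_state, ["order_id", "plan"]))
```

Failing fast on missing keys surfaces a broken handoff at the boundary instead of inside the receiving agent.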
113
+
114
+ **Pattern 3: Hierarchical**
115
+ Organize agents into layers of abstraction: strategy (goal definition), planning (task decomposition), and execution (atomic tasks).
116
+
117
+ ```
118
+ Strategy Layer (Goal Definition) -> Planning Layer (Task Decomposition) -> Execution Layer (Atomic Tasks)
119
+ ```
120
+
121
+ Choose this pattern when: projects have clear hierarchical structure, workflows involve management layers, or tasks require both high-level planning and detailed execution.
122
+
123
+ Expect these trade-offs: clear separation of concerns and support for different context structures at different levels, but coordination overhead between layers, potential strategy-execution misalignment, and complex error propagation paths.
124
+
125
+ ### Context Isolation as Design Principle
126
+
127
+ Treat context isolation as the primary purpose of multi-agent architectures. Each sub-agent should operate in a clean context window focused on its subtask without carrying accumulated context from other subtasks.
128
+
129
+ **Isolation Mechanisms**
130
+ Select the right isolation mechanism for each subtask:
131
+
132
+ - **Full context delegation** — Share the planner's entire context with the sub-agent. Use for complex tasks where the sub-agent needs complete understanding. The sub-agent has its own tools and instructions but receives full context for its decisions. Note: this partially defeats the purpose of context isolation.
133
+ - **Instruction passing** — Create instructions via function call; the sub-agent receives only what it needs. Use for simple, well-defined subtasks. Maintains isolation but limits sub-agent flexibility.
134
+ - **File system memory** — Agents read and write to persistent storage. Use for complex tasks requiring shared state. The file system serves as the coordination mechanism, avoiding context bloat from shared state passing. Introduces latency and consistency challenges but scales better than message-passing.
135
+
136
+ Choose based on task complexity, coordination needs, and acceptable latency. Default to instruction passing and escalate to file system memory when shared state is needed. Avoid full context delegation unless the subtask genuinely requires it.
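The two recommended mechanisms can be sketched side by side. The helper names and the one-JSON-file-per-agent layout below are illustrative assumptions; a real system would also need to handle concurrent writers.

```python
import json
import tempfile
from pathlib import Path


def build_instructions(task: str, inputs: dict) -> str:
    """Instruction passing: the sub-agent receives only this text,
    not the planner's accumulated history."""
    lines = [f"Task: {task}"]
    lines += [f"- {key}: {value}" for key, value in sorted(inputs.items())]
    return "\n".join(lines)


def write_shared(workspace: Path, agent: str, payload: dict) -> Path:
    """File system memory: publish an agent's result for other agents to read."""
    path = workspace / f"{agent}.json"
    tmp = workspace / f"{agent}.json.tmp"
    tmp.write_text(json.dumps(payload))
    tmp.replace(path)  # rename into place so readers never see a partial file
    return path


def read_shared(workspace: Path, agent: str) -> dict:
    """Read another agent's published result from the shared workspace."""
    return json.loads((workspace / f"{agent}.json").read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        ws = Path(tmpdir)
        print(build_instructions("Summarize findings", {"source": "report.txt"}))
        write_shared(ws, "researcher", {"findings": ["a", "b"]})
        print(read_shared(ws, "researcher"))
```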
137
+
138
+ ### Consensus and Coordination
139
+
140
+ **The Voting Problem**
141
+ Avoid simple majority voting — it treats hallucinations from weak models as equal to reasoning from strong models. Without intervention, multi-agent discussions devolve into consensus on false premises due to inherent bias toward agreement.
142
+
143
+ **Weighted Voting**
144
+ Weight agent votes by confidence or expertise. Agents with higher confidence or domain expertise should carry more weight in final decisions.
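A minimal sketch of weighted voting, assuming each agent reports an answer together with a numeric weight (confidence or an expertise score). How those weights are produced is out of scope here, and `weighted_vote` is a hypothetical helper.

```python
from collections import defaultdict


def weighted_vote(votes: list[tuple[str, float]]) -> str:
    """Return the answer with the highest total weight.

    `votes` is a list of (answer, weight) pairs. Ties resolve to the
    answer that appeared first.
    """
    if not votes:
        raise ValueError("no votes to aggregate")
    totals: dict[str, float] = defaultdict(float)
    for answer, weight in votes:
        totals[answer] += weight
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Two low-confidence agents say "B"; one high-confidence agent says "A".
    votes = [("A", 0.9), ("B", 0.4), ("B", 0.4)]
    print(weighted_vote(votes))
```

In this example a simple majority would pick "B" (two votes to one), while the weighted total favors "A" (0.9 against 0.8).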
145
+
146
+ **Debate Protocols**
147
+ Structure agents to critique each other's outputs over multiple rounds. Adversarial critique often yields higher accuracy on complex reasoning than collaborative consensus. Guard against sycophantic convergence where agents agree to be agreeable rather than correct.
148
+
149
+ **Trigger-Based Intervention**
150
+ Monitor multi-agent interactions for behavioral markers. Activate stall triggers when discussions make no progress. Detect sycophancy triggers when agents mimic each other's answers without unique reasoning.
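The two triggers could be approximated with simple heuristics like the ones below. Both functions and their thresholds are assumptions for illustration; exact string equality is a crude proxy for "no unique reasoning", and a production monitor would likely need fuzzier comparison.

```python
def is_stalled(round_answers: list, window: int = 3) -> bool:
    """Stall trigger: the last `window` rounds produced identical answers."""
    if len(round_answers) < window:
        return False
    recent = round_answers[-window:]
    return all(answers == recent[0] for answers in recent)


def is_sycophantic(answers: list[str], rationales: list[str]) -> bool:
    """Sycophancy trigger: every agent gives the same answer and at least
    two of them repeat the same rationale verbatim."""
    if len(answers) < 2:
        return False
    all_agree = len(set(answers)) == 1
    repeated_reasoning = len(set(rationales)) < len(rationales)
    return all_agree and repeated_reasoning


if __name__ == "__main__":
    print(is_stalled([["X", "Y"], ["X", "Y"], ["X", "Y"]]))
    print(is_sycophantic(["X", "X"], ["agent 1 is right", "agent 1 is right"]))
```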
151
+
152
+ ### Framework Considerations
153
+
154
+ Different frameworks implement these patterns with different philosophies. LangGraph uses graph-based state machines with explicit nodes and edges. AutoGen uses conversational/event-driven patterns with GroupChat. CrewAI uses role-based process flows with hierarchical crew structures.
155
+
156
+ ## Practical Guidance
157
+
158
+ ### Failure Modes and Mitigations
159
+
160
+ **Failure: Supervisor Bottleneck**
161
+ The supervisor accumulates context from all workers, becoming susceptible to saturation and degradation.
162
+
163
+ Mitigate by constraining worker output schemas so workers return only distilled summaries. Use checkpointing to persist supervisor state without carrying full history in context.
164
+
165
+ **Failure: Coordination Overhead**
166
+ Agent communication consumes tokens and introduces latency. Complex coordination can negate parallelization benefits.
167
+
168
+ Mitigate by minimizing communication through clear handoff protocols. Batch results where possible. Use asynchronous communication patterns. Measure whether multi-agent coordination actually saves time versus a single agent with a longer context.
169
+
170
+ **Failure: Divergence**
171
+ Agents pursuing different goals without central coordination drift from intended objectives.
172
+
173
+ Mitigate by defining clear objective boundaries for each agent. Implement convergence checks that verify progress toward shared goals. Set time-to-live limits on agent execution to prevent unbounded exploration.
174
+
175
+ **Failure: Error Propagation**
176
+ Errors in one agent's output propagate to downstream agents that consume that output, compounding into increasingly wrong results.
177
+
178
+ Mitigate by validating agent outputs before passing to consumers. Implement retry logic with circuit breakers. Use idempotent operations where possible. Consider adding a verification agent that cross-checks critical outputs before they enter the pipeline.
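Validation, retries, and a circuit breaker can be combined in a few lines. This is a sketch, assuming the agent call is a zero-argument callable and validation is a predicate; `CircuitBreaker` and `CircuitOpenError` are hypothetical names, and exceptions raised by the agent itself are deliberately not handled.

```python
class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call a repeatedly failing agent."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0  # consecutive invalid outputs seen so far

    def call(self, agent_fn, validate, retries: int = 2):
        """Call agent_fn until validate(output) passes, within the retry budget."""
        if self.failures >= self.max_failures:
            raise CircuitOpenError("too many invalid outputs; not calling agent")
        for _ in range(retries + 1):
            output = agent_fn()
            if validate(output):
                self.failures = 0  # a valid output closes the circuit again
                return output
            self.failures += 1
            if self.failures >= self.max_failures:
                break
        raise ValueError("agent output failed validation")


if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2)
    print(breaker.call(lambda: {"summary": "ok"}, lambda out: "summary" in out))
```

Only validated output is returned, so downstream consumers never see a result that failed the check.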
179
+
180
+ ## Examples
181
+
182
+ **Example 1: Research Team Architecture**
183
+ ```text
184
+ Supervisor
185
+ ├── Researcher (web search, document retrieval)
186
+ ├── Analyzer (data analysis, statistics)
187
+ ├── Fact-checker (verification, validation)
188
+ └── Writer (report generation, formatting)
189
+ ```
190
+
191
+ **Example 2: Handoff Protocol**
192
+ ```python
193
+ def handle_customer_request(request):
194
+     if request.type == "billing":
195
+         return transfer_to(billing_agent)
196
+     elif request.type == "technical":
197
+         return transfer_to(technical_agent)
198
+     elif request.type == "sales":
199
+         return transfer_to(sales_agent)
200
+     else:
201
+         return handle_general(request)
202
+ ```
203
+
204
+ ## Guidelines
205
+
206
+ 1. Design for context isolation as the primary benefit of multi-agent systems
207
+ 2. Choose architecture pattern based on coordination needs, not organizational metaphor
208
+ 3. Implement explicit handoff protocols with state passing
209
+ 4. Use weighted voting or debate protocols for consensus
210
+ 5. Monitor for supervisor bottlenecks and implement checkpointing
211
+ 6. Validate outputs before passing between agents
212
+ 7. Set time-to-live limits to prevent infinite loops
213
+ 8. Test failure scenarios explicitly
214
+
215
+ ## Gotchas
216
+
217
+ 1. **Supervisor bottleneck scaling** — Supervisor context pressure grows non-linearly with worker count. At 5+ workers, the supervisor spends more tokens processing summaries than workers spend on actual tasks. Set a hard cap on workers per supervisor (3-5) and add a second supervisor tier rather than overloading one.
218
+ 2. **Token cost underestimation** — Multi-agent runs cost approximately 15x baseline. Teams consistently underbudget because they estimate per-agent costs without accounting for coordination overhead, retries, and consensus rounds. Budget for 15x and treat anything less as a bonus.
219
+ 3. **Sycophantic consensus** — Agents in debate patterns tend to converge on agreeable answers, not correct ones. LLMs have an inherent bias toward agreement. Counter this by assigning explicit adversarial roles and requiring agents to state disagreements before convergence is allowed.
220
+ 4. **Agent sprawl** — Adding more agents past 3-5 shows diminishing returns and increases coordination overhead. Each additional agent adds communication channels quadratically. Start with the minimum viable number of agents and add only when a clear context isolation benefit exists.
221
+ 5. **Telephone game in message-passing** — Information degrades through repeated summarization as it passes between agents. Each agent paraphrases and loses nuance. Use filesystem coordination instead of message-passing for state that multiple agents need to access faithfully.
222
+ 6. **Error propagation cascades** — One agent's hallucination becomes another agent's "fact." Downstream agents have no way to distinguish upstream hallucinations from genuine information. Add validation checkpoints between agents and never trust upstream output without verification.
223
+ 7. **Over-decomposition** — Splitting tasks too finely creates more coordination overhead than the task itself. A 10-step pipeline with 10 agents spends more tokens on handoffs than on actual work. Decompose only when subtasks genuinely benefit from separate contexts.
224
+ 8. **Missing shared state** — Agents operating without a shared filesystem or state store duplicate work, produce inconsistent outputs, and lose track of what has already been accomplished. Establish shared persistent storage before building multi-agent workflows.
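The quadratic growth mentioned in gotcha 4 can be made explicit. Assuming a fully connected peer-to-peer topology, where every pair of agents may communicate directly, the number of channels among $n$ agents is:

$$
C(n) = \binom{n}{2} = \frac{n(n-1)}{2}
$$

So $C(3) = 3$, $C(5) = 10$, and $C(10) = 45$. A supervisor topology with one coordinator and $n$ workers needs only $n$ channels, but it concentrates the load on the supervisor's context instead.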
225
+
226
+ ## Integration
227
+
228
+ This skill builds on context-fundamentals and context-degradation. It connects to:
229
+
230
+ - memory-systems - Shared state management across agents
231
+ - tool-design - Tool specialization per agent
232
+ - context-optimization - Context partitioning strategies
233
+
234
+ ## References
235
+
236
+ Internal reference:
237
+ - [Frameworks Reference](./references/frameworks.md) - Read when: implementing a specific multi-agent pattern in LangGraph, AutoGen, or CrewAI and needing framework-specific code examples
238
+
239
+ Related skills in this collection:
240
+ - context-fundamentals - Read when: needing to understand context window mechanics before designing agent partitioning
241
+ - memory-systems - Read when: agents need to share state across context boundaries or persist information between runs
242
+ - context-optimization - Read when: individual agent contexts are too large and need partitioning or compression strategies
243
+
244
+ External resources:
245
+ - [LangGraph Documentation](https://langchain-ai.github.io/langgraph/) - Read when: building graph-based multi-agent workflows with explicit state machines
246
+ - [AutoGen Framework](https://microsoft.github.io/autogen/) - Read when: implementing conversational GroupChat patterns or event-driven agent coordination
247
+ - [CrewAI Documentation](https://docs.crewai.com/) - Read when: designing role-based hierarchical agent processes
248
+ - [Research on Multi-Agent Coordination](https://arxiv.org/abs/2308.00352) - Read when: needing academic grounding on multi-agent system theory and evaluation
249
+
250
+ ---
251
+
252
+ ## Skill Metadata
253
+
254
+ **Created**: 2025-12-20
255
+ **Last Updated**: 2026-03-17
256
+ **Author**: Agent Skills for Context Engineering Contributors
257
+ **Version**: 2.0.0
@@ -0,0 +1,291 @@
1
+ ---
2
+ name: project-development
3
+ description: This skill should be used when the user asks to "start an LLM project", "design batch pipeline", "evaluate task-model fit", "structure agent project", or mentions pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches.
4
+ ---
5
+
6
+ # Project Development Methodology
7
+
8
+ This skill covers the principles for identifying tasks suited to LLM processing, designing effective project architectures, and iterating rapidly using agent-assisted development. The methodology applies whether building a batch processing pipeline, a multi-agent research system, or an interactive agent application.
9
+
10
+ ## When to Activate
11
+
12
+ Activate this skill when:
13
+ - Starting a new project that might benefit from LLM processing
14
+ - Evaluating whether a task is well-suited for agents versus traditional code
15
+ - Designing the architecture for an LLM-powered application
16
+ - Planning a batch processing pipeline with structured outputs
17
+ - Choosing between single-agent and multi-agent approaches
18
+ - Estimating costs and timelines for LLM-heavy projects
19
+
20
+ ## Core Concepts
21
+
22
+ ### Task-Model Fit Recognition
23
+
24
+ Evaluate task-model fit before writing any code, because building automation on a fundamentally mismatched task wastes days of effort. Run every proposed task through these two tables to decide proceed-or-stop.
25
+
26
+ **Proceed when the task has these characteristics:**
27
+
28
+ | Characteristic | Rationale |
29
+ |----------------|-----------|
30
+ | Synthesis across sources | LLMs combine information from multiple inputs better than rule-based alternatives |
31
+ | Subjective judgment with rubrics | Grading, evaluation, and classification with criteria map naturally to language reasoning |
32
+ | Natural language output | When the goal is human-readable text, LLMs deliver it natively |
33
+ | Error tolerance | Individual failures do not break the overall system, so LLM non-determinism is acceptable |
34
+ | Batch processing | No conversational state required between items, which keeps context clean |
35
+ | Domain knowledge in training | The model already has relevant context, reducing prompt engineering overhead |
36
+
37
+ **Stop when the task has these characteristics:**
38
+
39
+ | Characteristic | Rationale |
40
+ |----------------|-----------|
41
+ | Precise computation | Math, counting, and exact algorithms are unreliable in language models |
42
+ | Real-time requirements | LLM latency is too high for sub-second responses |
43
+ | Perfect accuracy requirements | Hallucination risk makes 100% accuracy impossible |
44
+ | Proprietary data dependence | The model lacks necessary context and cannot acquire it from prompts alone |
45
+ | Sequential dependencies | Each step depends heavily on the previous result, compounding errors |
46
+ | Deterministic output requirements | Same input must produce identical output, which LLMs cannot guarantee |
47
+
48
+ ### The Manual Prototype Step
49
+
50
+ Always validate task-model fit with a manual test before investing in automation. Copy one representative input into the model interface, evaluate the output quality, and use the result to answer these questions:
51
+
52
+ - Does the model have the knowledge required for this task?
53
+ - Can the model produce output in the format needed?
54
+ - What level of quality should be expected at scale?
55
+ - Are there obvious failure modes to address?
56
+
57
+ Do this because a failed manual prototype predicts a failed automated system, while a successful one provides both a quality baseline and a prompt-design template. The test takes minutes and prevents hours of wasted development.
58
+
59
+ ### Pipeline Architecture
60
+
61
+ Structure LLM projects as staged pipelines because separation of deterministic and non-deterministic stages enables fast iteration and cost control. Design each stage to be:
62
+
63
+ - **Discrete**: Clear boundaries between stages so each can be debugged independently
64
+ - **Idempotent**: Re-running produces the same result, preventing duplicate work
65
+ - **Cacheable**: Intermediate results persist to disk, avoiding expensive re-computation
66
+ - **Independent**: Each stage can run separately, enabling selective re-execution
67
+
68
+ **Use this canonical pipeline structure:**
69
+
70
+ ```
71
+ acquire -> prepare -> process -> parse -> render
72
+ ```
73
+
74
+ 1. **Acquire**: Fetch raw data from sources (APIs, files, databases)
75
+ 2. **Prepare**: Transform data into prompt format
76
+ 3. **Process**: Execute LLM calls (the expensive, non-deterministic step)
77
+ 4. **Parse**: Extract structured data from LLM outputs
78
+ 5. **Render**: Generate final outputs (reports, files, visualizations)
79
+
80
+ Stages 1, 2, 4, and 5 are deterministic. Stage 3 is non-deterministic and expensive. Maintain this separation because it allows re-running the expensive LLM stage only when necessary, while iterating quickly on parsing and rendering.
81
+
82
+ ### File System as State Machine
83
+
84
+ Use the file system to track pipeline state rather than databases or in-memory structures, because file existence provides natural idempotency and human-readable debugging.
85
+
86
+ ```
87
+ data/{id}/
88
+ raw.json # acquire stage complete
89
+ prompt.md # prepare stage complete
90
+ response.md # process stage complete
91
+ parsed.json # parse stage complete
92
+ ```
93
+
94
+ Determine whether an item needs processing by testing whether its output file exists. Re-run a stage by deleting its output file and all downstream files. Debug by reading the intermediate files directly. This pattern works because each item directory is independent, which makes parallelization simple and caching trivial.
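A minimal sketch of this file-existence check, assuming the four output file names from the layout above. The helper names `pending_stage` and `invalidate` are hypothetical, and the demo writes placeholder files into a temporary directory rather than real stage outputs.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Stage outputs in pipeline order, taken from the directory layout above.
STAGE_FILES = ["raw.json", "prompt.md", "response.md", "parsed.json"]


def pending_stage(item_dir: Path) -> Optional[str]:
    """Return the first missing stage output, or None if all exist."""
    for name in STAGE_FILES:
        if not (item_dir / name).exists():
            return name
    return None


def invalidate(item_dir: Path, output_name: str) -> None:
    """Delete a stage's output and every downstream output."""
    start = STAGE_FILES.index(output_name)
    for name in STAGE_FILES[start:]:
        (item_dir / name).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    item_dir = Path(tmp) / "data" / "item-1"
    item_dir.mkdir(parents=True)
    for name in STAGE_FILES:
        (item_dir / name).write_text("placeholder")

    all_done = pending_stage(item_dir)            # nothing left to run
    invalidate(item_dir, "response.md")           # force the LLM stage to re-run
    after_invalidate = pending_stage(item_dir)    # first missing output
    raw_kept = (item_dir / "raw.json").exists()   # upstream files are untouched
```

Invalidating `response.md` also removes `parsed.json`, so the parse stage cannot silently reuse output derived from a stale response.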
95
+
96
+ ### Structured Output Design
97
+
98
+ Design prompts for structured, parseable outputs because prompt design directly determines parsing reliability. Include these elements in every structured prompt:
99
+
100
+ 1. **Section markers**: Explicit headers or prefixes that parsers can match on
101
+ 2. **Format examples**: Show exactly what output should look like
102
+ 3. **Rationale disclosure**: State "I will be parsing this programmatically" so the model prioritizes format compliance
103
+ 4. **Constrained values**: Enumerated options, score ranges, and fixed formats
104
+
105
+ Build parsers that handle LLM output variations gracefully, because LLMs do not follow instructions perfectly. Use regex patterns flexible enough for minor formatting variations, provide sensible defaults when sections are missing, and log parsing failures for review rather than crashing.
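One way such a tolerant parser might look is sketched below. The `Score` and `Verdict` section names, the 1-10 range, and the allowed verdict values are assumptions made for illustration; they do not come from any particular prompt.

```python
import logging
import re
from typing import Any, Dict

logger = logging.getLogger("llm_parser")

ALLOWED_VERDICTS = {"accept", "reject", "unsure"}

# Tolerate markdown decoration and varied separators around the markers,
# e.g. "**Score**: 7/10", "score - 7", or "## Verdict: Accept".
SCORE_RE = re.compile(r"^\W*score\W*(\d+)", re.IGNORECASE | re.MULTILINE)
VERDICT_RE = re.compile(r"^\W*verdict\W*([a-z]+)", re.IGNORECASE | re.MULTILINE)


def parse_response(text: str) -> Dict[str, Any]:
    """Extract fields, falling back to defaults instead of raising."""
    result: Dict[str, Any] = {"score": None, "verdict": "unsure"}

    score_match = SCORE_RE.search(text)
    if score_match:
        # Clamp out-of-range values to the constrained 1-10 range.
        result["score"] = min(10, max(1, int(score_match.group(1))))
    else:
        logger.warning("score section missing")

    verdict_match = VERDICT_RE.search(text)
    if verdict_match and verdict_match.group(1).lower() in ALLOWED_VERDICTS:
        result["verdict"] = verdict_match.group(1).lower()
    else:
        logger.warning("verdict missing or not an allowed value")

    return result


clean = parse_response("**Score**: 7/10\n## Verdict: Accept")
empty = parse_response("no structure here")
odd = parse_response("score - 42\nverdict: maybe")
```

Missing or invalid sections produce a logged warning and a default value, so one malformed response does not stop a batch run.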
106
+
107
+ ### Agent-Assisted Development
108
+
109
+ Use agent-capable models to accelerate development through rapid iteration: describe the project goal and constraints, let the agent generate an initial implementation, test and iterate on specific failures, then refine prompts and architecture based on results.
110
+
111
+ Adopt these practices because they keep agent output focused and high-quality:
112
+ - Provide clear, specific requirements upfront to reduce revision cycles
113
+ - Break large projects into discrete components so each can be validated independently
114
+ - Test each component before moving to the next to catch failures early
115
+ - Keep the agent focused on one task at a time to prevent context degradation
116
+
117
+ ### Cost and Scale Estimation
118
+
119
+ Estimate LLM processing costs before starting, because token costs compound quickly at scale and late discovery of budget overruns forces costly rework. Use this formula:
120
+
121
+ ```
122
+ Total cost = (items x tokens_per_item x price_per_token) + API overhead
123
+ ```
124
+
125
+ For batch processing, estimate input tokens per item (prompt + context) and output tokens per item (typical response length), multiply each by the item count, and add a 20-30% buffer for retries and failures.
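The estimate can be worked through in a few lines. The sketch below assumes input and output tokens are priced separately per million tokens, as many providers do. The function name, the item count, the token counts, and the prices are hypothetical placeholders, not real rates.

```python
def estimate_cost(
    items: int,
    input_tokens_per_item: int,
    output_tokens_per_item: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    buffer: float = 0.25,
) -> float:
    """Estimate total batch cost, including a retry/failure buffer."""
    per_item_micro_dollars = (
        input_tokens_per_item * input_price_per_mtok
        + output_tokens_per_item * output_price_per_mtok
    )
    base = items * per_item_micro_dollars / 1_000_000
    return base * (1 + buffer)


# Hypothetical figures: 1,000 items, 2,000 input and 500 output tokens each,
# priced at 3.00 and 15.00 per million tokens, with a 25% buffer.
total = estimate_cost(1000, 2000, 500, 3.0, 15.0, buffer=0.25)
```

With these placeholder numbers the base cost is 13.50 and the buffered estimate is 16.875. Substitute the token counts measured during manual validation and the provider's current prices.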
126
+
127
+ Track actual costs during development. If costs significantly exceed estimates, reduce context length through truncation, use smaller models for simpler items, or cache and reuse partial results. Parallel processing reduces wall-clock time but does not lower token costs.
128
+
129
+ ## Detailed Topics
130
+
131
+ ### Choosing Single vs Multi-Agent Architecture
132
+
133
+ Default to single-agent pipelines for batch processing with independent items, because they are simpler to manage, cheaper to run, and easier to debug. Escalate to multi-agent architectures only when one of these conditions holds:
134
+
135
+ - Parallel exploration of different aspects is required
136
+ - The task exceeds the capacity of a single context window
137
+ - Specialized sub-agents demonstrably improve quality on benchmarks
138
+
139
+ Choose multi-agent for context isolation, not role anthropomorphization. Sub-agents get fresh context windows for focused subtasks, which prevents context degradation on long-running tasks.
140
+
141
+ See the `multi-agent-patterns` skill for detailed architecture guidance.
142
+
143
+ ### Architectural Reduction
144
+
145
+ Start with minimal architecture and add complexity only when production evidence proves it necessary, because over-engineered scaffolding often constrains rather than enables model performance.
146
+
147
+ Vercel's d0 agent achieved a 100% success rate (up from 80%) by reducing from 17 specialized tools to 2 primitives: bash command execution and SQL. The file system agent pattern uses standard Unix utilities (grep, cat, find, ls) instead of custom exploration tools.
148
+
149
+ **Reduce when:**
150
+ - The data layer is well-documented and consistently structured
151
+ - The model has sufficient reasoning capability
152
+ - Specialized tools are constraining rather than enabling
153
+ - More time is spent maintaining scaffolding than improving outcomes
154
+
155
+ **Add complexity when:**
156
+ - The underlying data is messy, inconsistent, or poorly documented
157
+ - The domain requires specialized knowledge the model lacks
158
+ - Safety constraints require limiting agent capabilities
159
+ - Operations are truly complex and benefit from structured workflows
160
+
161
+ See the `tool-design` skill for detailed tool architecture guidance.
162
+
163
+ ### Iteration and Refactoring
164
+
165
+ Plan for multiple architectural iterations from the start, because production agent systems at scale always require refactoring. Manus has refactored its agent framework five times since launch. The Bitter Lesson suggests that structures added to work around current model limitations become constraints as models improve.
166
+
167
+ Build for change by following these practices:
168
+ - Keep architecture simple and unopinionated so refactoring is cheap
169
+ - Test across model generations to verify the harness is not limiting performance
170
+ - Design systems that benefit from model improvements rather than locking in limitations
171
+
172
+ ## Practical Guidance
173
+
174
+ ### Project Planning Template
175
+
176
+ Follow this template in order, because each step validates assumptions before the next step invests effort.
177
+
178
+ 1. **Task Analysis**
179
+ - Define the input and desired output explicitly
180
+ - Classify: synthesis, generation, classification, or analysis
181
+ - Set an acceptable error rate based on business impact
182
+ - Estimate the value per successful completion to justify costs
183
+
184
+ 2. **Manual Validation**
185
+ - Test one representative example with the target model
186
+ - Evaluate output quality and format against requirements
187
+ - Identify failure modes that need parser hardening or prompt revision
188
+ - Estimate tokens per item for cost projection
189
+
190
+ 3. **Architecture Selection**
191
+ - Choose single pipeline vs multi-agent based on the criteria above
192
+ - Identify required tools and data sources
193
+ - Design storage and caching strategy using file-system state
194
+ - Plan parallelization approach for the process stage
195
+
196
+ 4. **Cost Estimation**
197
+ - Calculate items x tokens x price with a 20-30% buffer
198
+ - Estimate development time for each pipeline stage
199
+ - Identify infrastructure requirements (API keys, storage, compute)
200
+ - Project ongoing operational costs for production runs
201
+
202
+ 5. **Development Plan**
203
+ - Implement stage-by-stage, testing each before proceeding
204
+ - Define a testing strategy per stage with expected outputs
205
+ - Set iteration milestones tied to quality metrics
206
+ - Plan deployment approach with rollback capability
207
+
208
+ ## Examples
209
+
210
+ **Example 1: Batch Analysis Pipeline (Karpathy's HN Time Capsule)**
211
+
212
+ Task: Analyze 930 HN discussions from 10 years ago with hindsight grading.
213
+
214
+ Architecture:
215
+ - 5-stage pipeline: fetch -> prompt -> analyze -> parse -> render
216
+ - File system state: data/{date}/{item_id}/ with stage output files
217
+ - Structured output: 6 sections with explicit format requirements
218
+ - Parallel execution: 15 workers for LLM calls
219
+
220
+ Results: $58 total cost, ~1 hour execution, static HTML output.
221
+
222
+ **Example 2: Architectural Reduction (Vercel d0)**
223
+
224
+ Task: Text-to-SQL agent for internal analytics.
225
+
226
+ Before: 17 specialized tools, 80% success rate, 274s average execution.
227
+
228
+ After: 2 tools (bash + SQL), 100% success rate, 77s average execution.
229
+
230
+ Key insight: The semantic layer was already good documentation. Claude just needed access to read files directly.
231
+
232
+ See [Case Studies](./references/case-studies.md) for detailed analysis.
233
+
234
+ ## Guidelines
235
+
236
+ 1. Validate task-model fit with manual prototyping before building automation
237
+ 2. Structure pipelines as discrete, idempotent, cacheable stages
238
+ 3. Use the file system for state management and debugging
239
+ 4. Design prompts for structured, parseable outputs with explicit format examples
240
+ 5. Start with minimal architecture; add complexity only when proven necessary
241
+ 6. Estimate costs early and track throughout development
242
+ 7. Build robust parsers that handle LLM output variations
243
+ 8. Expect and plan for multiple architectural iterations
244
+ 9. Test whether scaffolding helps or constrains model performance
245
+ 10. Use agent-assisted development for rapid iteration on implementation
246
+
247
+ ## Gotchas
248
+
249
+ 1. **Skipping manual validation**: Building automation before verifying the model can do the task wastes significant time when the approach is fundamentally flawed. Always run one representative example through the model interface first.
250
+ 2. **Monolithic pipelines**: Combining all stages into one script makes debugging and iteration difficult. Separate stages with persistent intermediate outputs so each can be re-run independently.
251
+ 3. **Over-constraining the model**: Adding guardrails, pre-filtering, and validation logic that the model could handle on its own reduces performance. Test whether scaffolding helps or hurts before keeping it.
252
+ 4. **Ignoring costs until production**: Token costs compound quickly at scale. Estimate and track from the beginning to avoid budget surprises that force architectural rework.
253
+ 5. **Perfect parsing requirements**: Expecting LLMs to follow format instructions perfectly leads to brittle systems. Build robust parsers that handle variations and log failures for review.
254
+ 6. **Premature optimization**: Adding caching, parallelization, and optimization before the basic pipeline works correctly wastes effort on code that may be discarded during iteration.
255
+ 7. **Model version lock-in**: Building pipelines that only work with one specific model version creates fragile systems. Test across model generations and abstract the LLM call layer so models can be swapped without rewriting pipeline logic.
256
+ 8. **Evaluation-less deployment**: Shipping agent pipelines without measuring output quality means regressions go undetected. Define quality metrics during development and run evaluation checks before and after every model or prompt change.
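For gotcha 7, abstracting the LLM call layer could look like the sketch below. It assumes a single `complete` method is enough for the pipeline. The `LLMClient` protocol and both client classes are hypothetical stand-ins rather than any vendor's SDK.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Minimal interface the pipeline depends on (hypothetical)."""

    def complete(self, prompt: str) -> str:
        ...


class EchoClient:
    # Stand-in for one model or provider.
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class ShoutClient:
    # Stand-in for a different model version behind the same interface.
    def complete(self, prompt: str) -> str:
        return prompt.upper()


def process_stage(prompt: str, client: LLMClient) -> str:
    # Pipeline logic only knows about the interface, not the model.
    return client.complete(prompt)


first = process_stage("classify this", EchoClient())
second = process_stage("classify this", ShoutClient())
```

Swapping models then means passing a different client to `process_stage`, and the same evaluation checks can be run against each client before switching.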
257
+
258
+ ## Integration
259
+
260
+ This skill connects to:
261
+ - context-fundamentals - Understanding context constraints for prompt design
262
+ - tool-design - Designing tools for agent systems within pipelines
263
+ - multi-agent-patterns - When to use multi-agent versus single pipelines
264
+ - evaluation - Evaluating pipeline outputs and agent performance
265
+ - context-compression - Managing context when pipelines exceed limits
266
+
267
+ ## References
268
+
269
+ Internal references:
270
+ - [Case Studies](./references/case-studies.md) - Read when: evaluating architecture tradeoffs or reviewing real-world pipeline implementations (Karpathy HN Capsule, Vercel d0, Manus patterns)
271
+ - [Pipeline Patterns](./references/pipeline-patterns.md) - Read when: designing a new pipeline stage layout, choosing caching strategies, or debugging stage boundaries
272
+
273
+ Related skills in this collection:
274
+ - tool-design - Tool architecture and reduction patterns
275
+ - multi-agent-patterns - When to use multi-agent architectures
276
+ - evaluation - Output evaluation frameworks
277
+
278
+ External resources:
279
+ - Karpathy's HN Time Capsule project: https://github.com/karpathy/hn-time-capsule
280
+ - Vercel d0 architectural reduction: https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools
281
+ - Manus context engineering: Peak Ji's blog on context engineering lessons
282
+ - Anthropic multi-agent research: How we built our multi-agent research system
283
+
284
+ ---
285
+
286
+ ## Skill Metadata
287
+
288
+ **Created**: 2025-12-25
289
+ **Last Updated**: 2026-03-17
290
+ **Author**: Agent Skills for Context Engineering Contributors
291
+ **Version**: 1.1.0