aiag-cli 2.2.2 → 2.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +72 -37
- package/dist/cli.js +30 -2
- package/dist/cli.js.map +1 -1
- package/dist/commands/auto.js +45 -41
- package/dist/commands/auto.js.map +1 -1
- package/dist/commands/feature.d.ts +11 -0
- package/dist/commands/feature.d.ts.map +1 -0
- package/dist/commands/feature.js +153 -0
- package/dist/commands/feature.js.map +1 -0
- package/dist/commands/init.d.ts +1 -1
- package/dist/commands/init.d.ts.map +1 -1
- package/dist/commands/init.js +29 -78
- package/dist/commands/init.js.map +1 -1
- package/dist/commands/prd.d.ts +12 -0
- package/dist/commands/prd.d.ts.map +1 -0
- package/dist/commands/prd.js +179 -0
- package/dist/commands/prd.js.map +1 -0
- package/dist/prompts/coding.d.ts.map +1 -1
- package/dist/prompts/coding.js +12 -0
- package/dist/prompts/coding.js.map +1 -1
- package/dist/prompts/index.d.ts +2 -0
- package/dist/prompts/index.d.ts.map +1 -1
- package/dist/prompts/index.js +2 -0
- package/dist/prompts/index.js.map +1 -1
- package/dist/prompts/initializer.d.ts.map +1 -1
- package/dist/prompts/initializer.js +6 -0
- package/dist/prompts/initializer.js.map +1 -1
- package/dist/prompts/prd.d.ts +28 -0
- package/dist/prompts/prd.d.ts.map +1 -0
- package/dist/prompts/prd.js +105 -0
- package/dist/prompts/prd.js.map +1 -0
- package/dist/skills/index.d.ts +12 -0
- package/dist/skills/index.d.ts.map +1 -0
- package/dist/skills/index.js +12 -0
- package/dist/skills/index.js.map +1 -0
- package/dist/skills/installer.d.ts +38 -0
- package/dist/skills/installer.d.ts.map +1 -0
- package/dist/skills/installer.js +153 -0
- package/dist/skills/installer.js.map +1 -0
- package/dist/skills/loader.d.ts +34 -0
- package/dist/skills/loader.d.ts.map +1 -0
- package/dist/skills/loader.js +134 -0
- package/dist/skills/loader.js.map +1 -0
- package/dist/skills/runner.d.ts +14 -0
- package/dist/skills/runner.d.ts.map +1 -0
- package/dist/skills/runner.js +238 -0
- package/dist/skills/runner.js.map +1 -0
- package/dist/types.d.ts +127 -0
- package/dist/types.d.ts.map +1 -1
- package/dist/utils/prd.d.ts +21 -0
- package/dist/utils/prd.d.ts.map +1 -1
- package/dist/utils/prd.js +69 -0
- package/dist/utils/prd.js.map +1 -1
- package/dist/utils/taskmasterConverter.d.ts +72 -0
- package/dist/utils/taskmasterConverter.d.ts.map +1 -0
- package/dist/utils/taskmasterConverter.js +401 -0
- package/dist/utils/taskmasterConverter.js.map +1 -0
- package/dist/utils/taskmasterParser.d.ts +35 -0
- package/dist/utils/taskmasterParser.d.ts.map +1 -0
- package/dist/utils/taskmasterParser.js +259 -0
- package/dist/utils/taskmasterParser.js.map +1 -0
- package/package.json +1 -1
- package/templates/skills/prd-taskmaster/.taskmaster/docs/prd.md +2571 -0
- package/templates/skills/prd-taskmaster/.taskmaster/scripts/execution-state.py +87 -0
- package/templates/skills/prd-taskmaster/.taskmaster/scripts/learn-accuracy.py +113 -0
- package/templates/skills/prd-taskmaster/.taskmaster/scripts/rollback.sh +71 -0
- package/templates/skills/prd-taskmaster/.taskmaster/scripts/security-audit.py +130 -0
- package/templates/skills/prd-taskmaster/.taskmaster/scripts/track-time.py +133 -0
- package/templates/skills/prd-taskmaster/LICENSE +21 -0
- package/templates/skills/prd-taskmaster/README.md +608 -0
- package/templates/skills/prd-taskmaster/SKILL.md +1258 -0
- package/templates/skills/prd-taskmaster/reference/taskmaster-integration-guide.md +645 -0
- package/templates/skills/prd-taskmaster/reference/validation-checklist.md +394 -0
- package/templates/skills/prd-taskmaster/scripts/setup-taskmaster.sh +112 -0
- package/templates/skills/prd-taskmaster/templates/CLAUDE.md.template +635 -0
- package/templates/skills/prd-taskmaster/templates/taskmaster-prd-comprehensive.md +983 -0
- package/templates/skills/prd-taskmaster/templates/taskmaster-prd-minimal.md +103 -0
@@ -0,0 +1,2571 @@
# PRD: Agentic Arena-Based Skill Creation & Optimization System

**Author:** anombyte
**Date:** 2025-01-22
**Status:** Ready for Implementation
**Version:** 4.0 (Taskmaster Optimized)
**Taskmaster Optimized:** Yes
**Original Version:** v3.0 from skill_creating/planning/PRD.md

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Problem Statement](#problem-statement)
3. [Goals & Success Metrics](#goals--success-metrics)
4. [User Stories](#user-stories)
5. [Functional Requirements](#functional-requirements)
6. [Non-Functional Requirements](#non-functional-requirements)
7. [Technical Considerations](#technical-considerations)
8. [Implementation Roadmap](#implementation-roadmap)
9. [Out of Scope](#out-of-scope)
10. [Open Questions & Risks](#open-questions--risks)
11. [Validation Checkpoints](#validation-checkpoints)
12. [Appendix: Task Breakdown Hints](#appendix-task-breakdown-hints)

---

## Executive Summary

Current Claude Code skill creation is "write once" without optimization or validation. Users need an intelligent system that evolves skills through tournament-style arena battles with empirical testing—comparing real outputs, not theoretical code quality. This system provides database-first collective knowledge (show existing skills before creating), delivers quick base skills in 30 seconds for immediate use, then runs background optimization (25-45 min) using agentic orchestration and LLM-as-judge evaluation. Expected impact: average skill scores improve from 78/100 (base) to 93/100 (optimized), with 50% of requests served from collective database within 3 months.

---

## Problem Statement

### Current Situation

Users create Claude Code skills manually by writing SKILL.md files without:
- **Quality validation**: No way to know if skill will work well before deploying
- **Optimization**: Skills are written once and never improved
- **Collective knowledge**: Each user reinvents solutions others have already created
- **Empirical testing**: Skills are evaluated by reading code, not running them
- **Evolution mechanism**: No systematic way to iterate and improve skills

**Evidence:**
- From user requirement: "I can't create PRDs myself. I want the best possible PRD for optimal outcomes."
- User explicitly stated: "Planning is 95% of the work with vibe coding"
- Current skill_creating skill guides creation but doesn't optimize or validate

### User Impact

- **Who is affected:** Claude Code users creating custom skills (engineers, technical users)
- **How they're affected:**
  - Spend hours writing skills that may not work well
  - No feedback on skill quality until they use it in production
  - Reinvent solutions others have already created
  - No systematic improvement process
  - Miss opportunities to leverage collective expertise
- **Severity:** High - Directly impacts development velocity and skill effectiveness

### Business Impact

- **Cost of problem:**
  - Wasted time creating suboptimal skills (estimated 2-4 hours per skill)
  - Poor skill quality reduces Claude Code effectiveness
  - User frustration from trial-and-error skill development
- **Opportunity cost:**
  - Missing collective intelligence benefits (GitHub Copilot-like network effects)
  - Not capitalizing on community improvements
  - Slower Claude Code adoption due to skill creation friction
- **Strategic importance:**
  - Skills are core differentiator for Claude Code vs competitors
  - Quality skill ecosystem drives user retention and engagement
  - Collective evolution enables exponential improvement vs linear

### Why Solve This Now?

1. **2025 LLM evaluation best practices available**: Arena-Lite architecture, LLM-as-judge patterns, realistic test generation
2. **Technical capability ready**: Claude Code Task tool enables background agentic orchestration
3. **User demand clear**: Explicit request for "best possible" skills with automated optimization
4. **Competitive timing**: First to market with collective skill evolution for AI coding tools
5. **Foundation for future**: This enables advanced features (server farms, continuous evolution)

---

## Goals & Success Metrics

### Goal 1: Improve Skill Quality Through Arena Optimization

**Description:** Skills optimized through arena battles score significantly higher than base versions

**Metric:** Average score improvement (final vs base skill)

**Baseline:** 0 points (no optimization exists today)

**Target:** +15 points average (e.g., 78/100 base → 93/100 optimized)

**Timeframe:** Measured per skill, target achieved for 80% of skills within 30 min arena completion

**Measurement Method:** Automated scoring via LLM-as-judge comparing base (v0.1) vs optimized (v1.0) skill outputs

---

### Goal 2: Enable Collective Knowledge Reuse

**Description:** Users find and reuse existing high-quality skills instead of recreating

**Metric:** Reuse rate (% of skill requests served from collective database)

**Baseline:** 0% (no collective database exists)

**Target:** 50% of requests match existing skills within 3 months

**Timeframe:** 3 months post-launch

**Measurement Method:** Track database queries with confidence > 0.8, user selection of existing vs "build custom"

---

### Goal 3: Fast Time-to-First-Value

**Description:** Users get a working skill immediately while optimization runs in background

**Metric:** Time to base skill (v0.1) delivery

**Baseline:** N/A (current: manual creation 30-120 min)

**Target:** < 30 seconds for base skill generation

**Timeframe:** Every skill creation

**Measurement Method:** Track timestamp from user request to v0.1 skill deployed and usable

---

### Goal 4: Reliable Arena Completion Time

**Description:** Background optimization completes within predictable time windows

**Metric:** Arena completion time (p95)

**Baseline:** N/A

**Target:** < 30 minutes for moderate-complexity skills (p95)

**Timeframe:** Every arena execution

**Measurement Method:** Track arena start → convergence timestamps, categorized by skill complexity

---

### Goal 5: High User Satisfaction

**Description:** Users rate optimized skills highly and adopt the system

**Metric:** Average user rating of optimized skills

**Baseline:** N/A

**Target:** ≥ 4.5/5 stars average

**Timeframe:** Ongoing (minimum 50 ratings for statistical validity)

**Measurement Method:** Post-execution optional rating prompt (1-5 stars), aggregated in database

---

### Goal 6: Build Thriving Collective Database

**Description:** Grow a database of high-quality community-contributed skills

**Metric:** Total unique skills in collective database

**Baseline:** 0 skills

**Target:** 100+ skills across diverse domains in 3 months

**Timeframe:** 3 months post-launch

**Measurement Method:** Count unique skill_id entries in Pinecone database

---

## User Stories

### Story 1: Database-First Skill Discovery

**As a** Claude Code user,
**I want to** see existing high-quality skills before creating a new one,
**So that I can** reuse proven solutions instead of reinventing.

**Acceptance Criteria:**
- [ ] System queries Pinecone database before generating new skill
- [ ] Shows matching skills with arena scores (e.g., 91.5/100)
- [ ] Shows user ratings (e.g., ⭐4.7/5 from 342 users)
- [ ] Shows last updated timestamp and key features
- [ ] User can select existing skill or choose "Build custom"
- [ ] Confidence scoring for matches (>0.8 shown, <0.8 clarifies)
- [ ] Results appear within 2 seconds of query

**Task Breakdown Hint:**
- Task 1.1: Implement Pinecone vector search integration (6h)
- Task 1.2: Build requirement fingerprint generation (4h)
- Task 1.3: Create search results display UI (5h)
- Task 1.4: Add confidence scoring and ranking logic (3h)
- Task 1.5: Implement user selection workflow (3h)
- Task 1.6: Write tests for search accuracy (4h)

**Dependencies:** R11 (Collective API & Database), R4 (Requirement Extraction)

---

### Story 2: Progressive Skill Delivery (Quick Base → Optimized)

**As a** Claude Code user,
**I want to** get a working skill immediately while optimization runs in background,
**So that I can** start using it right away without waiting 30 minutes.

**Acceptance Criteria:**
- [ ] Quick initial research completes in <30 seconds
- [ ] Base skill (v0.1) generated and deployed immediately
- [ ] User can use v0.1 skill while arena runs in background
- [ ] Optional quick scoring of v0.1 (user can decline)
- [ ] Background arena starts automatically after v0.1 delivery
- [ ] User receives notification when optimized v1.0 ready
- [ ] Comparison shows improvement metrics (e.g., +15 points)
- [ ] User can review outputs and choose to deploy v1.0 or keep v0.1

**Task Breakdown Hint:**
- Task 2.1: Implement dual-track research system (quick + deep) (8h)
- Task 2.2: Build base skill generation from quick research (6h)
- Task 2.3: Create background job orchestration via Task tool (10h)
- Task 2.4: Implement optional quick scoring workflow (4h)
- Task 2.5: Build notification system for arena completion (3h)
- Task 2.6: Create comparison UI for v0.1 vs v1.0 (6h)
- Task 2.7: Write tests for progressive delivery flow (5h)

**Dependencies:** R2 (Dual-Track Research), R10 (Background Execution), R14 (Adaptive Complexity)

---

### Story 3: Agentic Question Generation

**As a** Claude Code user,
**I want** domain-specific questions about my requirements,
**So that** the system can optimize for what matters most to me.

**Acceptance Criteria:**
- [ ] System analyzes skill domain from user request
- [ ] Generates 3-7 questions specific to domain (e.g., PRD vs PDF extraction)
- [ ] Questions map to evaluation weight priorities
- [ ] Example question: "What's most important: completeness or speed?"
- [ ] User answers converted to weighted criteria (e.g., Quality: 64%, Speed: 5%; see the sketch below)
- [ ] Default weights provided if user skips questions
- [ ] Question generation completes within quick research phase (<30s)

**Task Breakdown Hint:**
- Task 3.1: Build domain analysis agent (6h)
- Task 3.2: Create question generation templates by domain (8h)
- Task 3.3: Implement answer-to-weight conversion logic (5h)
- Task 3.4: Add default weight fallback (2h)
- Task 3.5: Write tests for question relevance (4h)

**Dependencies:** R3 (Agentic Question Generation), AGENTIC_WEIGHTING_SOLUTIONS.md integration
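
A minimal sketch of the answer-to-weight conversion (Task 3.3), assuming answers arrive as a per-dimension importance pick on a 1-5 scale; the dimension names match the judge criteria used elsewhere in this PRD, but the raw scale and the rounding fix-up are assumptions for illustration:

```python
# Hypothetical sketch: convert question answers into weights summing to 100.
DEFAULT_WEIGHTS = {"completeness": 25, "clarity": 25, "quality": 25, "efficiency": 25}

def answers_to_weights(answers):
    """answers maps dimension -> importance (1-5); falsy input falls back to defaults."""
    if not answers:
        return dict(DEFAULT_WEIGHTS)
    total = sum(answers.values())
    weights = {dim: round(100 * score / total) for dim, score in answers.items()}
    # Rounding can leave the sum slightly off 100; pin the largest dimension.
    drift = 100 - sum(weights.values())
    weights[max(weights, key=weights.get)] += drift
    return weights

# e.g. {"quality": 5, "completeness": 2, "clarity": 2, "efficiency": 1}
# -> {"quality": 50, "completeness": 20, "clarity": 20, "efficiency": 10}
```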

---

### Story 4: Tournament Arena with Empirical Testing

**As a** system administrator,
**I want** skills to compete in arena battles using real outputs,
**So that** winners are selected based on empirical quality, not theoretical code review.

**Acceptance Criteria:**
- [ ] Arena generates 3 skill variations (A, B, C) in Round 1
- [ ] All variations execute with identical realistic test input
- [ ] System captures complete real outputs (PRDs, code, data, etc.)
- [ ] LLM judge compares outputs directly (not code)
- [ ] Judge uses weighted criteria from user questions
- [ ] Pairwise comparisons with position bias mitigation (randomized order)
- [ ] Bradley-Terry ranking determines winner
- [ ] Winner advances to next round vs 2 new refined variations
- [ ] Arena stops when convergence detected (score plateau, time limit, or target achieved)
- [ ] Maximum 10 rounds or 30 minutes (whichever comes first)

**Task Breakdown Hint:**
- Task 4.1: Implement skill variation generator agent (10h)
- Task 4.2: Build skill execution isolation sandbox (8h)
- Task 4.3: Create output capture system (4h)
- Task 4.4: Implement LLM-as-judge with pairwise comparison (12h)
- Task 4.5: Add Bradley-Terry ranking algorithm (6h)
- Task 4.6: Build convergence detection logic (5h)
- Task 4.7: Create tournament orchestration loop (8h)
- Task 4.8: Write comprehensive arena tests (10h)

**Dependencies:** R5 (Realistic Test Data), R6 (Tournament Arena), R7 (Skill Execution), R8 (LLM-as-Judge), R9 (Convergence)

---

### Story 5: Realistic Test Data Generation

**As a** system administrator,
**I want** realistic test scenarios for skill evaluation,
**So that** arena battles reflect real-world usage, not toy examples.

**Acceptance Criteria:**
- [ ] Agent discovers realistic use cases via web search
- [ ] LLM takes persona appropriate to skill type (e.g., "Product Manager" for PRD skills)
- [ ] Generates realistic input data (not "Create a PRD" but "Add password reset to fintech SaaS app")
- [ ] Test scenarios evolve across rounds (simple → complex → edge case)
- [ ] Validates realism through domain pattern matching
- [ ] Caches validated scenarios in database for reuse
- [ ] Each skill tested with minimum 1 realistic scenario per round

**Task Breakdown Hint:**
- Task 5.1: Build test data generator agent with web search (8h)
- Task 5.2: Create persona-based scenario generation (6h)
- Task 5.3: Implement scenario evolution logic (5h)
- Task 5.4: Add realism validation (4h)
- Task 5.5: Build scenario caching in Pinecone (4h)
- Task 5.6: Write tests for scenario quality (4h)

**Dependencies:** R5 (Realistic Test Data Generation), Pinecone database

---

### Story 6: User-in-the-Loop Validation

**As a** Claude Code user,
**I want to** review arena outputs and score them myself,
**So that** I can validate automated judgments and provide feedback.

**Acceptance Criteria:**
- [ ] All arena results stored locally in `.claude/skills/[skill-name]/arena_results/`
- [ ] Results stored as JSON with inputs, outputs, scores, reasoning
- [ ] User can browse results after arena completion
- [ ] UI shows side-by-side comparison of outputs
- [ ] User can score each output (1-5 stars)
- [ ] User feedback submitted to database (opt-in)
- [ ] Feedback improves future arena weights and judgments

**Task Breakdown Hint:**
- Task 6.1: Create local arena results storage system (4h)
- Task 6.2: Build results browsing UI (8h)
- Task 6.3: Implement side-by-side output comparison (6h)
- Task 6.4: Add user scoring interface (4h)
- Task 6.5: Build feedback submission to database (3h)
- Task 6.6: Write tests for validation flow (4h)

**Dependencies:** R12 (User-in-the-Loop Validation), R13 (Feedback Collection)

---

### Story 7: Collective Submission & Leaderboards

**As a** Claude Code user,
**I want to** submit my optimized skill to the collective database,
**So that** others can benefit and my skill can evolve further.

**Acceptance Criteria:**
- [ ] After arena completion, special notification if skill beats database champions
- [ ] Opt-in prompt to submit to collective
- [ ] Submission includes: skill content, arena scores (dimensional + overall), weights used, generation/lineage
- [ ] Privacy-conscious: input hash (not actual input), output samples (500 chars), anonymous user ID
- [ ] Skills appear in search results for future users
- [ ] Leaderboard shows top skills by domain
- [ ] ELO ratings update based on usage and feedback (see the sketch below)

**Task Breakdown Hint:**
- Task 7.1: Build champion comparison logic (3h)
- Task 7.2: Create submission UI with privacy controls (6h)
- Task 7.3: Implement skill submission API endpoint (8h)
- Task 7.4: Build leaderboard display (6h)
- Task 7.5: Add ELO rating calculation (5h)
- Task 7.6: Write tests for submission flow (5h)

**Dependencies:** R11 (Collective API & Database), R13 (Feedback Collection), skill lineage tracking
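
Task 7.5 calls for an ELO rating calculation but the PRD does not pin down the formula; a minimal sketch using the standard Elo update, where the K-factor of 32 is a conventional default rather than a value fixed by this PRD:

```python
# Standard Elo update, applied when one skill "beats" another
# (e.g., preferred in usage or feedback).
def expected_score(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, a_won, k=32.0):
    ea = expected_score(rating_a, rating_b)
    sa = 1.0 if a_won else 0.0
    new_a = rating_a + k * (sa - ea)
    new_b = rating_b + k * ((1.0 - sa) - (1.0 - ea))
    return new_a, new_b

# Example: a 1500-rated skill beating a 1600-rated champion gains ~20 points.
```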

---

## Functional Requirements

### Must Have (P0) - Critical for MVP

#### REQ-001: Database-First Query System

**Description:** System MUST query Pinecone collective database before generating new skills, showing existing matches to enable reuse.

**Acceptance Criteria:**
- [ ] Extract requirements from user request ("Create PRD skill for comprehensive planning")
- [ ] Generate requirement fingerprint (SHA-256 hash of structured requirements; sketched below)
- [ ] Query Pinecone with semantic search (embedding + metadata filtering)
- [ ] Return matches with confidence scores (0-1 scale)
- [ ] Show skills with confidence > 0.7
- [ ] Display: name, domain, arena scores (dimensional + overall), user ratings, last updated
- [ ] User can select existing skill or choose "Build custom"
- [ ] Query completes in < 2 seconds

**Technical Specification:**
```typescript
interface SkillSearchRequest {
  userRequest: string;
  domain?: string; // Optional: "prd-generation", "pdf-extraction", etc.
}

interface SkillSearchResult {
  skillId: string;
  name: string;
  domain: string;
  confidence: number; // 0-1
  scores: {
    completeness: number;
    clarity: number;
    quality: number;
    efficiency: number;
    overall: number;
  };
  feedback: {
    avgRating: number;
    totalRatings: number;
    successRate: number;
  };
  lastUpdated: string; // ISO 8601
  keyFeatures: string[];
}

// API Call
POST /api/collective/search
{
  "userRequest": "Create comprehensive PRD skill",
  "domain": "prd-generation"
}

// Response
{
  "matches": [
    {
      "skillId": "uuid-123",
      "name": "Comprehensive PRD Generator",
      "confidence": 0.92,
      "scores": { "overall": 91.5, ... },
      "feedback": { "avgRating": 4.7, "totalRatings": 342 },
      ...
    }
  ],
  "queryTime": 1.2
}
```

**Task Breakdown:**
- Implement requirement extraction and fingerprinting: Medium (6h)
- Build Pinecone semantic search integration: Medium (8h)
- Add confidence scoring logic: Small (4h)
- Create search results display: Medium (6h)
- Write integration tests: Small (4h)

**Dependencies:** Pinecone database setup, embedding model (OpenAI text-embedding-3-small)
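
A minimal sketch of the requirement fingerprint named in the acceptance criteria, assuming the structured requirements arrive as a plain dict; the example fields and the canonical-JSON step are assumptions, the latter to keep the SHA-256 stable across key ordering:

```python
import hashlib
import json

def requirement_fingerprint(reqs):
    # Serialize with sorted keys so logically equal requirement sets
    # always hash to the same fingerprint.
    canonical = json.dumps(reqs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical requirement shape; the exact fields are not fixed by this PRD.
print(requirement_fingerprint({
    "domain": "prd-generation",
    "goal": "comprehensive planning",
    "priorities": {"completeness": 40, "clarity": 30, "quality": 20, "efficiency": 10},
}))  # 64-hex-char fingerprint
```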

---

#### REQ-002: Quick Base Skill Generation (v0.1)

**Description:** Generate working base skill in < 30 seconds using quick initial research, deployable immediately.

**Acceptance Criteria:**
- [ ] Quick pattern research: Scan awesome-claude-skills examples (< 10s)
- [ ] Quick domain research: Basic WebSearch for best practices (< 15s)
- [ ] Generate domain-specific questions (3-7 questions) (< 5s)
- [ ] User answers questions or accepts defaults
- [ ] Generate base SKILL.md from quick research + answers (< 5s)
- [ ] Deploy to `.claude/skills/[skill-name]/SKILL.md`
- [ ] Skill immediately usable (activates on triggers)
- [ ] Total time user request → deployed v0.1: < 30 seconds (excluding user answer time)

**Technical Specification:**
```python
# Quick Research Flow (research, generation, and deployment helpers elided)
async def generate_base_skill(user_request, skill_name):
    start = time.time()

    # Phase 1: Quick Research (parallel, 15s total)
    pattern_research = quick_scan_patterns(user_request)  # 10s
    domain_research = quick_web_search(user_request)      # 15s (parallel)

    # Phase 2: Question Generation (5s)
    questions = generate_questions(pattern_research, domain_research)

    # User answers (time not counted)
    answers = await user.answer(questions) or get_defaults()

    # Phase 3: Skill Generation (5s)
    skill_content = generate_skill_md(
        pattern_research,
        domain_research,
        answers
    )

    # Phase 4: Deployment (1s)
    skill_path = deploy_skill(skill_content, skill_name)

    return {"version": "0.1", "path": skill_path, "time": time.time() - start}
```

**Task Breakdown:**
- Implement quick pattern scanner: Medium (6h)
- Build quick domain research: Small (4h)
- Create question generator agent: Medium (8h)
- Implement base skill template engine: Medium (6h)
- Add deployment automation: Small (3h)
- Write tests for 30s SLA: Small (4h)

**Dependencies:** WebSearch tool, pattern database (awesome-claude-skills)

---

#### REQ-003: Background Arena Orchestration

**Description:** After v0.1 delivery, automatically start background arena optimization using Task tool for non-blocking execution.

**Acceptance Criteria:**
- [ ] Arena starts automatically after v0.1 deployed
- [ ] User can continue working (non-blocking)
- [ ] Orchestrator-worker pattern: central orchestrator + worker agents
- [ ] Workers run in parallel via Task tool
- [ ] State persistence (can resume if interrupted)
- [ ] Progress indicator (optional, non-intrusive)
- [ ] Graceful handling of interruptions
- [ ] User can manually stop optimization
- [ ] Arena completes in < 30 min for moderate skills (p95)

**Technical Specification:**
```typescript
class ArenaOrchestrator {
  constructor(
    public skillName: string,
    public baseSkill: string,       // v0.1 content
    public userWeights: WeightConfig,
    public complexity: number       // 1-10
  ) {}

  async run(): Promise<ArenaResult> {
    // Start background job
    const jobId = await Task.start({
      subagent_type: "arena-orchestrator",
      prompt: `Run arena optimization for ${this.skillName}...`,
      async: true
    });

    // State machine
    const start = Date.now();
    let rounds = 0;
    let converged = false;
    let currentBest = this.baseSkill;
    while (!converged) {
      // Round N
      rounds += 1;
      const variations = await generateVariations(currentBest);
      const testData = await generateRealisticTest(rounds);
      const outputs = await Promise.all(
        variations.map(v => executeSkill(v, testData))
      );
      const scores = await judgeOutputs(outputs, this.userWeights);
      currentBest = selectWinner(scores);

      // Check convergence
      if (shouldStop(scores, Date.now() - start, rounds)) {
        converged = true;
      }
    }

    return { winner: currentBest, rounds, time: Date.now() - start };
  }
}
```

**Task Breakdown:**
- Implement orchestrator-worker pattern: Large (12h)
- Build state persistence system: Medium (6h)
- Add graceful interruption handling: Small (4h)
- Create progress tracking (optional): Small (3h)
- Write orchestration tests: Medium (8h)

**Dependencies:** Claude Code Task tool, R6 (Tournament Arena), R9 (Convergence Detection)

---

#### REQ-004: Realistic Test Data Generation

**Description:** Generate realistic test scenarios using web search and persona-based LLM generation, not generic toy examples.

**Acceptance Criteria:**
- [ ] Web search for domain-specific realistic examples
- [ ] LLM takes persona appropriate to skill (e.g., "Product Manager" for PRD)
- [ ] Generates realistic input matching real-world complexity
- [ ] Example: NOT "Create a PRD" but "Add OAuth 2.0 authentication supporting Google, Microsoft, GitHub to B2B SaaS app with existing auth system, 2FA support, SOC 2 compliance"
- [ ] Scenarios evolve across rounds: Round 1 (simple), Round 2 (complex), Round 3 (edge case)
- [ ] Validates realism via domain pattern matching
- [ ] Caches validated scenarios in Pinecone for reuse

**Technical Specification:**
```python
# Realistic Test Data Generation
class RealisticTestGenerator:
    def generate(self, skill_domain, round_num):
        # Step 1: Web search for real examples
        examples = web_search(f"{skill_domain} use cases 2025")

        # Step 2: Take persona
        persona = get_persona(skill_domain)
        # e.g., "Product Manager at B2B SaaS company"

        # Step 3: Generate realistic scenario
        scenario = llm.generate(
            prompt=f"""You are a {persona}.
Generate a realistic request for {skill_domain}.
Based on these real-world examples: {examples}

Complexity level: {get_complexity(round_num)}
- Round 1: Simple, common use case
- Round 2: Complex, multi-part scenario
- Round 3: Edge case, unusual constraints

Be specific, include real constraints and context."""
        )

        # Step 4: Validate realism
        if validate_realism(scenario, examples):
            cache_scenario(skill_domain, scenario)
            return scenario
        else:
            return self.generate(skill_domain, round_num)  # Retry
```

**Task Breakdown:**
- Build web search integration for examples: Medium (5h)
- Create persona mapping by domain: Small (4h)
- Implement scenario generation with evolution: Medium (8h)
- Add realism validation: Medium (5h)
- Build scenario caching in Pinecone: Small (4h)
- Write tests for scenario quality: Medium (6h)

**Dependencies:** WebSearch tool, Pinecone database, LLM access

---

#### REQ-005: Arena Tournament with Pairwise Comparison

**Description:** Run tournament battles using Arena-Lite architecture with direct pairwise comparison of real skill outputs.

**Acceptance Criteria:**
- [ ] Round 1: Generate 3 variations (A, B, C)
- [ ] Execute all variations with identical test input
- [ ] Capture complete real outputs (not truncated)
- [ ] Judge compares outputs pairwise (A vs B, B vs C, A vs C)
- [ ] Judge uses weighted criteria from user questions
- [ ] Position bias mitigation: randomize output order
- [ ] Bradley-Terry ranking from pairwise results
- [ ] Winner (highest rank) advances to next round
- [ ] Next round: Winner vs 2 new refined variations
- [ ] Repeat until convergence
- [ ] Maximum 10 variations tested per round (performance limit)

**Technical Specification:**
```python
# Arena Tournament Flow
class ArenaTournament:
    def run_round(self, variations, test_input, weights):
        # 1. Execute all variations
        outputs = []
        for var in variations:
            output = execute_skill(var, test_input)  # timing/token capture elided
            outputs.append({
                "variation": var.id,
                "output": output,
                "exec_time": elapsed,
                "tokens": token_count
            })

        # 2. Pairwise comparisons
        comparisons = []
        for i, out_a in enumerate(outputs):
            for out_b in outputs[i+1:]:
                # Randomize order to mitigate position bias
                order = random.choice(['AB', 'BA'])
                first, second = (out_a, out_b) if order == 'AB' else (out_b, out_a)

                result = judge.compare(
                    output_a=first["output"],
                    output_b=second["output"],
                    weights=weights,
                    criteria=["completeness", "clarity", "quality", "efficiency"]
                )

                # Map the judge's positional verdict back to the original pair
                if result.winner == "TIE":
                    pair_winner = None
                else:
                    pair_winner = (first if result.winner == "A" else second)["variation"]

                comparisons.append({
                    "pair": (out_a["variation"], out_b["variation"]),
                    "winner": pair_winner,
                    "reasoning": result.reasoning,
                    "scores": result.dimensional_scores
                })

        # 3. Bradley-Terry ranking
        rankings = bradley_terry_rank(comparisons)

        # 4. Select winner
        winner = rankings[0]

        return {
            "winner": winner,
            "rankings": rankings,
            "outputs": outputs,
            "comparisons": comparisons
        }
```

**Task Breakdown:**
- Implement skill variation generator: Large (10h)
- Build skill execution sandbox: Medium (8h)
- Create output capture system: Small (4h)
- Implement pairwise judge with bias mitigation: Large (12h)
- Add Bradley-Terry ranking: Medium (6h)
- Build tournament loop: Medium (8h)
- Write comprehensive tests: Large (10h)

**Dependencies:** R7 (Skill Execution), R8 (LLM-as-Judge), Arena-Lite algorithm
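
The `bradley_terry_rank` helper used in the specification above is not pinned down by this PRD; a minimal sketch fitting Bradley-Terry strengths from the comparison records built in `run_round`, using the standard minorization-maximization (MM) iteration. The fixed iteration count and the choice to ignore ties are assumptions:

```python
from collections import defaultdict

def bradley_terry_rank(comparisons, iters=100):
    """Rank variation ids from pairwise results.

    comparisons: dicts with "pair": (id_a, id_b) and "winner": id or None (tie).
    """
    wins = defaultdict(int)   # total wins per variation
    games = defaultdict(int)  # head-to-head decided-comparison counts
    ids = set()
    for c in comparisons:
        a, b = c["pair"]
        ids.update((a, b))
        if c["winner"] is None:
            continue  # ties ignored in this sketch
        games[(a, b)] += 1
        games[(b, a)] += 1
        wins[c["winner"]] += 1

    strength = {i: 1.0 for i in ids}
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        new = {}
        for i in ids:
            denom = sum(
                games[(i, j)] / (strength[i] + strength[j])
                for j in ids if j != i and games[(i, j)]
            )
            new[i] = wins[i] / denom if denom else strength[i]
        total = sum(new.values()) or 1.0
        # Normalize and floor at a tiny epsilon so winless variations
        # cannot drive a division by zero in later iterations.
        strength = {i: max(s / total, 1e-9) for i, s in new.items()}

    return sorted(ids, key=lambda i: strength[i], reverse=True)
```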

---

#### REQ-006: LLM-as-Judge with Weighted Evaluation

**Description:** Separate judge model evaluates real skill outputs using user-configured weighted criteria with chain-of-thought reasoning.

**Acceptance Criteria:**
- [ ] Judge model separate from skill execution model (avoid self-evaluation bias)
- [ ] Judges real outputs, NOT skill code
- [ ] Weighted dimensions: Completeness, Clarity, Quality, Efficiency (user-configured)
- [ ] Dimensional scores (0-100) with evidence from outputs
- [ ] Chain-of-thought reasoning before final score
- [ ] Anti-verbosity instructions (prefer concise accurate answers)
- [ ] Position bias mitigation (randomize output order)
- [ ] Store detailed reasoning for user review
- [ ] Integration with agentic weighting (failure mode analysis)

**Technical Specification:**
```typescript
interface JudgeRequest {
  outputA: string;
  outputB: string;
  weights: {
    completeness: number; // 0-100, sums to 100
    clarity: number;
    quality: number;
    efficiency: number;
  };
  domain: string;
}

interface JudgeResponse {
  winner: "A" | "B" | "TIE";
  reasoning: string; // Chain-of-thought
  dimensionalScores: {
    A: { completeness: number; clarity: number; quality: number; efficiency: number; };
    B: { completeness: number; clarity: number; quality: number; efficiency: number; };
  };
  overallScores: {
    A: number; // Weighted average
    B: number;
  };
  evidence: {
    dimension: string;
    winner: "A" | "B";
    example: string; // Specific quote from output
  }[];
}

// Example Judge Prompt
const judgePrompt = `
You are an expert evaluator for ${domain} skills.

Compare these two outputs for the task: "${testInput}"

Output A:
${outputA}

Output B:
${outputB}

Evaluation Criteria (weights):
- Completeness (${weights.completeness}%): Does it address all requirements?
- Clarity (${weights.clarity}%): Is it clear and understandable?
- Quality (${weights.quality}%): Is it high quality and detailed?
- Efficiency (${weights.efficiency}%): Is it concise without unnecessary verbosity?

IMPORTANT: Prefer concise, accurate answers over verbose ones.

Step 1: Analyze each dimension
[Your chain-of-thought reasoning]

Step 2: Score each dimension (0-100)
[Dimensional scores with evidence]

Step 3: Calculate weighted overall score
[Final scores]

Winner: [A/B/TIE]
`;
```

**Task Breakdown:**
- Implement separate judge model call: Small (3h)
- Build weighted evaluation logic: Medium (6h)
- Add chain-of-thought prompting: Small (4h)
- Implement dimensional scoring: Medium (6h)
- Add evidence extraction: Medium (5h)
- Build position bias mitigation: Small (3h)
- Write judge accuracy tests: Medium (8h)

**Dependencies:** LLM API access (Opus for judging), agentic weighting integration
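
The `overallScores` field above is described as a weighted average of the dimensional scores; a one-function sketch of that calculation, given that the weights sum to 100 per the interface comment:

```python
def overall_score(dimensional, weights):
    """Weighted average of 0-100 dimensional scores; weights sum to 100."""
    return sum(dimensional[d] * weights[d] for d in weights) / 100

# Example: scores {completeness: 90, clarity: 80, quality: 85, efficiency: 70}
# with weights {completeness: 40, clarity: 20, quality: 30, efficiency: 10}
# -> 90*0.4 + 80*0.2 + 85*0.3 + 70*0.1 = 84.5
```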

---

#### REQ-007: Convergence Detection (Multi-Criteria)

**Description:** Stop arena when ANY stopping condition met: score plateau, time limit, iteration limit, or target achieved.

**Acceptance Criteria:**
- [ ] Score plateau: Improvement < 2% for 3 consecutive rounds
- [ ] Time limit: Elapsed time > MAX_TIME (adaptive: 10min simple, 25min moderate, 45min complex)
- [ ] Iteration limit: Rounds > MAX_ROUNDS (adaptive: 3-10 based on complexity)
- [ ] Target achieved: Score >= TARGET_SCORE (e.g., 95/100)
- [ ] User interruption: Manual stop by user
- [ ] Log convergence reason for transparency
- [ ] Early stopping prevents wasted computation

**Technical Specification:**
```python
class ConvergenceDetector:
    def should_stop(self, history, elapsed, complexity):
        # Adaptive limits based on complexity (1-10 scale)
        if complexity <= 3:
            max_time_min, max_rounds = 10, 3
        elif complexity <= 7:
            max_time_min, max_rounds = 25, 5
        else:
            max_time_min, max_rounds = 45, 10
        TARGET_SCORE = 95

        # Criterion 1: Score plateau (< 2% relative improvement, 3 rounds)
        if len(history) >= 3:
            recent = history[-3:]
            improvements = [
                (recent[i].score - recent[i-1].score) / max(recent[i-1].score, 1)
                for i in range(1, 3)
            ]
            if all(imp < 0.02 for imp in improvements):
                return True, "Score plateau: < 2% improvement for 3 rounds"

        # Criterion 2: Time limit
        if elapsed > max_time_min * 60:
            return True, f"Time limit reached ({max_time_min} min)"

        # Criterion 3: Iteration limit
        if len(history) >= max_rounds:
            return True, f"Max rounds reached ({max_rounds})"

        # Criterion 4: Target achieved
        if history and history[-1].score >= TARGET_SCORE:
            return True, f"Target score achieved ({TARGET_SCORE})"

        # Criterion 5: User interruption (checked elsewhere)

        return False, None
```

**Task Breakdown:**
- Implement multi-criteria convergence logic: Medium (6h)
- Add adaptive limits based on complexity: Small (3h)
- Build user interruption handling: Small (3h)
- Add convergence logging: Small (2h)
- Write convergence tests: Small (4h)

**Dependencies:** R14 (Adaptive Complexity)

---

#### REQ-008: Pinecone Collective Database Integration

**Description:** HTTP API for skill storage, search, and feedback with Pinecone vector database backend (no MCP installation required).

**Acceptance Criteria:**
- [ ] Direct HTTP API calls via Bash curl (no MCP to install)
- [ ] POST /api/collective/search - Query for matching skills
- [ ] POST /api/collective/submit - Submit winning skill
- [ ] POST /api/collective/feedback - Submit user rating
- [ ] GET /api/collective/leaderboard - Top skills by domain
- [ ] Pinecone schema includes: skill_id, embedding (1536-dim), metadata (scores, weights, feedback, lineage, ELO, usage stats)
- [ ] Authentication for submissions (API key)
- [ ] Rate limiting (100 requests/min per user)
- [ ] Response time < 2s for search queries

**Technical Specification:**
```typescript
// Pinecone Schema
interface SkillVector {
  id: string;       // skill_id
  values: number[]; // 1536-dim embedding
  metadata: {
    // Basic info
    name: string;
    domain: string;
    skill_content: string; // Full SKILL.md

    // Dimensional scores
    scores: {
      completeness: number;
      clarity: number;
      quality: number;
      efficiency: number;
      overall: number;
    };

    // Impact-based weights used
    weights: {
      completeness: number;
      clarity: number;
      quality: number;
      efficiency: number;
      reasoning: string;
    };

    // Real-world user feedback
    feedback: {
      avg_rating: number;
      total_ratings: number;
      success_rate: number;
      recent_comments: string[];
    };

    // Lineage & evolution
    generation: number;
    parent_id: string;
    improvement_pct: number;

    // Rankings
    elo_rating: number;
    leaderboard_rank: number;

    // Usage stats
    usage_count: number;
    last_used: string; // ISO 8601

    // Test data
    test_scenarios: string[];
    arena_results_url: string;
  };
}

// API Endpoints
POST /api/collective/search
{
  "query": "comprehensive PRD skill",
  "domain": "prd-generation",
  "top_k": 5
}
→ { "matches": [...], "queryTime": 1.2 }

POST /api/collective/submit
{
  "skill_content": "...",
  "scores": {...},
  "weights": {...},
  "parent_id": "uuid-123",
  "test_scenarios": [...]
}
→ { "skillId": "uuid-456", "rank": 12 }

POST /api/collective/feedback
{
  "skill_id": "uuid-456",
  "rating": 5,
  "comment": "Excellent PRD generation",
  "success": true
}
→ { "updated": true }

GET /api/collective/leaderboard?domain=prd-generation&limit=10
→ { "skills": [...], "lastUpdated": "2025-01-22T10:00:00Z" }
```

**Task Breakdown:**
- Set up Pinecone database and schema: Medium (6h)
- Implement POST /search endpoint: Medium (8h)
- Implement POST /submit endpoint: Medium (8h)
- Implement POST /feedback endpoint: Small (4h)
- Implement GET /leaderboard endpoint: Small (4h)
- Add authentication and rate limiting: Medium (6h)
- Write API integration tests: Medium (8h)

**Dependencies:** Pinecone account, OpenAI embeddings API

---

### Should Have (P1) - Important for Full Experience

#### REQ-009: User-in-the-Loop Validation

**Description:** Store arena results locally, allow users to review outputs and score them manually for validation and feedback.

**Acceptance Criteria:**
- [ ] All arena results stored in `.claude/skills/[skill-name]/arena_results/`
- [ ] Format: JSON with inputs, outputs, scores, judge reasoning (see the sketch below)
- [ ] User can browse results after arena completes
- [ ] Side-by-side output comparison UI
- [ ] User can score each output (1-5 stars)
- [ ] User scores submitted to database (opt-in)
- [ ] Feedback improves future arena weights

**Task Breakdown:**
- Create local storage system: Small (4h)
- Build results browsing UI: Medium (8h)
- Implement comparison view: Medium (6h)
- Add user scoring interface: Small (4h)
- Build feedback submission: Small (3h)
- Write validation tests: Small (4h)

**Dependencies:** R12 (User-in-the-Loop Validation), Pinecone feedback API
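
A minimal sketch of the local arena-results record named in the acceptance criteria; only the directory and the JSON-with-inputs/outputs/scores/reasoning shape come from this PRD, while the per-round file naming and example field names are assumptions:

```python
import json
from pathlib import Path

def store_arena_result(skill_name, round_num, record):
    """Persist one round's result under .claude/skills/<skill>/arena_results/."""
    results_dir = Path(".claude/skills") / skill_name / "arena_results"
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"round_{round_num}.json"  # naming scheme assumed
    path.write_text(json.dumps(record, indent=2))
    return path

# Record shape follows the AC: inputs, outputs, scores, judge reasoning.
store_arena_result("prd-generator", 1, {
    "input": "Add password reset to fintech SaaS app",
    "outputs": {"A": "...", "B": "...", "C": "..."},
    "scores": {"A": 78.0, "B": 84.5, "C": 71.0},
    "reasoning": "B covered rollback and security requirements A and C missed.",
})
```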

---

#### REQ-010: Adaptive Tournament Sizing

**Description:** Tournament size (variations, rounds) adapts to skill complexity for optimal time/quality trade-off; the mapping is sketched below.

**Acceptance Criteria:**
- [ ] Simple skills (1-3): 3 variations, 3 rounds, ~10 min
- [ ] Moderate skills (4-7): 5 variations, 5 rounds, ~25 min
- [ ] Complex skills (8-10): 7 variations, 7 rounds, ~45 min
- [ ] Complexity auto-detected from user request and domain
- [ ] User can override default sizing
- [ ] Adaptive convergence limits (time, iterations)

**Task Breakdown:**
- Implement complexity detection: Medium (6h)
- Build adaptive sizing logic: Small (4h)
- Add user override option: Small (2h)
- Write adaptive tests: Small (4h)

**Dependencies:** R14 (Adaptive Complexity)
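
The sizing mapping follows directly from the acceptance criteria; the function name, return shape, and override mechanism are illustrative:

```python
# Tournament sizing from the acceptance criteria above.
def tournament_size(complexity, override=None):
    if override:  # user can override default sizing
        return override
    if complexity <= 3:
        return {"variations": 3, "rounds": 3, "budget_min": 10}
    if complexity <= 7:
        return {"variations": 5, "rounds": 5, "budget_min": 25}
    return {"variations": 7, "rounds": 7, "budget_min": 45}
```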

---

#### REQ-011: Skill Lineage Tracking

**Description:** Track skill evolution over time (parent → child relationships, improvement percentages, generation numbers).

**Acceptance Criteria:**
- [ ] Each skill stores parent_id in metadata
- [ ] Generation number auto-increments (parent.gen + 1)
- [ ] Improvement percentage calculated vs parent (see the sketch below)
- [ ] Can trace ancestry (skill → parent → grandparent → ...)
- [ ] Lineage displayed in search results and leaderboard

**Task Breakdown:**
- Add lineage fields to schema: Small (2h)
- Implement ancestry tracking: Small (4h)
- Build lineage display UI: Small (4h)
- Write lineage tests: Small (3h)

**Dependencies:** Pinecone schema update
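
A sketch deriving the lineage metadata fields (`generation`, `parent_id`, `improvement_pct`) that the Pinecone schema in REQ-008 stores; the generation-0 root convention and the relative-percentage definition of improvement are assumptions:

```python
def lineage_metadata(parent, child_overall):
    """parent is a stored skill record (or None for a root skill)."""
    if parent is None:
        return {"generation": 0, "parent_id": None, "improvement_pct": 0.0}
    parent_score = parent["scores"]["overall"]
    return {
        "generation": parent["generation"] + 1,  # parent.gen + 1 per the AC
        "parent_id": parent["id"],
        # Relative improvement vs parent; percentage definition assumed.
        "improvement_pct": 100 * (child_overall - parent_score) / parent_score,
    }

# e.g. parent overall 78.0 -> child 93.0 gives improvement_pct ≈ 19.2
```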

---

### Nice to Have (P2) - Future Enhancement

#### REQ-012: Test Scenario Evolution Across Rounds

**Description:** Tests get progressively harder across arena rounds (simple → complex → edge case).

**Acceptance Criteria:**
- [ ] Round 1: Simple, common use case
- [ ] Round 2: Complex, multi-part scenario
- [ ] Round 3: Edge case, unusual constraints
- [ ] Winner must excel at all difficulty levels

**Task Breakdown:**
- Implement scenario difficulty progression: Medium (6h)
- Write evolution tests: Small (4h)

**Dependencies:** R5 (Realistic Test Data)

---

#### REQ-013: Multi-Model Judge Ensemble

**Description:** Use multiple judge models (e.g., Opus + Sonnet) and aggregate results for more reliable judgments.

**Acceptance Criteria:**
- [ ] Run same comparison with 2+ judge models
- [ ] Aggregate results (majority vote or average scores; see the sketch below)
- [ ] Higher confidence when judges agree
- [ ] Flag for human review when judges disagree

**Task Breakdown:**
- Implement multi-model judging: Medium (8h)
- Build aggregation logic: Small (4h)
- Add disagreement detection: Small (3h)

**Dependencies:** Access to multiple LLM models
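
A minimal sketch of the majority-vote aggregation path; treating any non-unanimous panel as needing human review is an assumption (the PRD only says to flag disagreement):

```python
from collections import Counter

# Majority-vote aggregation over per-judge verdicts ("A", "B", or "TIE").
def aggregate_verdicts(verdicts):
    counts = Counter(verdicts)
    winner, top = counts.most_common(1)[0]
    unanimous = top == len(verdicts)
    return {
        "winner": winner if top > len(verdicts) / 2 else "TIE",
        "confidence": top / len(verdicts),    # higher when judges agree
        "needs_human_review": not unanimous,  # disagreement threshold assumed
    }

# aggregate_verdicts(["A", "A", "B"]) -> winner "A", confidence 0.67, review flagged
```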

---

## Non-Functional Requirements

### Performance

**Response Time:**
- Database search queries: < 2 seconds (p95)
- Base skill (v0.1) generation: < 30 seconds total
- Arena completion: < 30 minutes for moderate skills (p95)
- LLM judge comparison: < 10 seconds per pairwise comparison

**Throughput:**
- Support 100 concurrent users creating skills
- Database can handle 1000 queries/hour
- Arena can run 10 background jobs concurrently

**Resource Usage:**
- Local storage: < 100MB per skill (arena results)
- Memory: < 2GB for background arena process
- Network: Minimize API calls (batch where possible)

---

### Security

**Authentication:**
- API key required for database submissions
- User ID anonymized in database (hash)
- No PII stored in collective database

**Data Protection:**
- Skill content public (stored in database)
- User inputs hashed (not stored in plaintext)
- Output samples truncated (500 chars max)
- Local arena results private (stored on user machine)

**Privacy:**
- Opt-in for database submission
- Opt-in for feedback collection
- YAML flag: `collect-feedback: true/false`
- Can be disabled via setting

---

### Scalability

**User Load:**
- Support 1000 active users in first 3 months
- Scale to 10,000 users within 1 year
- Serverless API (auto-scales)

**Database Volume:**
- Initial: 100 skills
- Growth: 100-500 skills/month
- Storage: 1GB vector database initially
- Pinecone free tier: 1 index, 100k vectors (sufficient for MVP)

**Arena Jobs:**
- 10 concurrent background jobs per user machine
- Each job runs 10-45 min
- State persistence allows resume

---

### Reliability

**Uptime:**
- API SLA: 99% monthly uptime
- Graceful degradation if database unavailable (create skills without search)
- Resume capability if arena interrupted

**Error Handling:**
- Retry logic for API failures (3 retries with exponential backoff; sketched below)
- Timeout for skill execution (5 min max per skill)
- Fallback to defaults if web search fails
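
A sketch of the retry requirement above (3 retries with exponential backoff); the 1-second base delay and doubling factor are assumptions, as is the hypothetical `search_collective` call in the usage comment:

```python
import time

def with_retries(call, retries=3, base_delay=1.0):
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # exhausted all retries
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s

# result = with_retries(lambda: search_collective("prd skill"))
```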

**Monitoring:**
- Track arena completion rate
- Alert on high failure rate (> 10%)
- Log all convergence reasons

---

### Compatibility

**Claude Code Version:**
- Requires Claude Code with Task tool support
- Compatible with latest stable release

**System Requirements:**
- Linux, macOS, Windows (WSL2)
- Internet connection for API calls
- Disk space: 1GB free for arena results

**Dependencies:**
- Bash (for curl API calls)
- No MCP installation required
- No additional software needed

---

## Technical Considerations

### System Architecture

**Current Architecture:**
Claude Code skill system with:
- Skills stored in `~/.claude/skills/[skill-name]/SKILL.md`
- YAML frontmatter for metadata
- Activation via triggers in user messages
- No optimization or validation currently

**Proposed Architecture:**

```
┌───────────────────────────────────────────────────────────┐
│ Claude Code (User Machine)                                │
│                                                           │
│  skill_creating Skill (Enhanced)                          │
│                                                           │
│  1. Database-First Query                                  │
│     └─> Pinecone Search API                               │
│                                                           │
│  2. Quick Base Generation (v0.1)                          │
│     ├─> Pattern Research (awesome-claude)                 │
│     ├─> Domain Research (WebSearch)                       │
│     ├─> Question Generator Agent                          │
│     └─> Base Skill Template                               │
│                                                           │
│  3. Background Arena (via Task tool)                      │
│     Orchestrator Agent                                    │
│     ├─> Deep Research Agent                               │
│     ├─> Test Data Generator Agent                         │
│     │   └─> WebSearch (realistic examples)                │
│     ├─> Variation Generator Agent                         │
│     ├─> Execution Workers (parallel)                      │
│     │   ├─ Worker A: Execute Skill A                      │
│     │   ├─ Worker B: Execute Skill B                      │
│     │   └─ Worker C: Execute Skill C                      │
│     ├─> Judge Agent (LLM-as-judge)                        │
│     │   └─> Opus (separate from exec)                     │
│     └─> Synthesis Agent (Bradley-Terry)                   │
│                                                           │
│     State Persistence: Local JSON                         │
│     Arena Results: .claude/skills/.../arena...            │
│                                                           │
│  4. User Validation & Submission                          │
│     ├─> Review UI (compare outputs)                       │
│     ├─> User Scoring (1-5 stars)                          │
│     └─> Submit to Collective (opt-in)                     │
└───────────────────────────────────────────────────────────┘
                             │
                             │ HTTP API (curl)
                             ▼
┌───────────────────────────────────────────────────────────┐
│ Collective API (Serverless)                               │
│                                                           │
│  POST /search      POST /submit      POST /feedback       │
│       │                 │                  │              │
│       └─────────────────┼──────────────────┘              │
│                         ▼                                 │
│  Pinecone Vector Database (Index: claude-skills)          │
│    - Embeddings (1536-dim)                                │
│    - Metadata (scores, weights, feedback, ELO)            │
│    - Semantic search                                      │
│                                                           │
│  GET /leaderboard      Auth & Rate Limiting               │
└───────────────────────────────────────────────────────────┘
                             │
                             │ OpenAI API
                             ▼
                ┌─────────────────────────┐
                │ OpenAI Embeddings       │
                │ text-embedding-3-small  │
                └─────────────────────────┘
```

**Key Components:**

1. **skill_creating Skill (Enhanced):**
   - Main entry point for user requests
   - Orchestrates entire workflow
   - Uses Task tool for background jobs

2. **Agentic Workers:**
   - **Question Generator:** Domain-specific questions
   - **Research Agents:** Quick (30s) + Deep (background)
   - **Test Data Generator:** Realistic scenarios via web search
   - **Variation Generator:** Creates competing skill variations
   - **Execution Workers:** Run skills in isolation (parallel)
   - **Judge Agent:** LLM-as-judge with weighted criteria
   - **Synthesis Agent:** Bradley-Terry ranking, convergence detection

3. **Collective API:**
   - Serverless HTTP API (no MCP required)
   - Pinecone vector database backend
   - Authentication, rate limiting
   - Leaderboard and feedback system

4. **Local Storage:**
   - Arena results: `.claude/skills/[skill-name]/arena_results/`
   - State persistence for resume capability
   - User can review and validate

---

### API Specifications

See REQ-008 for detailed API specs.

**Key Endpoints:**
- `POST /api/collective/search` - Query skills
- `POST /api/collective/submit` - Submit skill
- `POST /api/collective/feedback` - Submit rating
- `GET /api/collective/leaderboard` - Top skills

**Authentication:**
```bash
curl -X POST https://api.collective.claude-skills.ai/search \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"query": "PRD skill", "domain": "prd-generation"}'
```

---

### Technology Stack

**Frontend (User-Facing):**
- Claude Code CLI interface
- Text-based UI for questions, results, comparison
- Markdown formatting for output

**Backend (Skill Logic):**
- Language: Integrated with Claude Code (no separate backend)
- Agentic orchestration: Claude Code Task tool
- State management: Local JSON files
- LLM calls: Anthropic API (Sonnet for execution, Opus for judging)

**Database:**
- Vector DB: Pinecone (free tier for MVP)
- Embeddings: OpenAI text-embedding-3-small (1536 dimensions)
- Storage format: JSON metadata + vector embeddings

**Infrastructure:**
- Execution: User's machine (local)
- API: Serverless (Vercel, AWS Lambda, or Cloudflare Workers)
- No additional installation required (uses existing Claude Code)

**External Dependencies:**
- WebSearch tool (Claude Code built-in)
- WebFetch tool (Claude Code built-in)
- Anthropic API (Claude Sonnet, Opus)
- OpenAI Embeddings API
- Pinecone API

---

### External Dependencies

**Third-Party Services:**

1. **Pinecone Vector Database:**
   - Purpose: Collective skill storage and semantic search
   - API: https://docs.pinecone.io/
   - Rate Limits: 100 requests/second (free tier)
   - Fallback: If down, create skills without search (graceful degradation)
   - Cost: Free tier (1 index, 100k vectors)

2. **OpenAI Embeddings API:**
   - Purpose: Generate 1536-dim embeddings for semantic search
   - Model: text-embedding-3-small
   - Rate Limits: 3,000 requests/min
   - Fallback: Cache embeddings, retry on failure (see the sketch after this list)
   - Cost: $0.02 per 1M tokens (~$0.0001 per skill)

3. **Anthropic API:**
   - Purpose: LLM calls for skill execution and judging
   - Models: Sonnet (execution), Opus (judging)
   - Rate Limits: Per user API key
   - Fallback: Use Sonnet for judging if Opus unavailable
   - Cost: User pays via their own API key

**Internal Dependencies:**
- **Claude Code Task tool:** Required for background execution
- **WebSearch tool:** For realistic test scenario discovery
- **WebFetch tool:** For pattern research (awesome-claude-skills)
- **Bash tool:** For curl API calls to collective database
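
For reference, a sketch of the embedding call this stack assumes, using OpenAI's public REST endpoint with plain `fetch`. Treat it as an illustration of the request/response shape rather than the package's actual code.

```typescript
// Illustrative embedding call against the OpenAI REST API.
export async function embed(text: string): Promise<number[]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
  });
  if (!res.ok) throw new Error(`embeddings request failed: ${res.status}`);
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding; // 1536-dim vector
}
```

Caching the result keyed by a requirement fingerprint (see Testing Strategy below) keeps re-embedding costs near zero.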

---

### Migration Strategy

**For Existing skill_creating Skill:**

1. **Phase 1: Enhance Existing Skill**
   - Add database-first query to SKILL.md workflow
   - Implement quick base generation (v0.1)
   - No breaking changes to current functionality

2. **Phase 2: Add Background Arena**
   - Implement arena orchestration via Task tool
   - Runs after v0.1 delivery (non-blocking)
   - Opt-in initially (user can skip arena)

3. **Phase 3: Collective Database Integration**
   - Deploy serverless API
   - Set up Pinecone database
   - Enable submissions

4. **Phase 4: Gradual Feature Rollout**
   - Week 1-2: Database search only
   - Week 3-4: Base generation + optional arena
   - Week 5-6: Full arena with convergence
   - Week 7-8: User validation and feedback

5. **Phase 5: Optimize Based on Usage**
   - Monitor arena completion rates
   - Adjust convergence criteria
   - Improve test data generation

**Rollback Plan:**
- Disable database queries (fall back to current behavior)
- Skip arena optimization (deploy v0.1 only)
- Feature flags control each component independently

---

### Testing Strategy

**Unit Tests:**
- Test coverage: > 80% for new code
- Key areas:
  - Requirement fingerprint generation
  - Question generation by domain
  - Skill variation creation
  - Pairwise comparison logic
  - Bradley-Terry ranking
  - Convergence detection
  - API request/response handling

**Integration Tests:**
- Full workflow tests:
  - Database query → results display → user selection
  - Quick research → base skill → deployment
  - Arena orchestration → workers → judge → synthesis
  - Local storage → user review → feedback submission

**E2E Tests:**
- User journeys:
  - New user creates PRD skill (finds existing, chooses it)
  - User builds custom skill (v0.1 → arena → v1.0)
  - User reviews arena outputs and scores them
  - User submits winning skill to collective
  - Another user discovers submitted skill via search

**Performance Tests:**
- Base skill generation completes < 30s (95th percentile)
- Database query completes < 2s (95th percentile)
- Arena completes < 30 min for moderate skills (95th percentile)
- API endpoints respond < 2s (95th percentile)

**Quality Tests:**
- Realistic test scenarios validated by domain experts
- Judge consistency: Same comparison run 3x should agree ≥ 80%
- Score improvement: v1.0 beats v0.1 in ≥ 80% of cases
- User satisfaction: ≥ 4.5/5 average rating

**Security Tests:**
- API authentication required for submissions
- Rate limiting prevents abuse
- No PII leaked in database
- Anonymous user IDs cannot be reversed

---

## Implementation Roadmap

### Phase 1: Foundation (Weeks 1-3)

**Goal:** Database setup, basic API, requirement extraction, quick research

**Tasks:**

- [ ] **Task 1.1:** Set up Pinecone database and schema
  - Complexity: Medium (6h)
  - Dependencies: None
  - Owner: Backend team
  - Deliverable: Pinecone index created, schema documented

- [ ] **Task 1.2:** Implement OpenAI embeddings generation
  - Complexity: Small (3h)
  - Dependencies: Task 1.1
  - Owner: Backend team
  - Deliverable: Function to generate 1536-dim vectors

- [ ] **Task 1.3:** Build POST /search API endpoint
  - Complexity: Medium (8h)
  - Dependencies: Tasks 1.1, 1.2
  - Owner: Backend team
  - Deliverable: Working semantic search API

- [ ] **Task 1.4:** Implement requirement extraction and fingerprinting
  - Complexity: Medium (6h)
  - Dependencies: None
  - Owner: Agent team
  - Deliverable: Parse user requests into structured requirements (fingerprinting sketch after this list)

- [ ] **Task 1.5:** Build quick pattern research (awesome-claude-skills scan)
  - Complexity: Small (4h)
  - Dependencies: None
  - Owner: Agent team
  - Deliverable: Quick scan returns relevant patterns in < 10s

- [ ] **Task 1.6:** Build quick domain research (WebSearch integration)
  - Complexity: Medium (6h)
  - Dependencies: None
  - Owner: Agent team
  - Deliverable: Web search returns best practices in < 15s
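
A possible shape for the Task 1.4 fingerprint, assuming requirements are first parsed into a small structured record (the `Requirements` fields here are assumptions, not the package's types). Normalizing and sorting before hashing makes equivalent requests collide on the same key, which helps dedupe database queries and cache embeddings.

```typescript
// Hypothetical requirement fingerprint: canonicalize, then hash.
import { createHash } from "node:crypto";

interface Requirements {
  domain: string;         // e.g. "prd-generation"
  capabilities: string[]; // extracted features/verbs
  constraints: string[];  // limits the user stated
}

export function fingerprint(req: Requirements): string {
  const canonical = JSON.stringify({
    domain: req.domain.toLowerCase().trim(),
    capabilities: [...req.capabilities].map((c) => c.toLowerCase().trim()).sort(),
    constraints: [...req.constraints].map((c) => c.toLowerCase().trim()).sort(),
  });
  return createHash("sha256").update(canonical).digest("hex").slice(0, 16);
}
```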

**Validation Checkpoint:**
- [ ] Can query database and get relevant results in < 2s
- [ ] Can extract requirements from user request
- [ ] Quick research completes in < 20s total
- [ ] Unit tests passing for all components

**Total Effort:** ~33 hours (~1.5 weeks with 2-person team)

---

### Phase 2: Quick Base Generation (Weeks 4-5)

**Goal:** Question generation, base skill creation, v0.1 deployment

**Tasks:**

- [ ] **Task 2.1:** Build question generator agent
  - Complexity: Medium (8h)
  - Dependencies: Phase 1 complete (uses quick research)
  - Owner: Agent team
  - Deliverable: Generates 3-7 domain-specific questions

- [ ] **Task 2.2:** Implement answer-to-weight conversion
  - Complexity: Medium (5h)
  - Dependencies: Task 2.1
  - Owner: Agent team
  - Deliverable: User answers → weighted criteria, e.g., Quality: 64% (sketch after this list)

- [ ] **Task 2.3:** Create base skill template engine
  - Complexity: Medium (6h)
  - Dependencies: Quick research, questions
  - Owner: Agent team
  - Deliverable: Generates SKILL.md from inputs

- [ ] **Task 2.4:** Implement skill deployment automation
  - Complexity: Small (3h)
  - Dependencies: Task 2.3
  - Owner: Agent team
  - Deliverable: Writes SKILL.md to `.claude/skills/[name]/`

- [ ] **Task 2.5:** Build database-first workflow integration
  - Complexity: Medium (8h)
  - Dependencies: Task 1.3, Task 2.4
  - Owner: Integration team
  - Deliverable: Full flow: query → show results → user select → deploy

- [ ] **Task 2.6:** Add optional quick scoring of v0.1
  - Complexity: Small (4h)
  - Dependencies: Task 2.4
  - Owner: Agent team
  - Deliverable: User can score v0.1 immediately (optional)
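
A minimal sketch of the Task 2.2 conversion, assuming each answer contributes raw importance points per criterion (the point scale itself is an assumption). Normalizing the points yields percentage weights like the "Quality: 64%" example above.

```typescript
// Illustrative answer-to-weight conversion: normalize raw points to percentages.
export function answersToWeights(raw: Record<string, number>): Record<string, number> {
  const total = Object.values(raw).reduce((sum, v) => sum + v, 0) || 1;
  return Object.fromEntries(
    Object.entries(raw).map(([criterion, points]) => [
      criterion,
      Math.round((points / total) * 100), // percentage weight (rounded)
    ]),
  );
}

// Example: { quality: 9, speed: 3, brevity: 2 } → { quality: 64, speed: 21, brevity: 14 }
```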

**Validation Checkpoint:**
- [ ] Full base generation completes in < 30s (excluding user answer time)
- [ ] Generated v0.1 skill is valid and activates correctly
- [ ] Database query → user selection → deployment works end-to-end
- [ ] Integration tests passing

**Total Effort:** ~34 hours (~1.5 weeks)

---

### Phase 3: Arena Core (Weeks 6-8)

**Goal:** Background orchestration, skill execution, realistic test data, pairwise judge

**Tasks:**

- [ ] **Task 3.1:** Implement orchestrator-worker pattern via Task tool
  - Complexity: Large (12h)
  - Dependencies: None (new subsystem)
  - Owner: Agent team
  - Deliverable: Orchestrator manages arena rounds in background

- [ ] **Task 3.2:** Build skill variation generator agent
  - Complexity: Large (10h)
  - Dependencies: Base skill template
  - Owner: Agent team
  - Deliverable: Creates 3 variations (A, B, C) from base

- [ ] **Task 3.3:** Build realistic test data generator agent
  - Complexity: Medium (8h)
  - Dependencies: WebSearch tool
  - Owner: Agent team
  - Deliverable: Generates realistic scenarios using personas

- [ ] **Task 3.4:** Implement skill execution sandbox
  - Complexity: Medium (8h)
  - Dependencies: None
  - Owner: Execution team
  - Deliverable: Isolated skill execution with timeout

- [ ] **Task 3.5:** Build output capture system
  - Complexity: Small (4h)
  - Dependencies: Task 3.4
  - Owner: Execution team
  - Deliverable: Captures full skill outputs (no truncation)

- [ ] **Task 3.6:** Implement LLM-as-judge with pairwise comparison
  - Complexity: Large (12h)
  - Dependencies: Weighted criteria from questions
  - Owner: Judge team
  - Deliverable: Compares 2 outputs, returns winner with reasoning

- [ ] **Task 3.7:** Add position bias mitigation (randomize order)
  - Complexity: Small (3h)
  - Dependencies: Task 3.6
  - Owner: Judge team
  - Deliverable: Multiple comparisons with order flipped

- [ ] **Task 3.8:** Implement Bradley-Terry ranking
  - Complexity: Medium (6h)
  - Dependencies: Task 3.6
  - Owner: Judge team
  - Deliverable: Ranks variations from pairwise results (sketch after this list)
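
For context, Bradley-Terry fits a latent strength per variation from pairwise outcomes. A minimal sketch using the standard minorization-maximization update; this is illustrative, not the shipped implementation.

```typescript
// Minimal Bradley-Terry fit from pairwise win counts.
// wins[i][j] = number of times variation i beat variation j.
export function bradleyTerry(wins: number[][], iters = 100): number[] {
  const n = wins.length;
  let p: number[] = Array(n).fill(1); // initial strengths
  for (let it = 0; it < iters; it++) {
    const next = p.map((_, i) => {
      const totalWins = wins[i].reduce((s, w) => s + w, 0);
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        const games = wins[i][j] + wins[j][i];
        if (games > 0) denom += games / (p[i] + p[j]); // MM update term
      }
      return denom > 0 ? totalWins / denom : p[i];
    });
    const sum = next.reduce((s, v) => s + v, 0);
    p = next.map((v) => (v / sum) * n); // renormalize to keep scale stable
  }
  return p; // higher strength = better variation
}

// Example: bradleyTerry([[0, 3, 4], [1, 0, 2], [0, 2, 0]]) ranks variation A first.
```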

**Validation Checkpoint:**
- [ ] Can generate 3 variations from base skill
- [ ] Can execute all variations with same test input
- [ ] Judge compares outputs and selects winner
- [ ] Ranking produces consistent results
- [ ] Background orchestration runs without blocking user

**Total Effort:** ~63 hours (~3 weeks)

---

### Phase 4: Convergence & Iteration (Weeks 9-10)

**Goal:** Multi-round tournaments, convergence detection, state persistence

**Tasks:**

- [ ] **Task 4.1:** Build tournament iteration loop
  - Complexity: Medium (8h)
  - Dependencies: Phase 3 complete
  - Owner: Orchestrator team
  - Deliverable: Winner advances to next round vs 2 new variations

- [ ] **Task 4.2:** Implement convergence detection (multi-criteria)
  - Complexity: Medium (6h)
  - Dependencies: Tournament loop
  - Owner: Orchestrator team
  - Deliverable: Stops when score plateau, time limit, or target met (sketch after this list)

- [ ] **Task 4.3:** Add adaptive tournament sizing
  - Complexity: Small (4h)
  - Dependencies: Complexity detection
  - Owner: Orchestrator team
  - Deliverable: Simple (3 rounds), Moderate (5), Complex (7)

- [ ] **Task 4.4:** Build state persistence system
  - Complexity: Medium (6h)
  - Dependencies: Tournament loop
  - Owner: Orchestrator team
  - Deliverable: Can resume arena if interrupted

- [ ] **Task 4.5:** Implement graceful interruption handling
  - Complexity: Small (4h)
  - Dependencies: State persistence
  - Owner: Orchestrator team
  - Deliverable: User can stop arena, resume later

- [ ] **Task 4.6:** Build arena completion notification
  - Complexity: Small (3h)
  - Dependencies: Convergence detection
  - Owner: UI team
  - Deliverable: User notified when v1.0 ready
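
A sketch of how the multi-criteria stop rule in Task 4.2 could combine its signals. The 2% plateau threshold echoes Checkpoint 4 below; the rest of the shape is an assumption.

```typescript
// Illustrative convergence check: returns a reason to stop, or null to continue.
interface ArenaState {
  roundScores: number[]; // best score per completed round
  startedAt: number;     // epoch ms when the arena began
  maxRounds: number;     // from adaptive sizing (3/5/7)
  maxMinutes: number;    // e.g. 30 for moderate skills
  targetScore?: number;  // optional early-exit target
}

export function convergenceReason(s: ArenaState): string | null {
  const scores = s.roundScores;
  if (scores.length === 0) return null; // nothing to judge yet
  const last = scores[scores.length - 1];
  if (s.targetScore !== undefined && last >= s.targetScore) return "target score met";
  if (scores.length >= s.maxRounds) return "round limit reached";
  if (Date.now() - s.startedAt > s.maxMinutes * 60_000) return "time limit reached";
  if (scores.length >= 2) {
    const prev = scores[scores.length - 2];
    const gain = (last - prev) / Math.max(prev, 1);
    if (gain < 0.02) return "score plateau (< 2% improvement)";
  }
  return null; // keep iterating
}
```

Logging the returned reason directly satisfies the "log all convergence reasons" monitoring requirement.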

**Validation Checkpoint:**
- [ ] Arena runs multiple rounds until convergence
- [ ] Convergence criteria working correctly (no infinite loops)
- [ ] Can interrupt and resume arena
- [ ] Notification appears when complete
- [ ] Arena completes in < 30 min for moderate skills

**Total Effort:** ~31 hours (~1.5 weeks)

---

### Phase 5: User Validation & Feedback (Weeks 11-12)

**Goal:** Local storage, review UI, user scoring, feedback collection

**Tasks:**

- [ ] **Task 5.1:** Create local arena results storage
  - Complexity: Small (4h)
  - Dependencies: Output capture
  - Owner: Storage team
  - Deliverable: JSON files in `.claude/skills/.../arena_results/`

- [ ] **Task 5.2:** Build results browsing UI
  - Complexity: Medium (8h)
  - Dependencies: Local storage
  - Owner: UI team
  - Deliverable: User can view all round results

- [ ] **Task 5.3:** Implement side-by-side output comparison
  - Complexity: Medium (6h)
  - Dependencies: Results browsing
  - Owner: UI team
  - Deliverable: Compare outputs from different variations

- [ ] **Task 5.4:** Add user scoring interface (1-5 stars)
  - Complexity: Small (4h)
  - Dependencies: Comparison UI
  - Owner: UI team
  - Deliverable: User can rate each output

- [ ] **Task 5.5:** Build version comparison (v0.1 vs v1.0)
  - Complexity: Medium (6h)
  - Dependencies: Arena results
  - Owner: UI team
  - Deliverable: Shows improvement metrics (+15 points, etc.)

- [ ] **Task 5.6:** Implement feedback submission to database
  - Complexity: Small (3h)
  - Dependencies: User scoring, Pinecone API
  - Owner: API team
  - Deliverable: POST /feedback endpoint working

**Validation Checkpoint:**
- [ ] Arena results stored locally
- [ ] User can review and compare outputs
- [ ] User can score outputs and submit feedback
- [ ] Feedback appears in database
- [ ] UI is clear and usable

**Total Effort:** ~31 hours (~1.5 weeks)

---

### Phase 6: Collective Database Features (Weeks 13-14)

**Goal:** Skill submission, leaderboards, lineage tracking

**Tasks:**

- [ ] **Task 6.1:** Implement POST /submit API endpoint
  - Complexity: Medium (8h)
  - Dependencies: Pinecone database
  - Owner: API team
  - Deliverable: Can submit skills to collective

- [ ] **Task 6.2:** Build submission UI with privacy controls
  - Complexity: Medium (6h)
  - Dependencies: Arena completion
  - Owner: UI team
  - Deliverable: User prompted to submit, opt-in, privacy clear

- [ ] **Task 6.3:** Add champion comparison logic
  - Complexity: Small (3h)
  - Dependencies: Database query, arena scores
  - Owner: API team
  - Deliverable: Detect when user skill beats database champions

- [ ] **Task 6.4:** Implement GET /leaderboard endpoint
  - Complexity: Small (4h)
  - Dependencies: Pinecone database
  - Owner: API team
  - Deliverable: Returns top skills by domain

- [ ] **Task 6.5:** Build leaderboard display UI
  - Complexity: Medium (6h)
  - Dependencies: Leaderboard API
  - Owner: UI team
  - Deliverable: Shows top skills with scores, ratings

- [ ] **Task 6.6:** Add skill lineage tracking
  - Complexity: Small (4h)
  - Dependencies: Submission API
  - Owner: API team
  - Deliverable: parent_id, generation, improvement_pct stored

- [ ] **Task 6.7:** Implement ELO rating calculation
  - Complexity: Medium (5h)
  - Dependencies: Usage data, feedback
  - Owner: API team
  - Deliverable: Skills have ELO ratings that update (sketch after this list)
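
Elo updates are standard: when two skills are compared (via usage outcomes or judge results), the winner takes rating points from the loser in proportion to how surprising the result was. A minimal sketch; the K-factor of 32 and the 1500 starting rating are conventional defaults, not confirmed values for this system.

```typescript
// Illustrative Elo update for a single pairwise result.
export function eloUpdate(winner: number, loser: number, k = 32): [number, number] {
  const expectedWin = 1 / (1 + 10 ** ((loser - winner) / 400)); // logistic expectation
  const delta = k * (1 - expectedWin); // big upset → big transfer
  return [winner + delta, loser - delta];
}

// Example: eloUpdate(1500, 1500) → [1516, 1484]
```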

**Validation Checkpoint:**
- [ ] Can submit skills to collective
- [ ] Leaderboard shows top skills accurately
- [ ] Lineage tracking works (can trace ancestry)
- [ ] ELO ratings update based on usage
- [ ] Privacy controls working (anonymized data)

**Total Effort:** ~36 hours (~1.5 weeks)

---

### Phase 7: Testing & Polish (Weeks 15-16)

**Goal:** Comprehensive testing, bug fixes, performance optimization, documentation

**Tasks:**

- [ ] **Task 7.1:** Write comprehensive unit tests
  - Complexity: Large (16h)
  - Dependencies: All features implemented
  - Owner: QA team
  - Deliverable: > 80% code coverage

- [ ] **Task 7.2:** Write integration tests
  - Complexity: Large (12h)
  - Dependencies: All features implemented
  - Owner: QA team
  - Deliverable: All workflows tested end-to-end

- [ ] **Task 7.3:** Write E2E tests (user journeys)
  - Complexity: Medium (10h)
  - Dependencies: All features implemented
  - Owner: QA team
  - Deliverable: Key user journeys automated

- [ ] **Task 7.4:** Performance testing and optimization
  - Complexity: Medium (8h)
  - Dependencies: All features implemented
  - Owner: Performance team
  - Deliverable: Meet all performance targets (< 30s base, < 30min arena, etc.)

- [ ] **Task 7.5:** Bug fixes from testing
  - Complexity: Variable (16h estimated)
  - Dependencies: Tests written
  - Owner: All teams
  - Deliverable: All critical bugs fixed, P1 bugs addressed

- [ ] **Task 7.6:** Write user documentation
  - Complexity: Medium (8h)
  - Dependencies: All features implemented
  - Owner: Docs team
  - Deliverable: README, usage guide, troubleshooting

- [ ] **Task 7.7:** Create example arena results
  - Complexity: Small (4h)
  - Dependencies: Arena working
  - Owner: Docs team
  - Deliverable: Example outputs for documentation

**Validation Checkpoint:**
- [ ] All tests passing (unit, integration, E2E)
- [ ] Performance benchmarks met
- [ ] Zero critical bugs, minimal P1 bugs
- [ ] Documentation complete and clear
- [ ] System ready for production

**Total Effort:** ~74 hours (~3.5 weeks)

---

### Phase 8: Deployment & Rollout (Weeks 17-18)

**Goal:** Deploy to production, gradual rollout, monitoring, iteration

**Tasks:**

- [ ] **Task 8.1:** Deploy serverless API to production
  - Complexity: Small (4h)
  - Dependencies: API tested
  - Owner: DevOps team
  - Deliverable: API live at production URL

- [ ] **Task 8.2:** Set up Pinecone production database
  - Complexity: Small (2h)
  - Dependencies: Schema finalized
  - Owner: DevOps team
  - Deliverable: Production Pinecone index ready

- [ ] **Task 8.3:** Deploy enhanced skill_creating skill
  - Complexity: Small (2h)
  - Dependencies: All features tested
  - Owner: DevOps team
  - Deliverable: Updated SKILL.md deployed

- [ ] **Task 8.4:** Set up monitoring and alerting
  - Complexity: Small (4h)
  - Dependencies: API deployed
  - Owner: DevOps team
  - Deliverable: Track API uptime, error rates, arena completion rates

- [ ] **Task 8.5:** Gradual rollout (beta users first)
  - Complexity: Small (2h)
  - Dependencies: Monitoring set up
  - Owner: Product team
  - Deliverable: 10 beta users test system

- [ ] **Task 8.6:** Gather feedback from beta users
  - Complexity: Medium (8h)
  - Dependencies: Beta rollout
  - Owner: Product team
  - Deliverable: Feedback collected, prioritized

- [ ] **Task 8.7:** Iterate based on feedback
  - Complexity: Medium (8h)
  - Dependencies: Feedback gathered
  - Owner: All teams
  - Deliverable: Top 3-5 improvements implemented

- [ ] **Task 8.8:** Full public launch
  - Complexity: Small (2h)
  - Dependencies: Beta successful
  - Owner: Product team
  - Deliverable: Announcement, full availability

**Validation Checkpoint:**
- [ ] API stable and responding correctly
- [ ] Beta users successfully creating and optimizing skills
- [ ] Monitoring shows healthy metrics
- [ ] Feedback incorporated
- [ ] Ready for public launch

**Total Effort:** ~32 hours (~1.5 weeks)

---

### Task Dependencies Visualization

```
Phase 1 (Foundation):
1.1 (Pinecone) → 1.2 (Embeddings) → 1.3 (Search API)
1.4 (Requirements) ──────────────────┐
1.5 (Pattern Research) ──────────────┤
1.6 (Domain Research) ───────────────┴─→ Phase 2

Phase 2 (Base Generation):
[Phase 1] → 2.1 (Questions) → 2.2 (Weights) → 2.3 (Template) → 2.4 (Deploy)
1.3 (Search) ────────────────────────────────────────┐
2.4 (Deploy) ────────────────────────────────────────┴─→ 2.5 (Workflow)
2.4 (Deploy) → 2.6 (Quick Scoring)

Phase 3 (Arena Core):
3.1 (Orchestrator) ──────────────────────────────────┐
3.2 (Variation Gen) ─────────────────────────────────┤
3.3 (Test Data) ─────────────────────────────────────┤
3.4 (Execution) → 3.5 (Output Capture) ──────────────┼─→ 3.8 (Ranking)
3.6 (Judge) → 3.7 (Bias Mitigation) ─────────────────┘

Phase 4 (Convergence):
[Phase 3] → 4.1 (Tournament Loop) → 4.2 (Convergence) → 4.6 (Notification)
4.1 → 4.3 (Adaptive Sizing)
4.1 → 4.4 (State Persist) → 4.5 (Interruption)

Phase 5 (User Validation):
3.5 (Output Capture) → 5.1 (Local Storage) → 5.2 (Browse UI)
5.2 → 5.3 (Comparison) → 5.4 (User Scoring) → 5.6 (Feedback API)
4.6 (Notification) → 5.5 (Version Compare)

Phase 6 (Collective):
1.1 (Pinecone) ──────────────────────────────────────┐
4.2 (Convergence) → 6.1 (Submit API) → 6.2 (Submit UI)│
6.1 → 6.3 (Champion Compare) ────────────────────────┤
6.1 → 6.4 (Leaderboard API) → 6.5 (Leaderboard UI) ──┤
6.1 → 6.6 (Lineage) ─────────────────────────────────┤
6.1 → 6.7 (ELO) ─────────────────────────────────────┘

Phase 7 (Testing):
[All Phases] → 7.1 (Unit) ────────┐
               7.2 (Integration) ─┼→ 7.5 (Bug Fixes) → 7.6 (Docs)
               7.3 (E2E) ─────────┘ → 7.4 (Performance) → 7.7 (Examples)

Phase 8 (Deployment):
7.5 (Bug Fixes) → 8.1 (API Deploy) ─┐
7.5 → 8.2 (Pinecone Prod) ──────────┼→ 8.4 (Monitoring) → 8.5 (Beta)
7.5 → 8.3 (Skill Deploy) ───────────┘
8.5 → 8.6 (Feedback) → 8.7 (Iterate) → 8.8 (Launch)

Critical Path:
1.1 → 1.2 → 1.3 → 2.5 → 3.1 → 3.6 → 4.1 → 4.2 → 5.1 → 6.1 → 7.5 → 8.8
```

---

### Effort Estimation

**Total Estimated Effort by Phase:**
- Phase 1 (Foundation): 33 hours
- Phase 2 (Base Generation): 34 hours
- Phase 3 (Arena Core): 63 hours
- Phase 4 (Convergence): 31 hours
- Phase 5 (User Validation): 31 hours
- Phase 6 (Collective): 36 hours
- Phase 7 (Testing): 74 hours
- Phase 8 (Deployment): 32 hours

**Total: ~334 hours**

**With 2-person team:**
- ~167 hours per person
- ~21 weeks at 8 hours/week
- **~5 months calendar time**

**With 3-person team:**
- ~111 hours per person
- ~14 weeks at 8 hours/week
- **~3.5 months calendar time**

**Risk Buffer:** +25% (84 hours) for unknowns, integration challenges, and iteration

**Final Estimate with Buffer:**
- 2-person team: **~6-7 months**
- 3-person team: **~4-5 months**

**MVP (Phases 1-5 only):**
- Total: 192 hours
- 2-person team: ~3 months
- 3-person team: ~2 months

---

## Out of Scope

Explicitly NOT included in this release:

### 1. Server Farm Optimization (Future Phase 2)

**What:** Centralized server farms running continuous global tournaments of all submitted skills to find the absolute best across all users.

**Why Out of Scope:**
- Requires significant infrastructure ($$)
- MVP focuses on local execution (user pays own tokens)
- Can be added later without changing architecture
- Future enhancement: 24/7 deep research and optimization

**Future Consideration:** Document for Phase 2, 6-12 months post-launch

---

### 2. Multi-Language Skill Support

**What:** Skills that work across multiple programming languages or natural languages.

**Why Out of Scope:**
- MVP focuses on English, code-agnostic skills
- Adds complexity to question generation and judging
- Limited user demand in initial research

**Future Consideration:** Add if international users request it

---

### 3. Skill Marketplace / Monetization

**What:** Paid skills, premium collective access, skill authors earning revenue.

**Why Out of Scope:**
- Collective is free and open initially
- Monetization complicates launch
- Focus on quality and adoption first

**Future Consideration:** Evaluate after 100+ skills and 1000+ users

---

### 4. IDE Integration Beyond Claude Code

**What:** VS Code extension, JetBrains plugin, etc.

**Why Out of Scope:**
- MVP is Claude Code only
- Different architecture for each IDE
- Limited resources

**Future Consideration:** Partner integrations post-launch

---

### 5. Advanced Judge Models (GPT-4, Gemini, etc.)

**What:** Support for non-Anthropic judge models.

**Why Out of Scope:**
- Anthropic models (Sonnet, Opus) sufficient for MVP
- Cross-platform adds complexity
- Focus on one provider initially

**Future Consideration:** Add multi-model support if users request it

---

### 6. Real-Time Collaboration on Skills

**What:** Multiple users co-creating skills simultaneously.

**Why Out of Scope:**
- MVP is single-user workflow
- Requires collaborative editing infrastructure
- Not in initial requirements

**Future Consideration:** If teams request it

---

### 7. Automated Skill Maintenance

**What:** System automatically updates skills when dependencies change or new best practices emerge.

**Why Out of Scope:**
- Requires continuous monitoring
- Risk of breaking working skills
- Manual updates sufficient for MVP

**Future Consideration:** Auto-suggest updates, user approves

---

### 8. Skill Analytics Dashboard

**What:** Detailed analytics on skill usage, performance, user demographics.

**Why Out of Scope:**
- Basic metrics sufficient for MVP (ratings, usage count)
- Privacy concerns with detailed tracking
- Focus on core functionality first

**Future Consideration:** Opt-in analytics for skill authors

---

## Open Questions & Risks

### Open Questions

#### Q1: What judge model should we use?

**Current Status:** Considering Opus (highest quality) vs Sonnet (faster, cheaper)

**Options:**
- **A)** Always use Opus for judging (consistent, highest quality)
- **B)** Haiku for early rounds, Opus for finals (optimized cost)
- **C)** User chooses in questions (flexibility)

**Owner:** Technical lead

**Deadline:** End of Phase 2 (before arena implementation)

**Impact:** Medium (affects arena runtime and cost)

**Recommendation:** Option B (progressive) - balances quality and cost. Early rounds don't need Opus-level precision, finals do.
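
Under Option B, judge selection reduces to a per-round lookup. A trivial sketch; the model identifiers are placeholders, not confirmed IDs for this package.

```typescript
// Illustrative progressive judge selection: cheap model early, Opus for finals.
export function judgeModel(round: number, totalRounds: number): string {
  const isFinal = round >= totalRounds - 1; // last round gets the strong judge
  return isFinal ? "claude-opus-<version>" : "claude-haiku-<version>"; // placeholder IDs
}
```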

---

#### Q2: How many test scenarios per round?

**Current Status:** Considering 1 vs 3 scenarios per round

**Options:**
- **A)** 1 scenario per round (faster, less reliable)
- **B)** 3 scenarios per round (slower, more reliable)
- **C)** Adaptive based on skill complexity (1 for simple, 3 for complex)

**Owner:** Arena architect

**Deadline:** End of Phase 3 (before convergence testing)

**Impact:** High (affects arena reliability and runtime)

**Recommendation:** Option C (adaptive) - simple skills don't need extensive testing, complex skills do.
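
Combining this recommendation with the round counts from Task 4.3 gives a small lookup. The 2-scenario middle tier for moderate skills is an assumption, since only the simple and complex ends are specified above.

```typescript
// Illustrative adaptive sizing: complexity → rounds (Task 4.3) + scenarios (Q2).
type Complexity = "simple" | "moderate" | "complex";

export function arenaPlan(c: Complexity): { rounds: number; scenariosPerRound: number } {
  switch (c) {
    case "simple":
      return { rounds: 3, scenariosPerRound: 1 };
    case "moderate":
      return { rounds: 5, scenariosPerRound: 2 }; // middle tier is an assumption
    case "complex":
      return { rounds: 7, scenariosPerRound: 3 };
  }
}
```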

---

#### Q3: Should we allow user override of convergence criteria?

**Current Status:** System auto-detects convergence

**Options:**
- **A)** Fully automatic (no user control)
- **B)** User can set max time/rounds (simple override)
- **C)** User can configure all criteria (full control)

**Owner:** Product team

**Deadline:** End of Phase 4 (convergence implementation)

**Impact:** Low (nice-to-have, not critical)

**Recommendation:** Option B (simple override) - most users trust defaults, power users want time control.

---

#### Q4: How should we handle skill activation conflicts?

**Current Status:** Multiple skills might activate on same trigger

**Options:**
- **A)** First match wins (simple, may not be best)
- **B)** Highest-scored skill wins (quality-focused)
- **C)** User prompted to choose (manual, slower)

**Owner:** skill_creating maintainer

**Deadline:** End of Phase 2 (affects base generation)

**Impact:** Medium (affects user experience)

**Recommendation:** Option B (highest-scored) - leverages arena scores for automatic quality selection.

---

#### Q5: Should database submissions be moderated?

**Current Status:** Considering auto-accept vs review queue

**Options:**
- **A)** Auto-accept all submissions (fast, risk of spam)
- **B)** Auto-scan for dangerous patterns, flag suspicious (balanced)
- **C)** Manual review all submissions (slow, thorough)

**Owner:** Security team

**Deadline:** End of Phase 6 (before collective launch)

**Impact:** High (affects collective quality and safety)

**Recommendation:** Option B (auto-scan + flag) - prevents obvious abuse, human review for edge cases.
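
A minimal sketch of such a scanner. The pattern list echoes the risk table below (rm -rf, curl to unknown domains) and is deliberately incomplete; a real scanner would also rate-limit and queue flagged submissions for human review.

```typescript
// Illustrative dangerous-pattern scan for submitted skill content.
const DANGEROUS_PATTERNS: { name: string; re: RegExp }[] = [
  { name: "recursive delete", re: /\brm\s+-rf?\b/ },
  { name: "curl piped to shell", re: /\bcurl\b[^\n]*\|\s*(ba)?sh\b/ },
  { name: "credential exfiltration", re: /(API_KEY|SECRET|TOKEN)[^\n]*\bcurl\b/i },
];

export function scanSkill(content: string): { flagged: boolean; hits: string[] } {
  const hits = DANGEROUS_PATTERNS.filter((p) => p.re.test(content)).map((p) => p.name);
  return { flagged: hits.length > 0, hits };
}
```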
|
|
2249
|
+
|
|
2250
|
+
---
|
|
2251
|
+
|
|
2252
|
+
### Risks & Mitigation
|
|
2253
|
+
|
|
2254
|
+
| Risk | Likelihood | Impact | Severity | Mitigation | Contingency |
|
|
2255
|
+
|------|------------|--------|----------|------------|-------------|
|
|
2256
|
+
| Arena takes longer than 30 min target | Medium | Medium | **Medium** | Adaptive tournament sizing (simple skills fewer rounds), early convergence detection, user can interrupt | User can keep using v0.1 if arena takes too long, improve convergence criteria based on data |
|
|
2257
|
+
| Test data not realistic enough | Medium | High | **High** | Web search for current examples, persona-based generation, evolution refinement, user validation scoring | Cache proven realistic scenarios, allow users to provide examples, improve generation prompts |
|
|
2258
|
+
| Judge produces inconsistent results | Medium | High | **High** | Multiple comparisons with order randomization, chain-of-thought required, user review override, aggregate multiple calls | Flag inconsistent judgments for human review, use ensemble of judges, tune prompts |
|
|
2259
|
+
| User machine offline during arena | Low | Medium | **Low** | State persistence (save progress), resume capability, graceful degradation (partial results usable) | Allow arena restart from last checkpoint, warn users before starting |
|
|
2260
|
+
| Pinecone API down or rate limited | Low | High | **Medium** | Graceful degradation (create skills without search), retry logic with exponential backoff, cache common queries | Fall back to local-only mode, queue submissions for later |
|
|
2261
|
+
| Malicious skill submissions | Low | Critical | **High** | Automated scanning for dangerous patterns (rm -rf, curl to unknown domains), review queue for flagged skills, rate limiting per user, community flagging | Manual review, blocklist, user reputation system |
|
|
2262
|
+
| High API costs (OpenAI embeddings) | Medium | Low | **Low** | Cache embeddings aggressively, batch requests, use efficient models (text-embed-3-small), minimal re-embedding | User pays own API costs, optimize caching strategy |
|
|
2263
|
+
| Low skill submission rate | Medium | Medium | **Medium** | Encourage submissions (special notification if beats champion), gamification (leaderboards), showcase benefits (others using your skill) | Pre-populate database with high-quality seed skills, improve submission UX |
|
|
2264
|
+
| User confusion about v0.1 vs v1.0 | Medium | Low | **Low** | Clear labeling (v0.1 "Quick Base", v1.0 "Arena Optimized"), show improvement metrics (+15 points), user can test both | Improve UI copy, add tooltips, documentation |
|
|
2265
|
+
|
|
2266
|
+
---
|
|
2267
|
+
|
|
2268
|
+
## Validation Checkpoints
|
|
2269
|
+
|
|
2270
|
+
### Checkpoint 1: End of Phase 1 (Foundation)
|
|
2271
|
+
|
|
2272
|
+
**Criteria:**
|
|
2273
|
+
- [ ] Pinecone database set up and queryable
|
|
2274
|
+
- [ ] Can generate embeddings and search semantically
|
|
2275
|
+
- [ ] Search returns relevant results in < 2s
|
|
2276
|
+
- [ ] Can extract requirements from user requests
|
|
2277
|
+
- [ ] Quick research (pattern + domain) completes in < 20s
|
|
2278
|
+
- [ ] Unit tests passing for all components
|
|
2279
|
+
|
|
2280
|
+
**If Failed:**
|
|
2281
|
+
- Debug Pinecone integration
|
|
2282
|
+
- Optimize search query performance
|
|
2283
|
+
- Improve requirement extraction accuracy
|
|
2284
|
+
- Don't proceed to Phase 2 until stable
|
|
2285
|
+
|
|
2286
|
+
---
|
|
2287
|
+
|
|
2288
|
+
### Checkpoint 2: End of Phase 2 (Base Generation)
|
|
2289
|
+
|
|
2290
|
+
**Criteria:**
|
|
2291
|
+
- [ ] Question generator produces relevant domain-specific questions
|
|
2292
|
+
- [ ] User answers convert to reasonable weights
|
|
2293
|
+
- [ ] Base skill (v0.1) generates in < 30s
|
|
2294
|
+
- [ ] v0.1 skill is valid SKILL.md and activates correctly
|
|
2295
|
+
- [ ] Database-first workflow works end-to-end (query → select → deploy)
|
|
2296
|
+
- [ ] Integration tests passing
|
|
2297
|
+
|
|
2298
|
+
**If Failed:**
|
|
2299
|
+
- Refine question templates by domain
|
|
2300
|
+
- Fix skill template generation bugs
|
|
2301
|
+
- Improve database query UI
|
|
2302
|
+
- Don't proceed to Phase 3 until v0.1 generation reliable
|
|
2303
|
+
|
|
2304
|
+
---
|
|
2305
|
+
|
|
2306
|
+
### Checkpoint 3: End of Phase 3 (Arena Core)
|
|
2307
|
+
|
|
2308
|
+
**Criteria:**
|
|
2309
|
+
- [ ] Can generate 3 variations from base skill
|
|
2310
|
+
- [ ] All variations execute with same test input
|
|
2311
|
+
- [ ] Test data is realistic (validated by spot-checking)
|
|
2312
|
+
- [ ] Judge compares outputs and selects winner with reasoning
|
|
2313
|
+
- [ ] Bradley-Terry ranking produces consistent results
|
|
2314
|
+
- [ ] Background orchestration runs without blocking user
|
|
2315
|
+
- [ ] Integration tests for arena passing
|
|
2316
|
+
|
|
2317
|
+
**If Failed:**
|
|
2318
|
+
- Improve variation generator diversity
|
|
2319
|
+
- Enhance test data realism validation
|
|
2320
|
+
- Tune judge prompts for consistency
|
|
2321
|
+
- Don't proceed to Phase 4 until core arena reliable
|
|
2322
|
+
|
|
2323
|
+
---
|
|
2324
|
+
|
|
2325
|
+
### Checkpoint 4: End of Phase 4 (Convergence)
|
|
2326
|
+
|
|
2327
|
+
**Criteria:**
|
|
2328
|
+
- [ ] Arena runs multiple rounds until convergence
|
|
2329
|
+
- [ ] Convergence detects score plateau correctly
|
|
2330
|
+
- [ ] Time and iteration limits prevent infinite loops
|
|
2331
|
+
- [ ] Can interrupt and resume arena from saved state
|
|
2332
|
+
- [ ] Arena completes in < 30 min for moderate skills (test with 5+ examples)
|
|
2333
|
+
- [ ] User receives notification when arena completes
|
|
2334
|
+
|
|
2335
|
+
**If Failed:**
|
|
2336
|
+
- Adjust convergence thresholds (may need < 2% → < 3%)
|
|
2337
|
+
- Optimize arena performance (reduce round time)
|
|
2338
|
+
- Fix state persistence bugs
|
|
2339
|
+
- Don't proceed to Phase 5 until convergence robust
|
|
2340
|
+
|
|
2341
|
+
---
|
|
2342
|
+
|
|
2343
|
+
### Checkpoint 5: End of Phase 5 (User Validation)
|
|
2344
|
+
|
|
2345
|
+
**Criteria:**
|
|
2346
|
+
- [ ] Arena results stored locally and readable
|
|
2347
|
+
- [ ] User can browse all round results
|
|
2348
|
+
- [ ] Side-by-side comparison shows outputs clearly
|
|
2349
|
+
- [ ] User can score outputs (1-5 stars)
|
|
2350
|
+
- [ ] Version comparison (v0.1 vs v1.0) shows improvement metrics
|
|
2351
|
+
- [ ] Feedback submission to database works
|
|
2352
|
+
|
|
2353
|
+
**If Failed:**
|
|
2354
|
+
- Improve UI clarity and usability
|
|
2355
|
+
- Fix local storage bugs
|
|
2356
|
+
- Ensure feedback API working
|
|
2357
|
+
- Don't proceed to Phase 6 until user can validate effectively
|
|
2358
|
+
|
|
2359
|
+
---
|
|
2360
|
+
|
|
2361
|
+
### Checkpoint 6: End of Phase 6 (Collective)
|
|
2362
|
+
|
|
2363
|
+
**Criteria:**
|
|
2364
|
+
- [ ] Can submit skills to collective database
|
|
2365
|
+
- [ ] Submissions include all required metadata (scores, weights, lineage)
|
|
2366
|
+
- [ ] Leaderboard shows top skills accurately
|
|
2367
|
+
- [ ] Lineage tracking works (can trace parent → child)
|
|
2368
|
+
- [ ] ELO ratings update based on usage
|
|
2369
|
+
- [ ] Privacy controls working (anonymized data, no PII)
|
|
2370
|
+
- [ ] Champion comparison detects when user skill beats database leaders
|
|
2371
|
+
|
|
2372
|
+
**If Failed:**
|
|
2373
|
+
- Fix submission API bugs
|
|
2374
|
+
- Improve leaderboard ranking algorithm
|
|
2375
|
+
- Ensure privacy compliance
|
|
2376
|
+
- Don't proceed to Phase 7 until collective reliable
|
|
2377
|
+
|
|
2378
|
+
---
|
|
2379
|
+
|
|
2380
|
+
### Checkpoint 7: End of Phase 7 (Testing)
|
|
2381
|
+
|
|
2382
|
+
**Criteria:**
|
|
2383
|
+
- [ ] Unit test coverage > 80%
|
|
2384
|
+
- [ ] All integration tests passing
|
|
2385
|
+
- [ ] E2E tests cover key user journeys
|
|
2386
|
+
- [ ] Performance benchmarks met:
|
|
2387
|
+
- [ ] Base generation < 30s (p95)
|
|
2388
|
+
- [ ] Database query < 2s (p95)
|
|
2389
|
+
- [ ] Arena completion < 30 min moderate skills (p95)
|
|
2390
|
+
- [ ] Zero critical bugs (P0)
|
|
2391
|
+
- [ ] P1 bugs addressed or accepted as known issues
|
|
2392
|
+
- [ ] Documentation complete (README, usage guide, troubleshooting)
|
|
2393
|
+
|
|
2394
|
+
**If Failed:**
|
|
2395
|
+
- Fix critical bugs before launch
|
|
2396
|
+
- Optimize performance to meet benchmarks
|
|
2397
|
+
- Complete documentation
|
|
2398
|
+
- Don't proceed to Phase 8 until quality gates met
|
|
2399
|
+
|
|
2400
|
+
---
|
|
2401
|
+
|
|
2402
|
+
### Checkpoint 8: Production Launch
|
|
2403
|
+
|
|
2404
|
+
**Criteria (at each rollout stage):**
|
|
2405
|
+
|
|
2406
|
+
**Beta (10 users):**
|
|
2407
|
+
- [ ] At least 5 skills created successfully
|
|
2408
|
+
- [ ] At least 2 arenas completed successfully
|
|
2409
|
+
- [ ] Error rate < 5% (acceptable for beta)
|
|
2410
|
+
- [ ] User feedback collected (survey or interviews)
|
|
2411
|
+
|
|
2412
|
+
**If Beta Failed:**
|
|
2413
|
+
- Fix bugs discovered
|
|
2414
|
+
- Improve UX based on feedback
|
|
2415
|
+
- Iterate before wider rollout
|
|
2416
|
+
|
|
2417
|
+
**Public Launch (all users):**
|
|
2418
|
+
- [ ] Beta successful with no critical issues
|
|
2419
|
+
- [ ] Monitoring shows healthy metrics (API uptime > 99%, error rate < 1%)
|
|
2420
|
+
- [ ] Database has at least 10 seed skills (pre-populated)
|
|
2421
|
+
- [ ] Documentation published and accessible
|
|
2422
|
+
- [ ] Support channel ready (GitHub issues, Discord, etc.)
|
|
2423
|
+
|
|
2424
|
+
**If Public Launch Failed:**
|
|
2425
|
+
- Rollback to beta (disable for most users)
|
|
2426
|
+
- Fix issues
|
|
2427
|
+
- Re-launch when stable
|
|
2428
|
+
|
|
2429
|
+
---
|
|
2430
|
+
|
|
2431
|
+
## Appendix: Task Breakdown Hints
|
|
2432
|
+
|
|
2433
|
+
### Suggested Taskmaster Task Structure
|
|
2434
|
+
|
|
2435
|
+
**Phase 1: Foundation (10 tasks, ~33 hours)**
|
|
2436
|
+
1. Set up Pinecone database and schema (6h) - Dependencies: None
|
|
2437
|
+
2. Implement OpenAI embeddings generation (3h) - Dependencies: Task 1
|
|
2438
|
+
3. Build POST /search API endpoint (8h) - Dependencies: Tasks 1, 2
|
|
2439
|
+
4. Implement requirement extraction (6h) - Dependencies: None
|
|
2440
|
+
5. Build quick pattern research (4h) - Dependencies: None
|
|
2441
|
+
6. Build quick domain research via WebSearch (6h) - Dependencies: None
|
|
2442
|
+
|
|
2443
|
+
**Phase 2: Base Generation (6 tasks, ~34 hours)**
|
|
2444
|
+
7. Build question generator agent (8h) - Dependencies: Tasks 5, 6
|
|
2445
|
+
8. Implement answer-to-weight conversion (5h) - Dependencies: Task 7
|
|
2446
|
+
9. Create base skill template engine (6h) - Dependencies: Tasks 5, 6, 7
|
|
2447
|
+
10. Implement skill deployment automation (3h) - Dependencies: Task 9
|
|
2448
|
+
11. Build database-first workflow integration (8h) - Dependencies: Tasks 3, 10
|
|
2449
|
+
12. Add optional quick scoring of v0.1 (4h) - Dependencies: Task 10
|
|
2450
|
+
|
|
2451
|
+
**Phase 3: Arena Core (8 tasks, ~63 hours)**
|
|
2452
|
+
13. Implement orchestrator-worker pattern via Task tool (12h) - Dependencies: None
14. Build skill variation generator agent (10h) - Dependencies: Task 9
15. Build realistic test data generator with web search (8h) - Dependencies: None
16. Implement skill execution sandbox (8h) - Dependencies: None
17. Build output capture system (4h) - Dependencies: Task 16
18. Implement LLM-as-judge with pairwise comparison (12h) - Dependencies: Task 8
19. Add position bias mitigation (3h) - Dependencies: Task 18
20. Implement Bradley-Terry ranking (6h) - Dependencies: Task 18 (see the sketch after this list)
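
Tasks 18-20 form the judging pipeline. Below is a minimal sketch of the ranking half (Task 20), assuming each pair of variants has already been judged twice, once in each presentation order, to mitigate position bias (Task 19), with outcomes accumulated into a win matrix; the LLM-as-judge call itself (Task 18) is out of scope here.

```python
def bradley_terry(wins, iters=100):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times variant i beat variant j. Counts are
    assumed to include both presentation orders for each pair, so
    position bias is already averaged out. Uses the classic
    minorization-maximization (MM) iteration.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom else strengths[i])
        total = sum(updated)
        strengths = [s / total for s in updated]  # normalize to sum to 1
    return strengths  # higher value = stronger variant


# Example: 3 variants, 4 comparisons per pair (2 rounds x 2 orders)
wins = [
    [0, 3, 4],
    [1, 0, 2],
    [0, 2, 0],
]
print(bradley_terry(wins))  # variant 0 comes out strongest
```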

**Phase 4: Convergence (6 tasks, ~31 hours)**
21. Build tournament iteration loop (8h) - Dependencies: Phase 3 complete
22. Implement convergence detection (6h) - Dependencies: Task 21 (see the sketch after this list)
23. Add adaptive tournament sizing (4h) - Dependencies: Task 21
24. Build state persistence system (6h) - Dependencies: Task 21
25. Implement graceful interruption handling (4h) - Dependencies: Task 24
26. Build arena completion notification (3h) - Dependencies: Task 22
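
The task list leaves "convergence" (Task 22) undefined; one plausible criterion, sketched here under that assumption, is champion stability: stop the tournament loop (Task 21) once the same variant has held the top spot for several consecutive rounds.

```python
def has_converged(champion_history, stable_rounds=3):
    """Convergence heuristic for the tournament loop.

    champion_history: champion IDs, one per completed round.
    Returns True once the same variant has been champion for
    `stable_rounds` consecutive rounds (an assumed threshold).
    """
    if len(champion_history) < stable_rounds:
        return False
    return len(set(champion_history[-stable_rounds:])) == 1


print(has_converged(["v1", "v3", "v3", "v3"]))  # True: "v3" held 3 rounds
print(has_converged(["v1", "v3", "v2", "v3"]))  # False: still churning
```

A score-delta criterion (stop when round-over-round judge scores improve by less than some epsilon) would be an equally reasonable definition.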

**Phase 5: User Validation (6 tasks, ~31 hours)**
27. Create local arena results storage (4h) - Dependencies: Task 17
28. Build results browsing UI (8h) - Dependencies: Task 27
29. Implement side-by-side output comparison (6h) - Dependencies: Task 28
30. Add user scoring interface (4h) - Dependencies: Task 29
31. Build version comparison (v0.1 vs v1.0) (6h) - Dependencies: Task 27
32. Implement feedback submission to database (3h) - Dependencies: Task 30

**Phase 6: Collective (7 tasks, ~36 hours)**
33. Implement POST /submit API endpoint (8h) - Dependencies: Task 1
34. Build submission UI with privacy controls (6h) - Dependencies: Task 26
35. Add champion comparison logic (3h) - Dependencies: Tasks 3, 22
36. Implement GET /leaderboard endpoint (4h) - Dependencies: Task 1
37. Build leaderboard display UI (6h) - Dependencies: Task 36
38. Add skill lineage tracking (4h) - Dependencies: Task 33
39. Implement Elo rating calculation (5h) - Dependencies: Task 33 (see the sketch after this list)
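
The rating math behind Task 39 is standard Elo; here is a minimal sketch, with K=32 as an assumed K-factor (the PRD does not specify one).

```python
def elo_update(rating_winner, rating_loser, k=32):
    """One Elo update after a pairwise result.

    `expected` is the winner's predicted win probability; both
    ratings move by the same amount, so the update is zero-sum.
    """
    expected = 1 / (1 + 10 ** ((rating_loser - rating_winner) / 400))
    delta = k * (1 - expected)
    return rating_winner + delta, rating_loser - delta


print(elo_update(1400, 1600))  # upset: ratings move ~24 points
print(elo_update(1600, 1400))  # expected result: ratings move ~8 points
```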

**Phase 7: Testing (7 tasks, ~74 hours)**
40. Write comprehensive unit tests (16h) - Dependencies: All features
41. Write integration tests (12h) - Dependencies: All features
42. Write E2E tests (10h) - Dependencies: All features
43. Performance testing and optimization (8h) - Dependencies: All features
44. Bug fixes from testing (16h) - Dependencies: Tasks 40-43
45. Write user documentation (8h) - Dependencies: Task 44
46. Create example arena results (4h) - Dependencies: Task 44

**Phase 8: Deployment (8 tasks, ~32 hours)**
47. Deploy serverless API to production (4h) - Dependencies: Task 44
48. Set up Pinecone production database (2h) - Dependencies: Task 44
49. Deploy enhanced skill_creating skill (2h) - Dependencies: Task 44
50. Set up monitoring and alerting (4h) - Dependencies: Task 47
51. Gradual rollout (beta users) (2h) - Dependencies: Task 50
52. Gather feedback from beta users (8h) - Dependencies: Task 51
53. Iterate based on feedback (8h) - Dependencies: Task 52
54. Full public launch (2h) - Dependencies: Task 53

**Total: 54 tasks, ~334 hours**

---

### Parallelizable Tasks

**Can work in parallel:**

**Phase 1:**
- Tasks 1-2 (sequential), Tasks 4-6 (parallel)

**Phase 2:**
- Task 7 (after Tasks 5-6), Tasks 8-10 (mostly sequential), Task 11 (integration), Task 12 (parallel with Task 11)

**Phase 3:**
- Tasks 13-15 (parallel), Tasks 16-17 (sequential), Tasks 18-20 (sequential)
- Two sub-teams: Orchestration (Tasks 13-15) and Execution + Judge (Tasks 16-20)

**Phase 4:**
- Tasks 21-26 mostly sequential (each depends on the previous)

**Phase 5:**
- Tasks 27-32 mostly sequential, but Task 31 can run parallel with Tasks 28-30

**Phase 6:**
- Task 33 first, then Tasks 34-39 in parallel

**Phase 7:**
- Tasks 40-43 parallel, Task 44 after all tests, Tasks 45-46 parallel

**Phase 8:**
- Tasks 47-49 parallel, Task 50 after Task 47, Tasks 51-54 sequential

---

### Critical Path Tasks

**Critical path (longest dependency chain):**
1. Set up Pinecone (Task 1)
2. Build embeddings (Task 2)
3. Build search API (Task 3)
4. Build workflow integration (Task 11)
5. Implement orchestrator (Task 13)
6. Implement judge (Task 18)
7. Build tournament loop (Task 21)
8. Implement convergence (Task 22)
9. Create local storage (Task 27)
10. Implement submit API (Task 33)
11. Bug fixes (Task 44)
12. Public launch (Task 54)

**Critical path duration:** ~93 hours, the sum of the 12 tasks above (~12 weeks at 8h/week with one person on the critical path; see the sanity-check sketch below)
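
As a sanity check on the figure above, here is a small longest-path computation over the 12-task chain, with hours transcribed from the phase lists. It treats the chain as strictly sequential, as the numbered list implies; the phase lists mark some of these tasks (e.g., Task 13) as having no dependencies, so the true critical path may differ.

```python
def critical_path_hours(tasks):
    """Longest-weighted path through a task DAG.

    tasks: {task_id: (hours, [dependency_ids])}
    """
    memo = {}

    def finish_time(task_id):
        if task_id not in memo:
            hours, deps = tasks[task_id]
            memo[task_id] = hours + max((finish_time(d) for d in deps), default=0)
        return memo[task_id]

    return max(finish_time(t) for t in tasks)


# The 12-task chain above, each step gated on the previous one
chain = [(1, 6), (2, 3), (3, 8), (11, 8), (13, 12), (18, 12),
         (21, 8), (22, 6), (27, 4), (33, 8), (44, 16), (54, 2)]
tasks = {tid: (hrs, [chain[i - 1][0]] if i > 0 else [])
         for i, (tid, hrs) in enumerate(chain)}
print(critical_path_hours(tasks))  # -> 93
```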

**With parallel teams (3 people):**
- Critical path person: ~93 hours
- Arena team: 63 hours (Phase 3)
- UI/Validation team: 31 hours (Phase 5)
- **Total calendar time: ~14 weeks (3.5 months)**

---

**End of PRD**

*This PRD is optimized for taskmaster AI task generation. All requirements include task breakdown hints, complexity estimates, and dependency mapping to enable effective automated task planning.*

**Ready for Implementation:** Yes
**Next Steps:** Review PRD → Approve → Begin Phase 1 (Foundation) → taskmaster init → taskmaster generate