aiag-cli 2.2.2 → 2.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (77)
  1. package/README.md +72 -37
  2. package/dist/cli.js +30 -2
  3. package/dist/cli.js.map +1 -1
  4. package/dist/commands/auto.js +45 -41
  5. package/dist/commands/auto.js.map +1 -1
  6. package/dist/commands/feature.d.ts +11 -0
  7. package/dist/commands/feature.d.ts.map +1 -0
  8. package/dist/commands/feature.js +153 -0
  9. package/dist/commands/feature.js.map +1 -0
  10. package/dist/commands/init.d.ts +1 -1
  11. package/dist/commands/init.d.ts.map +1 -1
  12. package/dist/commands/init.js +29 -78
  13. package/dist/commands/init.js.map +1 -1
  14. package/dist/commands/prd.d.ts +12 -0
  15. package/dist/commands/prd.d.ts.map +1 -0
  16. package/dist/commands/prd.js +179 -0
  17. package/dist/commands/prd.js.map +1 -0
  18. package/dist/prompts/coding.d.ts.map +1 -1
  19. package/dist/prompts/coding.js +12 -0
  20. package/dist/prompts/coding.js.map +1 -1
  21. package/dist/prompts/index.d.ts +2 -0
  22. package/dist/prompts/index.d.ts.map +1 -1
  23. package/dist/prompts/index.js +2 -0
  24. package/dist/prompts/index.js.map +1 -1
  25. package/dist/prompts/initializer.d.ts.map +1 -1
  26. package/dist/prompts/initializer.js +6 -0
  27. package/dist/prompts/initializer.js.map +1 -1
  28. package/dist/prompts/prd.d.ts +28 -0
  29. package/dist/prompts/prd.d.ts.map +1 -0
  30. package/dist/prompts/prd.js +105 -0
  31. package/dist/prompts/prd.js.map +1 -0
  32. package/dist/skills/index.d.ts +12 -0
  33. package/dist/skills/index.d.ts.map +1 -0
  34. package/dist/skills/index.js +12 -0
  35. package/dist/skills/index.js.map +1 -0
  36. package/dist/skills/installer.d.ts +38 -0
  37. package/dist/skills/installer.d.ts.map +1 -0
  38. package/dist/skills/installer.js +153 -0
  39. package/dist/skills/installer.js.map +1 -0
  40. package/dist/skills/loader.d.ts +34 -0
  41. package/dist/skills/loader.d.ts.map +1 -0
  42. package/dist/skills/loader.js +134 -0
  43. package/dist/skills/loader.js.map +1 -0
  44. package/dist/skills/runner.d.ts +14 -0
  45. package/dist/skills/runner.d.ts.map +1 -0
  46. package/dist/skills/runner.js +238 -0
  47. package/dist/skills/runner.js.map +1 -0
  48. package/dist/types.d.ts +127 -0
  49. package/dist/types.d.ts.map +1 -1
  50. package/dist/utils/prd.d.ts +21 -0
  51. package/dist/utils/prd.d.ts.map +1 -1
  52. package/dist/utils/prd.js +69 -0
  53. package/dist/utils/prd.js.map +1 -1
  54. package/dist/utils/taskmasterConverter.d.ts +72 -0
  55. package/dist/utils/taskmasterConverter.d.ts.map +1 -0
  56. package/dist/utils/taskmasterConverter.js +401 -0
  57. package/dist/utils/taskmasterConverter.js.map +1 -0
  58. package/dist/utils/taskmasterParser.d.ts +35 -0
  59. package/dist/utils/taskmasterParser.d.ts.map +1 -0
  60. package/dist/utils/taskmasterParser.js +259 -0
  61. package/dist/utils/taskmasterParser.js.map +1 -0
  62. package/package.json +1 -1
  63. package/templates/skills/prd-taskmaster/.taskmaster/docs/prd.md +2571 -0
  64. package/templates/skills/prd-taskmaster/.taskmaster/scripts/execution-state.py +87 -0
  65. package/templates/skills/prd-taskmaster/.taskmaster/scripts/learn-accuracy.py +113 -0
  66. package/templates/skills/prd-taskmaster/.taskmaster/scripts/rollback.sh +71 -0
  67. package/templates/skills/prd-taskmaster/.taskmaster/scripts/security-audit.py +130 -0
  68. package/templates/skills/prd-taskmaster/.taskmaster/scripts/track-time.py +133 -0
  69. package/templates/skills/prd-taskmaster/LICENSE +21 -0
  70. package/templates/skills/prd-taskmaster/README.md +608 -0
  71. package/templates/skills/prd-taskmaster/SKILL.md +1258 -0
  72. package/templates/skills/prd-taskmaster/reference/taskmaster-integration-guide.md +645 -0
  73. package/templates/skills/prd-taskmaster/reference/validation-checklist.md +394 -0
  74. package/templates/skills/prd-taskmaster/scripts/setup-taskmaster.sh +112 -0
  75. package/templates/skills/prd-taskmaster/templates/CLAUDE.md.template +635 -0
  76. package/templates/skills/prd-taskmaster/templates/taskmaster-prd-comprehensive.md +983 -0
  77. package/templates/skills/prd-taskmaster/templates/taskmaster-prd-minimal.md +103 -0
@@ -0,0 +1,2571 @@
1
+ # PRD: Agentic Arena-Based Skill Creation & Optimization System
2
+
3
+ **Author:** anombyte
4
+ **Date:** 2025-01-22
5
+ **Status:** Ready for Implementation
6
+ **Version:** 4.0 (Taskmaster Optimized)
7
+ **Taskmaster Optimized:** Yes
8
+ **Original Version:** v3.0 from skill_creating/planning/PRD.md
9
+
10
+ ---
11
+
12
+ ## Table of Contents
13
+
14
+ 1. [Executive Summary](#executive-summary)
15
+ 2. [Problem Statement](#problem-statement)
16
+ 3. [Goals & Success Metrics](#goals--success-metrics)
17
+ 4. [User Stories](#user-stories)
18
+ 5. [Functional Requirements](#functional-requirements)
19
+ 6. [Non-Functional Requirements](#non-functional-requirements)
20
+ 7. [Technical Considerations](#technical-considerations)
21
+ 8. [Implementation Roadmap](#implementation-roadmap)
22
+ 9. [Out of Scope](#out-of-scope)
23
+ 10. [Open Questions & Risks](#open-questions--risks)
24
+ 11. [Validation Checkpoints](#validation-checkpoints)
25
+ 12. [Appendix: Task Breakdown Hints](#appendix-task-breakdown-hints)
26
+
27
+ ---
28
+
29
+ ## Executive Summary
30
+
31
+ Current Claude Code skill creation is "write once": skills get no optimization or validation. Users need an intelligent system that evolves skills through tournament-style arena battles with empirical testing, comparing real outputs rather than theoretical code quality. The system provides database-first collective knowledge (showing existing skills before creating new ones), delivers a working base skill in under 30 seconds for immediate use, then runs background optimization (25-45 min) using agentic orchestration and LLM-as-judge evaluation. Expected impact: average skill scores improve from 78/100 (base) to 93/100 (optimized), with 50% of requests served from the collective database within 3 months.
32
+
33
+ ---
34
+
35
+ ## Problem Statement
36
+
37
+ ### Current Situation
38
+
39
+ Users create Claude Code skills manually by writing SKILL.md files without:
40
+ - **Quality validation**: No way to know if skill will work well before deploying
41
+ - **Optimization**: Skills are written once and never improved
42
+ - **Collective knowledge**: Each user reinvents solutions others have already created
43
+ - **Empirical testing**: Skills are evaluated by reading code, not running them
44
+ - **Evolution mechanism**: No systematic way to iterate and improve skills
45
+
46
+ **Evidence:**
47
+ - From user requirement: "I can't create PRDs myself. I want the best possible PRD for optimal outcomes."
48
+ - User explicitly stated: "Planning is 95% of the work with vibe coding"
49
+ - Current skill_creating skill guides creation but doesn't optimize or validate
50
+
51
+ ### User Impact
52
+
53
+ - **Who is affected:** Claude Code users creating custom skills (engineers, technical users)
54
+ - **How they're affected:**
55
+ - Spend hours writing skills that may not work well
56
+ - No feedback on skill quality until they use it in production
57
+ - Reinvent solutions others have already created
58
+ - No systematic improvement process
59
+ - Miss opportunities to leverage collective expertise
60
+ - **Severity:** High - Directly impacts development velocity and skill effectiveness
61
+
62
+ ### Business Impact
63
+
64
+ - **Cost of problem:**
65
+ - Wasted time creating suboptimal skills (estimated 2-4 hours per skill)
66
+ - Poor skill quality reduces Claude Code effectiveness
67
+ - User frustration from trial-and-error skill development
68
+ - **Opportunity cost:**
69
+ - Missing collective intelligence benefits (GitHub Copilot-like network effects)
70
+ - Not capitalizing on community improvements
71
+ - Slower Claude Code adoption due to skill creation friction
72
+ - **Strategic importance:**
73
+ - Skills are core differentiator for Claude Code vs competitors
74
+ - Quality skill ecosystem drives user retention and engagement
75
+ - Collective evolution enables exponential improvement vs linear
76
+
77
+ ### Why Solve This Now?
78
+
79
+ 1. **2025 LLM evaluation best practices available**: Arena-Lite architecture, LLM-as-judge patterns, realistic test generation
80
+ 2. **Technical capability ready**: Claude Code Task tool enables background agentic orchestration
81
+ 3. **User demand clear**: Explicit request for "best possible" skills with automated optimization
82
+ 4. **Competitive timing**: First to market with collective skill evolution for AI coding tools
83
+ 5. **Foundation for future**: This enables advanced features (server farms, continuous evolution)
84
+
85
+ ---
86
+
87
+ ## Goals & Success Metrics
88
+
89
+ ### Goal 1: Improve Skill Quality Through Arena Optimization
90
+
91
+ **Description:** Skills optimized through arena battles score significantly higher than base versions
92
+
93
+ **Metric:** Average score improvement (final vs base skill)
94
+
95
+ **Baseline:** 0 points (no optimization exists today)
96
+
97
+ **Target:** +15 points average (e.g., 78/100 base → 93/100 optimized)
98
+
99
+ **Timeframe:** Measured per skill, target achieved for 80% of skills within 30 min arena completion
100
+
101
+ **Measurement Method:** Automated scoring via LLM-as-judge comparing base (v0.1) vs optimized (v1.0) skill outputs
102
+
103
+ ---
104
+
105
+ ### Goal 2: Enable Collective Knowledge Reuse
106
+
107
+ **Description:** Users find and reuse existing high-quality skills instead of recreating
108
+
109
+ **Metric:** Reuse rate (% of skill requests served from collective database)
110
+
111
+ **Baseline:** 0% (no collective database exists)
112
+
113
+ **Target:** 50% of requests match existing skills within 3 months
114
+
115
+ **Timeframe:** 3 months post-launch
116
+
117
+ **Measurement Method:** Track database queries with confidence > 0.8, user selection of existing vs "build custom"
118
+
119
+ ---
120
+
121
+ ### Goal 3: Fast Time-to-First-Value
122
+
123
+ **Description:** Users get working skill immediately while optimization runs in background
124
+
125
+ **Metric:** Time to base skill (v0.1) delivery
126
+
127
+ **Baseline:** N/A (current: manual creation 30-120 min)
128
+
129
+ **Target:** < 30 seconds for base skill generation
130
+
131
+ **Timeframe:** Every skill creation
132
+
133
+ **Measurement Method:** Track timestamp from user request to v0.1 skill deployed and usable
134
+
135
+ ---
136
+
137
+ ### Goal 4: Reliable Arena Completion Time
138
+
139
+ **Description:** Background optimization completes within predictable time windows
140
+
141
+ **Metric:** Arena completion time (p95)
142
+
143
+ **Baseline:** N/A
144
+
145
+ **Target:** < 30 minutes for moderate-complexity skills (p95)
146
+
147
+ **Timeframe:** Every arena execution
148
+
149
+ **Measurement Method:** Track arena start → convergence timestamps, categorize by skill complexity
150
+
151
+ ---
152
+
153
+ ### Goal 5: High User Satisfaction
154
+
155
+ **Description:** Users rate optimized skills highly and adopt the system
156
+
157
+ **Metric:** Average user rating of optimized skills
158
+
159
+ **Baseline:** N/A
160
+
161
+ **Target:** ≥ 4.5/5 stars average
162
+
163
+ **Timeframe:** Ongoing (minimum 50 ratings for statistical validity)
164
+
165
+ **Measurement Method:** Post-execution optional rating prompt (1-5 stars), aggregate in database
166
+
167
+ ---
168
+
169
+ ### Goal 6: Build Thriving Collective Database
170
+
171
+ **Description:** Grow database of high-quality community-contributed skills
172
+
173
+ **Metric:** Total unique skills in collective database
174
+
175
+ **Baseline:** 0 skills
176
+
177
+ **Target:** 100+ skills across diverse domains in 3 months
178
+
179
+ **Timeframe:** 3 months post-launch
180
+
181
+ **Measurement Method:** Count unique skill_id entries in Pinecone database
182
+
183
+ ---
184
+
185
+ ## User Stories
186
+
187
+ ### Story 1: Database-First Skill Discovery
188
+
189
+ **As a** Claude Code user,
190
+ **I want to** see existing high-quality skills before creating a new one,
191
+ **So that I can** reuse proven solutions instead of reinventing.
192
+
193
+ **Acceptance Criteria:**
194
+ - [ ] System queries Pinecone database before generating new skill
195
+ - [ ] Shows matching skills with arena scores (e.g., 91.5/100)
196
+ - [ ] Shows user ratings (e.g., ⭐4.7/5 from 342 users)
197
+ - [ ] Shows last updated timestamp and key features
198
+ - [ ] User can select existing skill or choose "Build custom"
199
+ - [ ] Confidence scoring for matches (matches above 0.8 are shown; below 0.8 the system asks clarifying questions)
200
+ - [ ] Results appear within 2 seconds of query
201
+
202
+ **Task Breakdown Hint:**
203
+ - Task 1.1: Implement Pinecone vector search integration (6h)
204
+ - Task 1.2: Build requirement fingerprint generation (4h)
205
+ - Task 1.3: Create search results display UI (5h)
206
+ - Task 1.4: Add confidence scoring and ranking logic (3h)
207
+ - Task 1.5: Implement user selection workflow (3h)
208
+ - Task 1.6: Write tests for search accuracy (4h)
209
+
210
+ **Dependencies:** R11 (Collective API & Database), R4 (Requirement Extraction)
211
+
212
+ ---
213
+
214
+ ### Story 2: Progressive Skill Delivery (Quick Base → Optimized)
215
+
216
+ **As a** Claude Code user,
217
+ **I want to** get a working skill immediately while optimization runs in background,
218
+ **So that I can** start using it right away without waiting 30 minutes.
219
+
220
+ **Acceptance Criteria:**
221
+ - [ ] Quick initial research completes in <30 seconds
222
+ - [ ] Base skill (v0.1) generated and deployed immediately
223
+ - [ ] User can use v0.1 skill while arena runs in background
224
+ - [ ] Optional quick scoring of v0.1 (user can decline)
225
+ - [ ] Background arena starts automatically after v0.1 delivery
226
+ - [ ] User receives notification when optimized v1.0 ready
227
+ - [ ] Comparison shows improvement metrics (e.g., +15 points)
228
+ - [ ] User can review outputs and choose to deploy v1.0 or keep v0.1
229
+
230
+ **Task Breakdown Hint:**
231
+ - Task 2.1: Implement dual-track research system (quick + deep) (8h)
232
+ - Task 2.2: Build base skill generation from quick research (6h)
233
+ - Task 2.3: Create background job orchestration via Task tool (10h)
234
+ - Task 2.4: Implement optional quick scoring workflow (4h)
235
+ - Task 2.5: Build notification system for arena completion (3h)
236
+ - Task 2.6: Create comparison UI for v0.1 vs v1.0 (6h)
237
+ - Task 2.7: Write tests for progressive delivery flow (5h)
238
+
239
+ **Dependencies:** R2 (Dual-Track Research), R10 (Background Execution), R14 (Adaptive Complexity)
240
+
241
+ ---
242
+
243
+ ### Story 3: Agentic Question Generation
244
+
245
+ **As a** Claude Code user,
246
+ **I want** domain-specific questions about my requirements,
247
+ **So that** the system can optimize for what matters most to me.
248
+
249
+ **Acceptance Criteria:**
250
+ - [ ] System analyzes skill domain from user request
251
+ - [ ] Generates 3-7 questions specific to domain (e.g., PRD vs PDF extraction)
252
+ - [ ] Questions map to evaluation weight priorities
253
+ - [ ] Example question: "What's most important: completeness or speed?"
254
+ - [ ] User answers converted to weighted criteria (e.g., Quality: 64%, Speed: 5%)
255
+ - [ ] Default weights provided if user skips questions
256
+ - [ ] Question generation completes within quick research phase (<30s)
257
+
258
+ **Task Breakdown Hint:**
259
+ - Task 3.1: Build domain analysis agent (6h)
260
+ - Task 3.2: Create question generation templates by domain (8h)
261
+ - Task 3.3: Implement answer-to-weight conversion logic (5h)
262
+ - Task 3.4: Add default weight fallback (2h)
263
+ - Task 3.5: Write tests for question relevance (4h)
264
+
265
+ **Dependencies:** R3 (Agentic Question Generation), AGENTIC_WEIGHTING_SOLUTIONS.md integration
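+
+ For illustration, a minimal sketch of the answer-to-weight conversion in Task 3.3 is shown below; the 1-5 priority scale and the helper name are assumptions, only the four dimensions and the "weights sum to 100" behavior come from this PRD.
+
+ ```python
+ def answers_to_weights(answers, dimensions=("completeness", "clarity", "quality", "efficiency")):
+     """Convert user priority answers (1-5 per dimension) into weights that sum to 100."""
+     defaults = {d: 25 for d in dimensions}  # used when the user skips questions
+     if not answers:
+         return defaults
+     raw = {d: max(1, int(answers.get(d, 3))) for d in dimensions}
+     total = sum(raw.values())
+     weights = {d: round(100 * v / total) for d, v in raw.items()}
+     # Fix rounding drift so the weights always sum to exactly 100
+     drift = 100 - sum(weights.values())
+     weights[max(weights, key=weights.get)] += drift
+     return weights
+
+ # Example: a user who cares most about quality
+ # answers_to_weights({"quality": 5, "completeness": 3, "clarity": 2, "efficiency": 1})
+ # -> {'completeness': 27, 'clarity': 18, 'quality': 46, 'efficiency': 9}
+ ```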
266
+
267
+ ---
268
+
269
+ ### Story 4: Tournament Arena with Empirical Testing
270
+
271
+ **As a** system administrator,
272
+ **I want** skills to compete in arena battles using real outputs,
273
+ **So that** winners are selected based on empirical quality, not theoretical code review.
274
+
275
+ **Acceptance Criteria:**
276
+ - [ ] Arena generates 3 skill variations (A, B, C) in Round 1
277
+ - [ ] All variations execute with identical realistic test input
278
+ - [ ] System captures complete real outputs (PRDs, code, data, etc.)
279
+ - [ ] LLM judge compares outputs directly (not code)
280
+ - [ ] Judge uses weighted criteria from user questions
281
+ - [ ] Pairwise comparisons with position bias mitigation (randomized order)
282
+ - [ ] Bradley-Terry ranking determines winner
283
+ - [ ] Winner advances to next round vs 2 new refined variations
284
+ - [ ] Arena stops when convergence detected (score plateau, time limit, or target achieved)
285
+ - [ ] Maximum 10 rounds or 30 minutes (whichever comes first)
286
+
287
+ **Task Breakdown Hint:**
288
+ - Task 4.1: Implement skill variation generator agent (10h)
289
+ - Task 4.2: Build skill execution isolation sandbox (8h)
290
+ - Task 4.3: Create output capture system (4h)
291
+ - Task 4.4: Implement LLM-as-judge with pairwise comparison (12h)
292
+ - Task 4.5: Add Bradley-Terry ranking algorithm (6h)
293
+ - Task 4.6: Build convergence detection logic (5h)
294
+ - Task 4.7: Create tournament orchestration loop (8h)
295
+ - Task 4.8: Write comprehensive arena tests (10h)
296
+
297
+ **Dependencies:** R5 (Realistic Test Data), R6 (Tournament Arena), R7 (Skill Execution), R8 (LLM-as-Judge), R9 (Convergence)
298
+
299
+ ---
300
+
301
+ ### Story 5: Realistic Test Data Generation
302
+
303
+ **As a** system administrator,
304
+ **I want** realistic test scenarios for skill evaluation,
305
+ **So that** arena battles reflect real-world usage, not toy examples.
306
+
307
+ **Acceptance Criteria:**
308
+ - [ ] Agent discovers realistic use cases via web search
309
+ - [ ] LLM takes persona appropriate to skill type (e.g., "Product Manager" for PRD skills)
310
+ - [ ] Generates realistic input data (not "Create a PRD" but "Add password reset to fintech SaaS app")
311
+ - [ ] Test scenarios evolve across rounds (simple → complex → edge case)
312
+ - [ ] Validates realism through domain pattern matching
313
+ - [ ] Caches validated scenarios in database for reuse
314
+ - [ ] Each skill tested with minimum 1 realistic scenario per round
315
+
316
+ **Task Breakdown Hint:**
317
+ - Task 5.1: Build test data generator agent with web search (8h)
318
+ - Task 5.2: Create persona-based scenario generation (6h)
319
+ - Task 5.3: Implement scenario evolution logic (5h)
320
+ - Task 5.4: Add realism validation (4h)
321
+ - Task 5.5: Build scenario caching in Pinecone (4h)
322
+ - Task 5.6: Write tests for scenario quality (4h)
323
+
324
+ **Dependencies:** R5 (Realistic Test Data Generation), Pinecone database
325
+
326
+ ---
327
+
328
+ ### Story 6: User-in-the-Loop Validation
329
+
330
+ **As a** Claude Code user,
331
+ **I want to** review arena outputs and score them myself,
332
+ **So that** I can validate automated judgments and provide feedback.
333
+
334
+ **Acceptance Criteria:**
335
+ - [ ] All arena results stored locally in `.claude/skills/[skill-name]/arena_results/`
336
+ - [ ] Results stored as JSON with inputs, outputs, scores, reasoning
337
+ - [ ] User can browse results after arena completion
338
+ - [ ] UI shows side-by-side comparison of outputs
339
+ - [ ] User can score each output (1-5 stars)
340
+ - [ ] User feedback submitted to database (opt-in)
341
+ - [ ] Feedback improves future arena weights and judgments
342
+
343
+ **Task Breakdown Hint:**
344
+ - Task 6.1: Create local arena results storage system (4h)
345
+ - Task 6.2: Build results browsing UI (8h)
346
+ - Task 6.3: Implement side-by-side output comparison (6h)
347
+ - Task 6.4: Add user scoring interface (4h)
348
+ - Task 6.5: Build feedback submission to database (3h)
349
+ - Task 6.6: Write tests for validation flow (4h)
350
+
351
+ **Dependencies:** R12 (User-in-the-Loop Validation), R13 (Feedback Collection)
352
+
353
+ ---
354
+
355
+ ### Story 7: Collective Submission & Leaderboards
356
+
357
+ **As a** Claude Code user,
358
+ **I want to** submit my optimized skill to the collective database,
359
+ **So that** others can benefit and my skill can evolve further.
360
+
361
+ **Acceptance Criteria:**
362
+ - [ ] After arena completion, special notification if skill beats database champions
363
+ - [ ] Opt-in prompt to submit to collective
364
+ - [ ] Submission includes: skill content, arena scores (dimensional + overall), weights used, generation/lineage
365
+ - [ ] Privacy-conscious: input hash (not actual input), output samples (500 chars), anonymous user ID
366
+ - [ ] Skills appear in search results for future users
367
+ - [ ] Leaderboard shows top skills by domain
368
+ - [ ] ELO ratings update based on usage and feedback
369
+
370
+ **Task Breakdown Hint:**
371
+ - Task 7.1: Build champion comparison logic (3h)
372
+ - Task 7.2: Create submission UI with privacy controls (6h)
373
+ - Task 7.3: Implement skill submission API endpoint (8h)
374
+ - Task 7.4: Build leaderboard display (6h)
375
+ - Task 7.5: Add ELO rating calculation (5h)
376
+ - Task 7.6: Write tests for submission flow (5h)
377
+
378
+ **Dependencies:** R11 (Collective API & Database), R13 (Feedback Collection), skill lineage tracking
379
+
380
+ ---
381
+
382
+ ## Functional Requirements
383
+
384
+ ### Must Have (P0) - Critical for MVP
385
+
386
+ #### REQ-001: Database-First Query System
387
+
388
+ **Description:** System MUST query Pinecone collective database before generating new skills, showing existing matches to enable reuse.
389
+
390
+ **Acceptance Criteria:**
391
+ - [ ] Extract requirements from user request ("Create PRD skill for comprehensive planning")
392
+ - [ ] Generate requirement fingerprint (SHA-256 hash of structured requirements)
393
+ - [ ] Query Pinecone with semantic search (embedding + metadata filtering)
394
+ - [ ] Return matches with confidence scores (0-1 scale)
395
+ - [ ] Show skills with confidence > 0.7
396
+ - [ ] Display: name, domain, arena scores (dimensional + overall), user ratings, last updated
397
+ - [ ] User can select existing skill or choose "Build custom"
398
+ - [ ] Query completes in < 2 seconds
399
+
400
+ **Technical Specification:**
401
+ ```typescript
402
+ interface SkillSearchRequest {
403
+ userRequest: string;
404
+ domain?: string; // Optional: "prd-generation", "pdf-extraction", etc.
405
+ }
406
+
407
+ interface SkillSearchResult {
408
+ skillId: string;
409
+ name: string;
410
+ domain: string;
411
+ confidence: number; // 0-1
412
+ scores: {
413
+ completeness: number;
414
+ clarity: number;
415
+ quality: number;
416
+ efficiency: number;
417
+ overall: number;
418
+ };
419
+ feedback: {
420
+ avgRating: number;
421
+ totalRatings: number;
422
+ successRate: number;
423
+ };
424
+ lastUpdated: string; // ISO 8601
425
+ keyFeatures: string[];
426
+ }
427
+
428
+ // API Call
429
+ POST /api/collective/search
430
+ {
431
+ "userRequest": "Create comprehensive PRD skill",
432
+ "domain": "prd-generation"
433
+ }
434
+
435
+ // Response
436
+ {
437
+ "matches": [
438
+ {
439
+ "skillId": "uuid-123",
440
+ "name": "Comprehensive PRD Generator",
441
+ "confidence": 0.92,
442
+ "scores": { "overall": 91.5, ... },
443
+ "feedback": { "avgRating": 4.7, "totalRatings": 342 },
444
+ ...
445
+ }
446
+ ],
447
+ "queryTime": 1.2
448
+ }
449
+ ```
450
+
451
+ **Task Breakdown:**
452
+ - Implement requirement extraction and fingerprinting: Medium (6h)
453
+ - Build Pinecone semantic search integration: Medium (8h)
454
+ - Add confidence scoring logic: Small (4h)
455
+ - Create search results display: Medium (6h)
456
+ - Write integration tests: Small (4h)
457
+
458
+ **Dependencies:** Pinecone database setup, embedding model (OpenAI text-embedding-3-small)
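+
+ A minimal sketch of the requirement fingerprinting step from the acceptance criteria above; the structured-requirement field names are illustrative, only "SHA-256 hash of structured requirements" comes from this spec.
+
+ ```python
+ import hashlib
+ import json
+
+ def requirement_fingerprint(structured_requirements: dict) -> str:
+     """SHA-256 over a canonical JSON form, so equivalent requirements hash identically."""
+     canonical = json.dumps(structured_requirements, sort_keys=True, separators=(",", ":"))
+     return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
+
+ # Example (field names are illustrative):
+ # requirement_fingerprint({
+ #     "domain": "prd-generation",
+ #     "goal": "comprehensive PRD skill",
+ #     "constraints": ["taskmaster-compatible"],
+ # })
+ ```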
459
+
460
+ ---
461
+
462
+ #### REQ-002: Quick Base Skill Generation (v0.1)
463
+
464
+ **Description:** Generate working base skill in < 30 seconds using quick initial research, deployable immediately.
465
+
466
+ **Acceptance Criteria:**
467
+ - [ ] Quick pattern research: Scan awesome-claude-skills examples (< 10s)
468
+ - [ ] Quick domain research: Basic WebSearch for best practices (< 15s)
469
+ - [ ] Generate domain-specific questions (3-7 questions) (< 5s)
470
+ - [ ] User answers questions or accepts defaults
471
+ - [ ] Generate base SKILL.md from quick research + answers (< 5s)
472
+ - [ ] Deploy to `.claude/skills/[skill-name]/SKILL.md`
473
+ - [ ] Skill immediately usable (activates on triggers)
474
+ - [ ] Total time user request → deployed v0.1: < 30 seconds (excluding user answer time)
475
+
476
+ **Technical Specification:**
477
+ ```python
478
+ # Quick Research Flow
479
+ async def generate_base_skill(user_request, user_answers=None):
480
+ # Phase 1: Quick Research (parallel, 15s total)
481
+ pattern_research = quick_scan_patterns(user_request) # 10s
482
+ domain_research = quick_web_search(user_request) # 15s (parallel)
483
+
484
+ # Phase 2: Question Generation (5s)
485
+ questions = generate_questions(pattern_research, domain_research)
486
+
487
+ # User answers (time not counted)
488
+ answers = user_answers or await user.answer(questions) or get_defaults()
489
+
490
+ # Phase 3: Skill Generation (5s)
491
+ skill_content = generate_skill_md(
492
+ pattern_research,
493
+ domain_research,
494
+ answers
495
+ )
496
+
497
+ # Phase 4: Deployment (1s)
498
+ deploy_skill(skill_content, skill_name)
499
+
500
+ return {"version": "0.1", "path": skill_path, "time": elapsed}
501
+ ```
502
+
503
+ **Task Breakdown:**
504
+ - Implement quick pattern scanner: Medium (6h)
505
+ - Build quick domain research: Small (4h)
506
+ - Create question generator agent: Medium (8h)
507
+ - Implement base skill template engine: Medium (6h)
508
+ - Add deployment automation: Small (3h)
509
+ - Write tests for 30s SLA: Small (4h)
510
+
511
+ **Dependencies:** WebSearch tool, pattern database (awesome-claude-skills)
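+
+ The `deploy_skill` step above is left abstract; a minimal sketch is shown below. The `name`/`description` frontmatter fields and the `collect-feedback` flag come from this PRD's skill and privacy sections, while the `version` field and the exact frontmatter layout are assumptions.
+
+ ```python
+ from pathlib import Path
+
+ def deploy_skill(skill_content: str, skill_name: str, description: str = "") -> str:
+     """Write a v0.1 SKILL.md under ~/.claude/skills/<skill-name>/ so it activates immediately."""
+     skill_dir = Path.home() / ".claude" / "skills" / skill_name
+     skill_dir.mkdir(parents=True, exist_ok=True)
+     frontmatter = (
+         "---\n"
+         f"name: {skill_name}\n"
+         f"description: {description}\n"
+         "version: 0.1\n"              # assumption: version tracked in frontmatter
+         "collect-feedback: true\n"    # opt-in flag described under Privacy
+         "---\n\n"
+     )
+     skill_path = skill_dir / "SKILL.md"
+     skill_path.write_text(frontmatter + skill_content, encoding="utf-8")
+     return str(skill_path)
+ ```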
512
+
513
+ ---
514
+
515
+ #### REQ-003: Background Arena Orchestration
516
+
517
+ **Description:** After v0.1 delivery, automatically start background arena optimization using Task tool for non-blocking execution.
518
+
519
+ **Acceptance Criteria:**
520
+ - [ ] Arena starts automatically after v0.1 deployed
521
+ - [ ] User can continue working (non-blocking)
522
+ - [ ] Orchestrator-worker pattern: central orchestrator + worker agents
523
+ - [ ] Workers run in parallel via Task tool
524
+ - [ ] State persistence (can resume if interrupted)
525
+ - [ ] Progress indicator (optional, non-intrusive)
526
+ - [ ] Graceful handling of interruptions
527
+ - [ ] User can manually stop optimization
528
+ - [ ] Arena completes in < 30 min for moderate skills (p95)
529
+
530
+ **Technical Specification:**
531
+ ```typescript
532
+ class ArenaOrchestrator {
533
+ skillName: string;
534
+ baseSkill: string; // v0.1 content
535
+ userWeights: WeightConfig;
536
+ complexity: number; // 1-10 scale
537
+
538
+ async run(): Promise<ArenaResult> {
539
+ // Start background job
540
+ const jobId = await Task.start({
541
+ subagent_type: "arena-orchestrator",
542
+ prompt: `Run arena optimization for ${skillName}...`,
543
+ async: true
544
+ });
545
+
546
+ // State machine
547
+ while (!converged) {
548
+ // Round N
549
+ variations = await generateVariations(currentBest);
550
+ testData = await generateRealisticTest(round);
551
+ outputs = await Promise.all(
552
+ variations.map(v => executeSkill(v, testData))
553
+ );
554
+ scores = await judgeOutputs(outputs, userWeights);
555
+ currentBest = selectWinner(scores);
556
+
557
+ // Check convergence
558
+ if (shouldStop(scores, elapsed, rounds)) {
559
+ converged = true;
560
+ }
561
+ }
562
+
563
+ return { winner: currentBest, rounds, time: elapsed };
564
+ }
565
+ }
566
+ ```
567
+
568
+ **Task Breakdown:**
569
+ - Implement orchestrator-worker pattern: Large (12h)
570
+ - Build state persistence system: Medium (6h)
571
+ - Add graceful interruption handling: Small (4h)
572
+ - Create progress tracking (optional): Small (3h)
573
+ - Write orchestration tests: Medium (8h)
574
+
575
+ **Dependencies:** Claude Code Task tool, R6 (Tournament Arena), R9 (Convergence Detection)
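+
+ A minimal sketch of the state persistence mentioned in the acceptance criteria (the file name and state fields are illustrative; only "persist so an interrupted arena can resume" comes from this spec):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ STATE_FILE = "arena_state.json"  # illustrative name, stored next to the arena results
+
+ def save_state(results_dir: Path, state: dict) -> None:
+     """Persist orchestrator state after every round so an interrupted arena can resume."""
+     results_dir.mkdir(parents=True, exist_ok=True)
+     (results_dir / STATE_FILE).write_text(json.dumps(state, indent=2), encoding="utf-8")
+
+ def load_state(results_dir: Path) -> dict:
+     """Return saved state, or a fresh round-0 state if nothing was persisted."""
+     path = results_dir / STATE_FILE
+     if path.exists():
+         return json.loads(path.read_text(encoding="utf-8"))
+     return {"round": 0, "history": [], "current_best": None, "converged": False}
+ ```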
576
+
577
+ ---
578
+
579
+ #### REQ-004: Realistic Test Data Generation
580
+
581
+ **Description:** Generate realistic test scenarios using web search and persona-based LLM generation, not generic toy examples.
582
+
583
+ **Acceptance Criteria:**
584
+ - [ ] Web search for domain-specific realistic examples
585
+ - [ ] LLM takes persona appropriate to skill (e.g., "Product Manager" for PRD)
586
+ - [ ] Generates realistic input matching real-world complexity
587
+ - [ ] Example: NOT "Create a PRD" but "Add OAuth 2.0 authentication supporting Google, Microsoft, GitHub to B2B SaaS app with existing auth system, 2FA support, SOC 2 compliance"
588
+ - [ ] Scenarios evolve across rounds: Round 1 (simple), Round 2 (complex), Round 3 (edge case)
589
+ - [ ] Validates realism via domain pattern matching
590
+ - [ ] Caches validated scenarios in Pinecone for reuse
591
+
592
+ **Technical Specification:**
593
+ ```python
594
+ # Realistic Test Data Generation
595
+ class RealisticTestGenerator:
596
+ def generate(self, skill_domain, round_num):
597
+ # Step 1: Web search for real examples
598
+ examples = web_search(f"{skill_domain} use cases 2025")
599
+
600
+ # Step 2: Take persona
601
+ persona = get_persona(skill_domain)
602
+ # e.g., "Product Manager at B2B SaaS company"
603
+
604
+ # Step 3: Generate realistic scenario
605
+ scenario = llm.generate(
606
+ prompt=f"""You are a {persona}.
607
+ Generate a realistic request for {skill_domain}.
608
+ Based on these real-world examples: {examples}
609
+
610
+ Complexity level: {get_complexity(round_num)}
611
+ - Round 1: Simple, common use case
612
+ - Round 2: Complex, multi-part scenario
613
+ - Round 3: Edge case, unusual constraints
614
+
615
+ Be specific, include real constraints and context."""
616
+ )
617
+
618
+ # Step 4: Validate realism
619
+ if validate_realism(scenario, examples):
620
+ cache_scenario(skill_domain, scenario)
621
+ return scenario
622
+ else:
623
+ return self.generate(skill_domain, round_num)  # Retry
624
+ ```
625
+
626
+ **Task Breakdown:**
627
+ - Build web search integration for examples: Medium (5h)
628
+ - Create persona mapping by domain: Small (4h)
629
+ - Implement scenario generation with evolution: Medium (8h)
630
+ - Add realism validation: Medium (5h)
631
+ - Build scenario caching in Pinecone: Small (4h)
632
+ - Write tests for scenario quality: Medium (6h)
633
+
634
+ **Dependencies:** WebSearch tool, Pinecone database, LLM access
635
+
636
+ ---
637
+
638
+ #### REQ-005: Arena Tournament with Pairwise Comparison
639
+
640
+ **Description:** Run tournament battles using Arena-Lite architecture with direct pairwise comparison of real skill outputs.
641
+
642
+ **Acceptance Criteria:**
643
+ - [ ] Round 1: Generate 3 variations (A, B, C)
644
+ - [ ] Execute all variations with identical test input
645
+ - [ ] Capture complete real outputs (not truncated)
646
+ - [ ] Judge compares outputs pairwise (A vs B, B vs C, A vs C)
647
+ - [ ] Judge uses weighted criteria from user questions
648
+ - [ ] Position bias mitigation: randomize output order
649
+ - [ ] Bradley-Terry ranking from pairwise results
650
+ - [ ] Winner (highest rank) advances to next round
651
+ - [ ] Next round: Winner vs 2 new refined variations
652
+ - [ ] Repeat until convergence
653
+ - [ ] Maximum 10 variations tested per round (performance limit)
654
+
655
+ **Technical Specification:**
656
+ ```python
657
+ # Arena Tournament Flow
658
+ class ArenaTournament:
659
+ def run_round(self, variations, test_input, weights):
660
+ # 1. Execute all variations
661
+ outputs = []
662
+ for var in variations:
663
+ output = execute_skill(var, test_input)
664
+ outputs.append({
665
+ "variation": var.id,
666
+ "output": output,
667
+ "exec_time": elapsed,
668
+ "tokens": token_count
669
+ })
670
+
671
+ # 2. Pairwise comparisons
672
+ comparisons = []
673
+ for i, out_a in enumerate(outputs):
674
+ for out_b in outputs[i+1:]:
675
+ # Randomize order to mitigate position bias
676
+ order = random.choice(['AB', 'BA'])
677
+
678
+ result = judge.compare(
679
+ output_a=(out_a if order == 'AB' else out_b)["output"],
680
+ output_b=(out_b if order == 'AB' else out_a)["output"],
681
+ weights=weights,
682
+ criteria=["completeness", "clarity", "quality", "efficiency"]
683
+ )
684
+
685
+ comparisons.append({
686
+ "pair": (out_a.variation, out_b.variation),
687
+ "winner": result.winner,
688
+ "reasoning": result.reasoning,
689
+ "scores": result.dimensional_scores
690
+ })
691
+
692
+ # 3. Bradley-Terry ranking
693
+ rankings = bradley_terry_rank(comparisons)
694
+
695
+ # 4. Select winner
696
+ winner = rankings[0]
697
+
698
+ return {
699
+ "winner": winner,
700
+ "rankings": rankings,
701
+ "outputs": outputs,
702
+ "comparisons": comparisons
703
+ }
704
+ ```
705
+
706
+ **Task Breakdown:**
707
+ - Implement skill variation generator: Large (10h)
708
+ - Build skill execution sandbox: Medium (8h)
709
+ - Create output capture system: Small (4h)
710
+ - Implement pairwise judge with bias mitigation: Large (12h)
711
+ - Add Bradley-Terry ranking: Medium (6h)
712
+ - Build tournament loop: Medium (8h)
713
+ - Write comprehensive tests: Large (10h)
714
+
715
+ **Dependencies:** R7 (Skill Execution), R8 (LLM-as-Judge), Arena-Lite algorithm
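+
+ The `bradley_terry_rank(comparisons)` call above is not specified further; below is one possible implementation using the standard Bradley-Terry MM update, assuming each comparison's `winner` has already been mapped back from the judge's positional "A"/"B" verdict to a variation id (or "TIE").
+
+ ```python
+ from collections import defaultdict
+
+ def bradley_terry_rank(comparisons, iterations=100):
+     """Rank variations from pairwise results via the Bradley-Terry MM update."""
+     wins = defaultdict(float)   # wins[(i, j)] = number of times i beat j (ties count 0.5 each)
+     players = set()
+     for c in comparisons:
+         a, b = c["pair"]
+         players.update((a, b))
+         if c["winner"] == "TIE":
+             wins[(a, b)] += 0.5
+             wins[(b, a)] += 0.5
+         else:
+             loser = b if c["winner"] == a else a
+             wins[(c["winner"], loser)] += 1.0
+
+     strength = {p: 1.0 for p in players}
+     for _ in range(iterations):
+         updated = {}
+         for i in players:
+             total_wins = sum(wins[(i, j)] for j in players if j != i)
+             denom = sum(
+                 (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
+                 for j in players if j != i
+             )
+             updated[i] = total_wins / denom if denom > 0 else strength[i]
+         norm = sum(updated.values()) or 1.0
+         strength = {p: s / norm for p, s in updated.items()}
+
+     return sorted(players, key=lambda p: strength[p], reverse=True)
+ ```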
716
+
717
+ ---
718
+
719
+ #### REQ-006: LLM-as-Judge with Weighted Evaluation
720
+
721
+ **Description:** Separate judge model evaluates real skill outputs using user-configured weighted criteria with chain-of-thought reasoning.
722
+
723
+ **Acceptance Criteria:**
724
+ - [ ] Judge model separate from skill execution model (avoid self-evaluation bias)
725
+ - [ ] Judges real outputs, NOT skill code
726
+ - [ ] Weighted dimensions: Completeness, Clarity, Quality, Efficiency (user-configured)
727
+ - [ ] Dimensional scores (0-100) with evidence from outputs
728
+ - [ ] Chain-of-thought reasoning before final score
729
+ - [ ] Anti-verbosity instructions (prefer concise accurate answers)
730
+ - [ ] Position bias mitigation (randomize output order)
731
+ - [ ] Store detailed reasoning for user review
732
+ - [ ] Integration with agentic weighting (failure mode analysis)
733
+
734
+ **Technical Specification:**
735
+ ```typescript
736
+ interface JudgeRequest {
737
+ outputA: string;
738
+ outputB: string;
739
+ weights: {
740
+ completeness: number; // 0-100, sums to 100
741
+ clarity: number;
742
+ quality: number;
743
+ efficiency: number;
744
+ };
745
+ domain: string;
746
+ }
747
+
748
+ interface JudgeResponse {
749
+ winner: "A" | "B" | "TIE";
750
+ reasoning: string; // Chain-of-thought
751
+ dimensionalScores: {
752
+ A: { completeness: number; clarity: number; quality: number; efficiency: number; };
753
+ B: { completeness: number; clarity: number; quality: number; efficiency: number; };
754
+ };
755
+ overallScores: {
756
+ A: number; // Weighted average
757
+ B: number;
758
+ };
759
+ evidence: {
760
+ dimension: string;
761
+ winner: "A" | "B";
762
+ example: string; // Specific quote from output
763
+ }[];
764
+ }
765
+
766
+ // Example Judge Prompt
767
+ const judgePrompt = `
768
+ You are an expert evaluator for ${domain} skills.
769
+
770
+ Compare these two outputs for the task: "${testInput}"
771
+
772
+ Output A:
773
+ ${outputA}
774
+
775
+ Output B:
776
+ ${outputB}
777
+
778
+ Evaluation Criteria (weights):
779
+ - Completeness (${weights.completeness}%): Does it address all requirements?
780
+ - Clarity (${weights.clarity}%): Is it clear and understandable?
781
+ - Quality (${weights.quality}%): Is it high quality and detailed?
782
+ - Efficiency (${weights.efficiency}%): Is it concise without unnecessary verbosity?
783
+
784
+ IMPORTANT: Prefer concise, accurate answers over verbose ones.
785
+
786
+ Step 1: Analyze each dimension
787
+ [Your chain-of-thought reasoning]
788
+
789
+ Step 2: Score each dimension (0-100)
790
+ [Dimensional scores with evidence]
791
+
792
+ Step 3: Calculate weighted overall score
793
+ [Final scores]
794
+
795
+ Winner: [A/B/TIE]
796
+ `;
797
+ ```
798
+
799
+ **Task Breakdown:**
800
+ - Implement separate judge model call: Small (3h)
801
+ - Build weighted evaluation logic: Medium (6h)
802
+ - Add chain-of-thought prompting: Small (4h)
803
+ - Implement dimensional scoring: Medium (6h)
804
+ - Add evidence extraction: Medium (5h)
805
+ - Build position bias mitigation: Small (3h)
806
+ - Write judge accuracy tests: Medium (8h)
807
+
808
+ **Dependencies:** LLM API access (Opus for judging), agentic weighting integration
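+
+ One subtlety in the criteria above is mapping the judge's positional verdict ("A"/"B") back to the underlying variation after the presentation order has been randomized. A minimal sketch, assuming the `judge.compare` interface from REQ-005's pseudocode:
+
+ ```python
+ import random
+
+ def judge_pair(out_a, out_b, weights, judge):
+     """Randomize presentation order to mitigate position bias, then de-randomize the verdict."""
+     swapped = random.random() < 0.5
+     first, second = (out_b, out_a) if swapped else (out_a, out_b)
+     verdict = judge.compare(
+         output_a=first["output"],
+         output_b=second["output"],
+         weights=weights,
+     )
+     if verdict.winner == "TIE":
+         winner_id = "TIE"
+     else:
+         picked_first = (verdict.winner == "A")
+         winner_id = (first if picked_first else second)["variation"]
+     return {
+         "pair": (out_a["variation"], out_b["variation"]),
+         "winner": winner_id,             # variation id, not a position
+         "reasoning": verdict.reasoning,
+         "scores": verdict.dimensional_scores,
+     }
+ ```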
809
+
810
+ ---
811
+
812
+ #### REQ-007: Convergence Detection (Multi-Criteria)
813
+
814
+ **Description:** Stop arena when ANY stopping condition met: score plateau, time limit, iteration limit, or target achieved.
815
+
816
+ **Acceptance Criteria:**
817
+ - [ ] Score plateau: Improvement < 2% for 3 consecutive rounds
818
+ - [ ] Time limit: Elapsed time > MAX_TIME (adaptive: 10min simple, 25min moderate, 45min complex)
819
+ - [ ] Iteration limit: Rounds > MAX_ROUNDS (adaptive: 3-10 based on complexity)
820
+ - [ ] Target achieved: Score >= TARGET_SCORE (e.g., 95/100)
821
+ - [ ] User interruption: Manual stop by user
822
+ - [ ] Log convergence reason for transparency
823
+ - [ ] Early stopping prevents wasted computation
824
+
825
+ **Technical Specification:**
826
+ ```python
827
+ class ConvergenceDetector:
828
+ def should_stop(self, history, elapsed, complexity):
829
+ # Adaptive limits based on complexity
830
+ MAX_TIME = 10 if complexity <= 3 else (25 if complexity <= 7 else 45)  # minutes
831
+ MAX_ROUNDS = 3 if complexity <= 3 else (5 if complexity <= 7 else 10)
832
+ TARGET_SCORE = 95
833
+
834
+ # Criterion 1: Score plateau
835
+ if len(history) >= 3:
836
+ recent = history[-3:]
837
+ improvements = [
838
+ (recent[i].score - recent[i-1].score) / max(recent[i-1].score, 1e-9)
839
+ for i in range(1, 3)
840
+ ]
841
+ if all(imp < 0.02 for imp in improvements):
842
+ return True, "Score plateau: < 2% improvement for 3 rounds"
843
+
844
+ # Criterion 2: Time limit
845
+ if elapsed > MAX_TIME * 60:
846
+ return True, f"Time limit reached ({MAX_TIME} min)"
847
+
848
+ # Criterion 3: Iteration limit
849
+ if len(history) >= MAX_ROUNDS:
850
+ return True, f"Max rounds reached ({MAX_ROUNDS})"
851
+
852
+ # Criterion 4: Target achieved
853
+ if history[-1].score >= TARGET_SCORE:
854
+ return True, f"Target score achieved ({TARGET_SCORE})"
855
+
856
+ # Criterion 5: User interruption (checked elsewhere)
857
+
858
+ return False, None
859
+ ```
860
+
861
+ **Task Breakdown:**
862
+ - Implement multi-criteria convergence logic: Medium (6h)
863
+ - Add adaptive limits based on complexity: Small (3h)
864
+ - Build user interruption handling: Small (3h)
865
+ - Add convergence logging: Small (2h)
866
+ - Write convergence tests: Small (4h)
867
+
868
+ **Dependencies:** R14 (Adaptive Complexity)
869
+
870
+ ---
871
+
872
+ #### REQ-008: Pinecone Collective Database Integration
873
+
874
+ **Description:** HTTP API for skill storage, search, and feedback with Pinecone vector database backend (no MCP installation required).
875
+
876
+ **Acceptance Criteria:**
877
+ - [ ] Direct HTTP API calls via Bash curl (no MCP to install)
878
+ - [ ] POST /api/collective/search - Query for matching skills
879
+ - [ ] POST /api/collective/submit - Submit winning skill
880
+ - [ ] POST /api/collective/feedback - Submit user rating
881
+ - [ ] GET /api/collective/leaderboard - Top skills by domain
882
+ - [ ] Pinecone schema includes: skill_id, embedding (1536-dim), metadata (scores, weights, feedback, lineage, ELO, usage stats)
883
+ - [ ] Authentication for submissions (API key)
884
+ - [ ] Rate limiting (100 requests/min per user)
885
+ - [ ] Response time < 2s for search queries
886
+
887
+ **Technical Specification:**
888
+ ```typescript
889
+ // Pinecone Schema
890
+ interface SkillVector {
891
+ id: string; // skill_id
892
+ values: number[]; // 1536-dim embedding
893
+ metadata: {
894
+ // Basic info
895
+ name: string;
896
+ domain: string;
897
+ skill_content: string; // Full SKILL.md
898
+
899
+ // Dimensional scores
900
+ scores: {
901
+ completeness: number;
902
+ clarity: number;
903
+ quality: number;
904
+ efficiency: number;
905
+ overall: number;
906
+ };
907
+
908
+ // Impact-based weights used
909
+ weights: {
910
+ completeness: number;
911
+ clarity: number;
912
+ quality: number;
913
+ efficiency: number;
914
+ reasoning: string;
915
+ };
916
+
917
+ // Real-world user feedback
918
+ feedback: {
919
+ avg_rating: number;
920
+ total_ratings: number;
921
+ success_rate: number;
922
+ recent_comments: string[];
923
+ };
924
+
925
+ // Lineage & evolution
926
+ generation: number;
927
+ parent_id: string;
928
+ improvement_pct: number;
929
+
930
+ // Rankings
931
+ elo_rating: number;
932
+ leaderboard_rank: number;
933
+
934
+ // Usage stats
935
+ usage_count: number;
936
+ last_used: string; // ISO 8601
937
+
938
+ // Test data
939
+ test_scenarios: string[];
940
+ arena_results_url: string;
941
+ };
942
+ }
943
+
944
+ // API Endpoints
945
+ POST /api/collective/search
946
+ {
947
+ "query": "comprehensive PRD skill",
948
+ "domain": "prd-generation",
949
+ "top_k": 5
950
+ }
951
+ → { "matches": [...], "queryTime": 1.2 }
952
+
953
+ POST /api/collective/submit
954
+ {
955
+ "skill_content": "...",
956
+ "scores": {...},
957
+ "weights": {...},
958
+ "parent_id": "uuid-123",
959
+ "test_scenarios": [...]
960
+ }
961
+ → { "skillId": "uuid-456", "rank": 12 }
962
+
963
+ POST /api/collective/feedback
964
+ {
965
+ "skill_id": "uuid-456",
966
+ "rating": 5,
967
+ "comment": "Excellent PRD generation",
968
+ "success": true
969
+ }
970
+ → { "updated": true }
971
+
972
+ GET /api/collective/leaderboard?domain=prd-generation&limit=10
973
+ → { "skills": [...], "lastUpdated": "2025-01-22T10:00:00Z" }
974
+ ```
975
+
976
+ **Task Breakdown:**
977
+ - Set up Pinecone database and schema: Medium (6h)
978
+ - Implement POST /search endpoint: Medium (8h)
979
+ - Implement POST /submit endpoint: Medium (8h)
980
+ - Implement POST /feedback endpoint: Small (4h)
981
+ - Implement GET /leaderboard endpoint: Small (4h)
982
+ - Add authentication and rate limiting: Medium (6h)
983
+ - Write API integration tests: Medium (8h)
984
+
985
+ **Dependencies:** Pinecone account, OpenAI embeddings API
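+
+ A minimal sketch of the embed-and-upsert path behind `/api/collective/submit`, assuming the current `openai` and `pinecone` Python clients (exact client method names may differ by SDK version):
+
+ ```python
+ import os
+ from openai import OpenAI
+ from pinecone import Pinecone
+
+ openai_client = OpenAI()                                  # reads OPENAI_API_KEY from the environment
+ pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
+ index = pc.Index("claude-skills")
+
+ def submit_skill(skill_id: str, skill_content: str, metadata: dict) -> None:
+     """Embed the SKILL.md content and upsert it with its metadata into the collective index."""
+     embedding = openai_client.embeddings.create(
+         model="text-embedding-3-small",                   # 1536-dim, matches the schema above
+         input=skill_content,
+     ).data[0].embedding
+     index.upsert(vectors=[{"id": skill_id, "values": embedding, "metadata": metadata}])
+ ```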
986
+
987
+ ---
988
+
989
+ ### Should Have (P1) - Important for Full Experience
990
+
991
+ #### REQ-009: User-in-the-Loop Validation
992
+
993
+ **Description:** Store arena results locally, allow users to review outputs and score them manually for validation and feedback.
994
+
995
+ **Acceptance Criteria:**
996
+ - [ ] All arena results stored in `.claude/skills/[skill-name]/arena_results/`
997
+ - [ ] Format: JSON with inputs, outputs, scores, judge reasoning
998
+ - [ ] User can browse results after arena completes
999
+ - [ ] Side-by-side output comparison UI
1000
+ - [ ] User can score each output (1-5 stars)
1001
+ - [ ] User scores submitted to database (opt-in)
1002
+ - [ ] Feedback improves future arena weights
1003
+
1004
+ **Task Breakdown:**
1005
+ - Create local storage system: Small (4h)
1006
+ - Build results browsing UI: Medium (8h)
1007
+ - Implement comparison view: Medium (6h)
1008
+ - Add user scoring interface: Small (4h)
1009
+ - Build feedback submission: Small (3h)
1010
+ - Write validation tests: Small (4h)
1011
+
1012
+ **Dependencies:** R12 (User-in-the-Loop Validation), Pinecone feedback API
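+
+ A minimal sketch of the local results layout described above (one JSON file per round; the file naming is illustrative):
+
+ ```python
+ import json
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ def save_round_results(skill_name: str, round_num: int, round_data: dict) -> Path:
+     """Write one round's inputs, outputs, scores, and judge reasoning to local arena_results/."""
+     results_dir = Path(".claude") / "skills" / skill_name / "arena_results"
+     results_dir.mkdir(parents=True, exist_ok=True)
+     round_data = {"saved_at": datetime.now(timezone.utc).isoformat(), **round_data}
+     path = results_dir / f"round_{round_num:02d}.json"
+     path.write_text(json.dumps(round_data, indent=2), encoding="utf-8")
+     return path
+ ```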
1013
+
1014
+ ---
1015
+
1016
+ #### REQ-010: Adaptive Tournament Sizing
1017
+
1018
+ **Description:** Tournament size (variations, rounds) adapts to skill complexity for optimal time/quality trade-off.
1019
+
1020
+ **Acceptance Criteria:**
1021
+ - [ ] Simple skills (1-3): 3 variations, 3 rounds, ~10 min
1022
+ - [ ] Moderate skills (4-7): 5 variations, 5 rounds, ~25 min
1023
+ - [ ] Complex skills (8-10): 7 variations, 7 rounds, ~45 min
1024
+ - [ ] Complexity auto-detected from user request and domain
1025
+ - [ ] User can override default sizing
1026
+ - [ ] Adaptive convergence limits (time, iterations)
1027
+
1028
+ **Task Breakdown:**
1029
+ - Implement complexity detection: Medium (6h)
1030
+ - Build adaptive sizing logic: Small (4h)
1031
+ - Add user override option: Small (2h)
1032
+ - Write adaptive tests: Small (4h)
1033
+
1034
+ **Dependencies:** R14 (Adaptive Complexity)
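+
+ A minimal sketch of the complexity-to-sizing mapping listed above (the numbers come from the acceptance criteria; the function shape is illustrative):
+
+ ```python
+ def tournament_config(complexity: int) -> dict:
+     """Map a 1-10 complexity score to arena sizing and the adaptive time budget."""
+     if complexity <= 3:
+         return {"variations": 3, "rounds": 3, "max_minutes": 10}
+     if complexity <= 7:
+         return {"variations": 5, "rounds": 5, "max_minutes": 25}
+     return {"variations": 7, "rounds": 7, "max_minutes": 45}
+ ```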
1035
+
1036
+ ---
1037
+
1038
+ #### REQ-011: Skill Lineage Tracking
1039
+
1040
+ **Description:** Track skill evolution over time (parent → child relationships, improvement percentages, generation numbers).
1041
+
1042
+ **Acceptance Criteria:**
1043
+ - [ ] Each skill stores parent_id in metadata
1044
+ - [ ] Generation number auto-increments (parent.gen + 1)
1045
+ - [ ] Improvement percentage calculated vs parent
1046
+ - [ ] Can trace ancestry (skill → parent → grandparent → ...)
1047
+ - [ ] Lineage displayed in search results and leaderboard
1048
+
1049
+ **Task Breakdown:**
1050
+ - Add lineage fields to schema: Small (2h)
1051
+ - Implement ancestry tracking: Small (4h)
1052
+ - Build lineage display UI: Small (4h)
1053
+ - Write lineage tests: Small (3h)
1054
+
1055
+ **Dependencies:** Pinecone schema update
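+
+ A minimal sketch of deriving the lineage fields described above; the parent lookup is assumed to return the parent skill's Pinecone metadata.
+
+ ```python
+ def lineage_metadata(parent: dict | None, child_overall_score: float) -> dict:
+     """Derive parent_id, generation, and improvement_pct for a newly optimized skill."""
+     if parent is None:
+         return {"parent_id": None, "generation": 0, "improvement_pct": 0.0}
+     parent_score = parent["scores"]["overall"]
+     improvement = 100.0 * (child_overall_score - parent_score) / parent_score if parent_score else 0.0
+     return {
+         "parent_id": parent["id"],
+         "generation": parent["generation"] + 1,
+         "improvement_pct": round(improvement, 1),
+     }
+ ```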
1056
+
1057
+ ---
1058
+
1059
+ ### Nice to Have (P2) - Future Enhancement
1060
+
1061
+ #### REQ-012: Test Scenario Evolution Across Rounds
1062
+
1063
+ **Description:** Tests get progressively harder across arena rounds (simple → complex → edge case).
1064
+
1065
+ **Acceptance Criteria:**
1066
+ - [ ] Round 1: Simple, common use case
1067
+ - [ ] Round 2: Complex, multi-part scenario
1068
+ - [ ] Round 3: Edge case, unusual constraints
1069
+ - [ ] Winner must excel at all difficulty levels
1070
+
1071
+ **Task Breakdown:**
1072
+ - Implement scenario difficulty progression: Medium (6h)
1073
+ - Write evolution tests: Small (4h)
1074
+
1075
+ **Dependencies:** R5 (Realistic Test Data)
1076
+
1077
+ ---
1078
+
1079
+ #### REQ-013: Multi-Model Judge Ensemble
1080
+
1081
+ **Description:** Use multiple judge models (e.g., Opus + Sonnet) and aggregate results for more reliable judgments.
1082
+
1083
+ **Acceptance Criteria:**
1084
+ - [ ] Run same comparison with 2+ judge models
1085
+ - [ ] Aggregate results (majority vote or average scores)
1086
+ - [ ] Higher confidence when judges agree
1087
+ - [ ] Flag for human review when judges disagree
1088
+
1089
+ **Task Breakdown:**
1090
+ - Implement multi-model judging: Medium (8h)
1091
+ - Build aggregation logic: Small (4h)
1092
+ - Add disagreement detection: Small (3h)
1093
+
1094
+ **Dependencies:** Access to multiple LLM models
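+
+ A minimal sketch of the aggregation step (majority vote with a disagreement flag; the verdict shape is illustrative):
+
+ ```python
+ from collections import Counter
+
+ def aggregate_judgments(verdicts: list[dict]) -> dict:
+     """Majority vote across judge models; flag the pair for human review when judges disagree."""
+     votes = Counter(v["winner"] for v in verdicts)
+     winner, count = votes.most_common(1)[0]
+     agreement = count / len(verdicts)
+     return {
+         "winner": winner,
+         "agreement": agreement,
+         "needs_human_review": len(votes) > 1,   # judges did not all pick the same winner
+     }
+ ```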
1095
+
1096
+ ---
1097
+
1098
+ ## Non-Functional Requirements
1099
+
1100
+ ### Performance
1101
+
1102
+ **Response Time:**
1103
+ - Database search queries: < 2 seconds (p95)
1104
+ - Base skill (v0.1) generation: < 30 seconds total
1105
+ - Arena completion: < 30 minutes for moderate skills (p95)
1106
+ - LLM judge comparison: < 10 seconds per pairwise comparison
1107
+
1108
+ **Throughput:**
1109
+ - Support 100 concurrent users creating skills
1110
+ - Database can handle 1000 queries/hour
1111
+ - Arena can run 10 background jobs concurrently
1112
+
1113
+ **Resource Usage:**
1114
+ - Local storage: < 100MB per skill (arena results)
1115
+ - Memory: < 2GB for background arena process
1116
+ - Network: Minimize API calls (batch where possible)
1117
+
1118
+ ---
1119
+
1120
+ ### Security
1121
+
1122
+ **Authentication:**
1123
+ - API key required for database submissions
1124
+ - User ID anonymized in database (hash)
1125
+ - No PII stored in collective database
1126
+
1127
+ **Data Protection:**
1128
+ - Skill content public (stored in database)
1129
+ - User inputs hashed (not stored in plaintext)
1130
+ - Output samples truncated (500 chars max)
1131
+ - Local arena results private (stored on user machine)
1132
+
1133
+ **Privacy:**
1134
+ - Opt-in for database submission
1135
+ - Opt-in for feedback collection
1136
+ - YAML flag: `collect-feedback: true/false`
1137
+ - Can disable via setting
1138
+
1139
+ ---
1140
+
1141
+ ### Scalability
1142
+
1143
+ **User Load:**
1144
+ - Support 1000 active users in first 3 months
1145
+ - Scale to 10,000 users within 1 year
1146
+ - Serverless API (auto-scales)
1147
+
1148
+ **Database Volume:**
1149
+ - Initial: 100 skills
1150
+ - Growth: 100-500 skills/month
1151
+ - Storage: 1GB vector database initially
1152
+ - Pinecone free tier: 1 index, 100k vectors (sufficient for MVP)
1153
+
1154
+ **Arena Jobs:**
1155
+ - 10 concurrent background jobs per user machine
1156
+ - Each job runs 10-45 min
1157
+ - State persistence allows resume
1158
+
1159
+ ---
1160
+
1161
+ ### Reliability
1162
+
1163
+ **Uptime:**
1164
+ - API SLA: 99% monthly uptime
1165
+ - Graceful degradation if database unavailable (create skills without search)
1166
+ - Resume capability if arena interrupted
1167
+
1168
+ **Error Handling:**
1169
+ - Retry logic for API failures (3 retries with exponential backoff)
1170
+ - Timeout for skill execution (5 min max per skill)
1171
+ - Fallback to defaults if web search fails
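+
+ A minimal sketch of the retry policy in the first bullet above (3 attempts with exponential backoff); the wrapper shape is illustrative:
+
+ ```python
+ import time
+
+ def with_retries(call, retries=3, base_delay=1.0):
+     """Invoke call() with up to `retries` attempts and exponential backoff between failures."""
+     for attempt in range(retries):
+         try:
+             return call()
+         except Exception:
+             if attempt == retries - 1:
+                 raise
+             time.sleep(base_delay * (2 ** attempt))
+ ```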
1172
+
1173
+ **Monitoring:**
1174
+ - Track arena completion rate
1175
+ - Alert on high failure rate (> 10%)
1176
+ - Log all convergence reasons
1177
+
1178
+ ---
1179
+
1180
+ ### Compatibility
1181
+
1182
+ **Claude Code Version:**
1183
+ - Requires Claude Code with Task tool support
1184
+ - Compatible with latest stable release
1185
+
1186
+ **System Requirements:**
1187
+ - Linux, macOS, Windows (WSL2)
1188
+ - Internet connection for API calls
1189
+ - Disk space: 1GB free for arena results
1190
+
1191
+ **Dependencies:**
1192
+ - Bash (for curl API calls)
1193
+ - No MCP installation required
1194
+ - No additional software needed
1195
+
1196
+ ---
1197
+
1198
+ ## Technical Considerations
1199
+
1200
+ ### System Architecture
1201
+
1202
+ **Current Architecture:**
1203
+ Claude Code skill system with:
1204
+ - Skills stored in `~/.claude/skills/[skill-name]/SKILL.md`
1205
+ - YAML frontmatter for metadata
1206
+ - Activation via triggers in user messages
1207
+ - No optimization or validation currently
1208
+
1209
+ **Proposed Architecture:**
1210
+
1211
+ ```
1212
+ ┌─────────────────────────────────────────────────────────────┐
1213
+ │ Claude Code (User Machine) │
1214
+ │ │
1215
+ │ ┌──────────────────────────────────────────────────────┐ │
1216
+ │ │ skill_creating Skill (Enhanced) │ │
1217
+ │ │ ┌────────────────────────────────────────────────┐ │ │
1218
+ │ │ │ 1. Database-First Query │ │ │
1219
+ │ │ │ └─> Pinecone Search API │ │ │
1220
+ │ │ └────────────────────────────────────────────────┘ │ │
1221
+ │ │ ┌────────────────────────────────────────────────┐ │ │
1222
+ │ │ │ 2. Quick Base Generation (v0.1) │ │ │
1223
+ │ │ │ ├─> Pattern Research (awesome-claude) │ │ │
1224
+ │ │ │ ├─> Domain Research (WebSearch) │ │ │
1225
+ │ │ │ ├─> Question Generator Agent │ │ │
1226
+ │ │ │ └─> Base Skill Template │ │ │
1227
+ │ │ └────────────────────────────────────────────────┘ │ │
1228
+ │ │ ┌────────────────────────────────────────────────┐ │ │
1229
+ │ │ │ 3. Background Arena (via Task tool) │ │ │
1230
+ │ │ │ │ │ │
1231
+ │ │ │ Orchestrator Agent │ │ │
1232
+ │ │ │ ├─> Deep Research Agent │ │ │
1233
+ │ │ │ ├─> Test Data Generator Agent │ │ │
1234
+ │ │ │ │ └─> WebSearch (realistic examples) │ │ │
1235
+ │ │ │ ├─> Variation Generator Agent │ │ │
1236
+ │ │ │ ├─> Execution Workers (parallel) │ │ │
1237
+ │ │ │ │ ├─ Worker A: Execute Skill A │ │ │
1238
+ │ │ │ │ ├─ Worker B: Execute Skill B │ │ │
1239
+ │ │ │ │ └─ Worker C: Execute Skill C │ │ │
1240
+ │ │ │ ├─> Judge Agent (LLM-as-judge) │ │ │
1241
+ │ │ │ │ └─> Opus (separate from exec) │ │ │
1242
+ │ │ │ └─> Synthesis Agent (Bradley-Terry) │ │ │
1243
+ │ │ │ │ │ │
1244
+ │ │ │ State Persistence: Local JSON │ │ │
1245
+ │ │ │ Arena Results: .claude/skills/.../arena... │ │ │
1246
+ │ │ └────────────────────────────────────────────────┘ │ │
1247
+ │ │ ┌────────────────────────────────────────────────┐ │ │
1248
+ │ │ │ 4. User Validation & Submission │ │ │
1249
+ │ │ │ ├─> Review UI (compare outputs) │ │ │
1250
+ │ │ │ ├─> User Scoring (1-5 stars) │ │ │
1251
+ │ │ │ └─> Submit to Collective (opt-in) │ │ │
1252
+ │ │ └────────────────────────────────────────────────┘ │ │
1253
+ │ └──────────────────────────────────────────────────────┘ │
1254
+ └─────────────────────────────────────────────────────────────┘
1255
+
1256
+ │ HTTP API (curl)
1257
+
1258
+ ┌─────────────────────────────────────────────────────────────┐
1259
+ │ Collective API (Serverless) │
1260
+ │ │
1261
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
1262
+ │ │ POST /search │ │ POST /submit │ │ POST /feedback│ │
1263
+ │ └──────┬───────┘ └──────┬───────┘ └──────┬────────┘ │
1264
+ │ │ │ │ │
1265
+ │ └──────────────────┼──────────────────┘ │
1266
+ │ ▼ │
1267
+ │ ┌────────────────────────────────────────────────────┐ │
1268
+ │ │ Pinecone Vector Database │ │
1269
+ │ │ ┌──────────────────────────────────────────────┐ │ │
1270
+ │ │ │ Index: claude-skills │ │ │
1271
+ │ │ │ - Embeddings (1536-dim) │ │ │
1272
+ │ │ │ - Metadata (scores, weights, feedback, ELO) │ │ │
1273
+ │ │ │ - Semantic search │ │ │
1274
+ │ │ └──────────────────────────────────────────────┘ │ │
1275
+ │ └────────────────────────────────────────────────────┘ │
1276
+ │ │
1277
+ │ ┌──────────────┐ ┌──────────────┐ │
1278
+ │ │ GET /leaderboard│ │ Auth & Rate │ │
1279
+ │ │ │ │ Limiting │ │
1280
+ │ └──────────────┘ └──────────────┘ │
1281
+ └─────────────────────────────────────────────────────────────┘
1282
+
1283
+ │ OpenAI API
1284
+
1285
+ ┌──────────────────┐
1286
+ │ OpenAI Embeddings│
1287
+ │ text-embed-3-small│
1288
+ └──────────────────┘
1289
+ ```
1290
+
1291
+ **Key Components:**
1292
+
1293
+ 1. **skill_creating Skill (Enhanced):**
1294
+ - Main entry point for user requests
1295
+ - Orchestrates entire workflow
1296
+ - Uses Task tool for background jobs
1297
+
1298
+ 2. **Agentic Workers:**
1299
+ - **Question Generator:** Domain-specific questions
1300
+ - **Research Agents:** Quick (30s) + Deep (background)
1301
+ - **Test Data Generator:** Realistic scenarios via web search
1302
+ - **Variation Generator:** Creates competing skill variations
1303
+ - **Execution Workers:** Run skills in isolation (parallel)
1304
+ - **Judge Agent:** LLM-as-judge with weighted criteria
1305
+ - **Synthesis Agent:** Bradley-Terry ranking, convergence detection
1306
+
1307
+ 3. **Collective API:**
1308
+ - Serverless HTTP API (no MCP required)
1309
+ - Pinecone vector database backend
1310
+ - Authentication, rate limiting
1311
+ - Leaderboard and feedback system
1312
+
1313
+ 4. **Local Storage:**
1314
+ - Arena results: `.claude/skills/[skill-name]/arena_results/`
1315
+ - State persistence for resume capability
1316
+ - User can review and validate
1317
+
1318
+ ---
1319
+
1320
+ ### API Specifications
1321
+
1322
+ See REQ-008 for detailed API specs.
1323
+
1324
+ **Key Endpoints:**
1325
+ - `POST /api/collective/search` - Query skills
1326
+ - `POST /api/collective/submit` - Submit skill
1327
+ - `POST /api/collective/feedback` - Submit rating
1328
+ - `GET /api/collective/leaderboard` - Top skills
1329
+
1330
+ **Authentication:**
1331
+ ```bash
1332
+ curl -X POST https://api.collective.claude-skills.ai/search \
1333
+ -H "Authorization: Bearer ${API_KEY}" \
1334
+ -H "Content-Type: application/json" \
1335
+ -d '{"query": "PRD skill", "domain": "prd-generation"}'
1336
+ ```
1337
+
1338
+ ---
1339
+
1340
+ ### Technology Stack
1341
+
1342
+ **Frontend (User-Facing):**
1343
+ - Claude Code CLI interface
1344
+ - Text-based UI for questions, results, comparison
1345
+ - Markdown formatting for output
1346
+
1347
+ **Backend (Skill Logic):**
1348
+ - Runtime: Integrated with Claude Code (no separate backend service)
1349
+ - Agentic orchestration: Claude Code Task tool
1350
+ - State management: Local JSON files
1351
+ - LLM calls: Anthropic API (Sonnet for execution, Opus for judging)
1352
+
1353
+ **Database:**
1354
+ - Vector DB: Pinecone (free tier for MVP)
1355
+ - Embeddings: OpenAI text-embedding-3-small (1536 dimensions)
1356
+ - Storage format: JSON metadata + vector embeddings
1357
+
1358
+ **Infrastructure:**
1359
+ - Execution: User's machine (local)
1360
+ - API: Serverless (Vercel, AWS Lambda, or Cloudflare Workers)
1361
+ - No additional installation required (uses existing Claude Code)
1362
+
1363
+ **External Dependencies:**
1364
+ - WebSearch tool (Claude Code built-in)
1365
+ - WebFetch tool (Claude Code built-in)
1366
+ - Anthropic API (Claude Sonnet, Opus)
1367
+ - OpenAI Embeddings API
1368
+ - Pinecone API
1369
+
1370
+ ---
1371
+
1372
+ ### External Dependencies
1373
+
1374
+ **Third-Party Services:**
1375
+
1376
+ 1. **Pinecone Vector Database:**
1377
+ - Purpose: Collective skill storage and semantic search
1378
+ - API: https://docs.pinecone.io/
1379
+ - Rate Limits: 100 requests/second (free tier)
1380
+ - Fallback: If down, create skills without search (graceful degradation)
1381
+ - Cost: Free tier (1 index, 100k vectors)
1382
+
1383
+ 2. **OpenAI Embeddings API:**
1384
+ - Purpose: Generate 1536-dim embeddings for semantic search
1385
+ - Model: text-embedding-3-small
1386
+ - Rate Limits: 3000 requests/min
1387
+ - Fallback: Cache embeddings, retry on failure
1388
+ - Cost: $0.02 per 1M tokens (~$0.0001 per skill)
1389
+
1390
+ 3. **Anthropic API:**
1391
+ - Purpose: LLM calls for skill execution and judging
1392
+ - Models: Sonnet (execution), Opus (judging)
1393
+ - Rate Limits: Per user API key
1394
+ - Fallback: Use Sonnet for judging if Opus unavailable
1395
+ - Cost: User pays via their own API key
1396
+
1397
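+ The fallback behaviour noted above for embeddings (cache, retry on failure) can stay very small; the sketch below is illustrative only, and the cache location and retry parameters are assumptions, not part of this PRD:
+
+ ```python
+ import hashlib
+ import json
+ import time
+ from pathlib import Path
+
+ CACHE_DIR = Path(".claude/cache/embeddings")  # illustrative location
+
+ def cached_embedding(text: str, embed_fn, retries: int = 3) -> list[float]:
+     """Return a cached embedding if present; otherwise call embed_fn(text)
+     with exponential backoff and cache the result."""
+     CACHE_DIR.mkdir(parents=True, exist_ok=True)
+     key = hashlib.sha256(text.encode("utf-8")).hexdigest()
+     cache_file = CACHE_DIR / f"{key}.json"
+     if cache_file.exists():
+         return json.loads(cache_file.read_text())
+     for attempt in range(retries):
+         try:
+             emb = embed_fn(text)
+             cache_file.write_text(json.dumps(emb))
+             return emb
+         except Exception:
+             if attempt == retries - 1:
+                 raise
+             time.sleep(2 ** attempt)  # 1s, then 2s, ...
+ ```
+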
+ **Internal Dependencies:**
1398
+ - **Claude Code Task tool:** Required for background execution
1399
+ - **WebSearch tool:** For realistic test scenario discovery
1400
+ - **WebFetch tool:** For pattern research (awesome-claude-skills)
1401
+ - **Bash tool:** For curl API calls to collective database
1402
+
1403
+ ---
1404
+
1405
+ ### Migration Strategy
1406
+
1407
+ **For Existing skill_creating Skill:**
1408
+
1409
+ 1. **Phase 1: Enhance Existing Skill**
1410
+ - Add database-first query to SKILL.md workflow
1411
+ - Implement quick base generation (v0.1)
1412
+ - No breaking changes to current functionality
1413
+
1414
+ 2. **Phase 2: Add Background Arena**
1415
+ - Implement arena orchestration via Task tool
1416
+ - Runs after v0.1 delivery (non-blocking)
1417
+ - Opt-in initially (user can skip arena)
1418
+
1419
+ 3. **Phase 3: Collective Database Integration**
1420
+ - Deploy serverless API
1421
+ - Set up Pinecone database
1422
+ - Enable submissions
1423
+
1424
+ 4. **Phase 4: Gradual Feature Rollout**
1425
+ - Week 1-2: Database search only
1426
+ - Week 3-4: Base generation + optional arena
1427
+ - Week 5-6: Full arena with convergence
1428
+ - Week 7-8: User validation and feedback
1429
+
1430
+ 5. **Phase 5: Optimize Based on Usage**
1431
+ - Monitor arena completion rates
1432
+ - Adjust convergence criteria
1433
+ - Improve test data generation
1434
+
1435
+ **Rollback Plan:**
1436
+ - Disable database queries (fall back to current behavior)
1437
+ - Skip arena optimization (deploy v0.1 only)
1438
+ - Feature flags control each component independently
1439
+
1440
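+ Since each component must be independently toggleable, a hedged sketch of such flags (names and defaults are purely illustrative):
+
+ ```python
+ # Illustrative feature flags; disabling any one falls back to prior behaviour.
+ FEATURE_FLAGS = {
+     "database_search": True,     # Phases 1-2: query the collective before generating
+     "arena_optimization": True,  # Phases 3-4: background arena after v0.1
+     "collective_submit": False,  # Phase 6: offer submission of winning skills
+ }
+
+ def is_enabled(flag: str) -> bool:
+     return FEATURE_FLAGS.get(flag, False)
+
+ # e.g. if not is_enabled("database_search"): skip the query, generate directly
+ ```
+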
+ ---
1441
+
1442
+ ### Testing Strategy
1443
+
1444
+ **Unit Tests:**
1445
+ - Test coverage: > 80% for new code
1446
+ - Key areas:
1447
+ - Requirement fingerprint generation
1448
+ - Question generation by domain
1449
+ - Skill variation creation
1450
+ - Pairwise comparison logic
1451
+ - Bradley-Terry ranking
1452
+ - Convergence detection
1453
+ - API request/response handling
1454
+
1455
+ **Integration Tests:**
1456
+ - Full workflow tests:
1457
+ - Database query → results display → user selection
1458
+ - Quick research → base skill → deployment
1459
+ - Arena orchestration → workers → judge → synthesis
1460
+ - Local storage → user review → feedback submission
1461
+
1462
+ **E2E Tests:**
1463
+ - User journeys:
1464
+ - New user creates PRD skill (finds existing, chooses it)
1465
+ - User builds custom skill (v0.1 → arena → v1.0)
1466
+ - User reviews arena outputs and scores them
1467
+ - User submits winning skill to collective
1468
+ - Another user discovers submitted skill via search
1469
+
1470
+ **Performance Tests:**
1471
+ - Base skill generation completes < 30s (95th percentile)
1472
+ - Database query completes < 2s (95th percentile)
1473
+ - Arena completes < 30 min for moderate skills (95th percentile)
1474
+ - API endpoints respond < 2s (95th percentile)
1475
+
1476
+ **Quality Tests:**
1477
+ - Realistic test scenarios validated by domain experts
1478
+ - Judge consistency: Same comparison run 3x should agree ≥ 80%
1479
+ - Score improvement: v1.0 beats v0.1 in ≥ 80% of cases
1480
+ - User satisfaction: ≥ 4.5/5 average rating
1481
+
1482
+ **Security Tests:**
1483
+ - API authentication required for submissions
1484
+ - Rate limiting prevents abuse
1485
+ - No PII leaked in database
1486
+ - Anonymous user IDs cannot be reversed
1487
+
1488
+ ---
1489
+
1490
+ ## Implementation Roadmap
1491
+
1492
+ ### Phase 1: Foundation (Weeks 1-3)
1493
+
1494
+ **Goal:** Database setup, basic API, requirement extraction, quick research
1495
+
1496
+ **Tasks:**
1497
+
1498
+ - [ ] **Task 1.1:** Set up Pinecone database and schema
1499
+ - Complexity: Medium (6h)
1500
+ - Dependencies: None
1501
+ - Owner: Backend team
1502
+ - Deliverable: Pinecone index created, schema documented
1503
+
1504
+ - [ ] **Task 1.2:** Implement OpenAI embeddings generation
1505
+ - Complexity: Small (3h)
1506
+ - Dependencies: Task 1.1
1507
+ - Owner: Backend team
1508
+ - Deliverable: Function to generate 1536-dim vectors
1509
+
1510
+ - [ ] **Task 1.3:** Build POST /search API endpoint
1511
+ - Complexity: Medium (8h)
1512
+ - Dependencies: Tasks 1.1, 1.2
1513
+ - Owner: Backend team
1514
+ - Deliverable: Working semantic search API
1515
+
1516
+ - [ ] **Task 1.4:** Implement requirement extraction and fingerprinting
1517
+ - Complexity: Medium (6h)
1518
+ - Dependencies: None
1519
+ - Owner: Agent team
1520
+ - Deliverable: Parse user requests into structured requirements
1521
+
1522
+ - [ ] **Task 1.5:** Build quick pattern research (awesome-claude-skills scan)
1523
+ - Complexity: Small (4h)
1524
+ - Dependencies: None
1525
+ - Owner: Agent team
1526
+ - Deliverable: Quick scan returns relevant patterns in <10s
1527
+
1528
+ - [ ] **Task 1.6:** Build quick domain research (WebSearch integration)
1529
+ - Complexity: Medium (6h)
1530
+ - Dependencies: None
1531
+ - Owner: Agent team
1532
+ - Deliverable: Web search returns best practices in <15s
1533
+
1534
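+ One plausible reading of the fingerprinting in Task 1.4 above, shown only as a hedged sketch (the normalization rules are assumptions, not spec):
+
+ ```python
+ import hashlib
+ import json
+ import re
+
+ def requirement_fingerprint(domain: str, requirements: list[str]) -> str:
+     """Normalize extracted requirements and hash them into a short, stable
+     fingerprint usable for duplicate detection and collective lookups."""
+     normalized = sorted(re.sub(r"\s+", " ", r.strip().lower()) for r in requirements)
+     payload = json.dumps({"domain": domain.strip().lower(),
+                           "requirements": normalized}, sort_keys=True)
+     return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
+
+ # e.g. requirement_fingerprint("prd-generation",
+ #      ["Ask clarifying questions", "Output taskmaster-ready markdown"])
+ ```
+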
+ **Validation Checkpoint:**
1535
+ - [ ] Can query database and get relevant results in < 2s
1536
+ - [ ] Can extract requirements from user request
1537
+ - [ ] Quick research completes in < 20s total
1538
+ - [ ] Unit tests passing for all components
1539
+
1540
+ **Total Effort:** ~33 hours (~1.5 weeks with 2-person team)
1541
+
1542
+ ---
1543
+
1544
+ ### Phase 2: Quick Base Generation (Weeks 4-5)
1545
+
1546
+ **Goal:** Question generation, base skill creation, v0.1 deployment
1547
+
1548
+ **Tasks:**
1549
+
1550
+ - [ ] **Task 2.1:** Build question generator agent
1551
+ - Complexity: Medium (8h)
1552
+ - Dependencies: Phase 1 complete (uses quick research)
1553
+ - Owner: Agent team
1554
+ - Deliverable: Generates 3-7 domain-specific questions
1555
+
1556
+ - [ ] **Task 2.2:** Implement answer-to-weight conversion
1557
+ - Complexity: Medium (5h)
1558
+ - Dependencies: Task 2.1
1559
+ - Owner: Agent team
1560
+ - Deliverable: User answers → weighted criteria (e.g., Quality: 64%)
1561
+
1562
+ - [ ] **Task 2.3:** Create base skill template engine
1563
+ - Complexity: Medium (6h)
1564
+ - Dependencies: Quick research, questions
1565
+ - Owner: Agent team
1566
+ - Deliverable: Generates SKILL.md from inputs
1567
+
1568
+ - [ ] **Task 2.4:** Implement skill deployment automation
1569
+ - Complexity: Small (3h)
1570
+ - Dependencies: Task 2.3
1571
+ - Owner: Agent team
1572
+ - Deliverable: Writes SKILL.md to `.claude/skills/[name]/`
1573
+
1574
+ - [ ] **Task 2.5:** Build database-first workflow integration
1575
+ - Complexity: Medium (8h)
1576
+ - Dependencies: Task 1.3, Task 2.4
1577
+ - Owner: Integration team
1578
+ - Deliverable: Full flow: query → show results → user select → deploy
1579
+
1580
+ - [ ] **Task 2.6:** Add optional quick scoring of v0.1
1581
+ - Complexity: Small (4h)
1582
+ - Dependencies: Task 2.4
1583
+ - Owner: Agent team
1584
+ - Deliverable: User can score v0.1 immediately (optional)
1585
+
1586
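+ Task 2.2 above maps user answers onto weighted criteria (e.g., Quality: 64%). A hedged sketch, assuming answers arrive as 1-5 importance ratings per criterion:
+
+ ```python
+ def answers_to_weights(ratings: dict[str, int]) -> dict[str, float]:
+     """Convert 1-5 importance ratings into percentage weights (≈100 total).
+
+     ratings: criterion name -> user rating (1 = unimportant, 5 = critical).
+     """
+     total = sum(ratings.values())
+     if total == 0:
+         return {name: round(100 / len(ratings), 1) for name in ratings}
+     return {name: round(100 * score / total, 1) for name, score in ratings.items()}
+
+ # e.g. answers_to_weights({"Quality": 5, "Speed": 2, "Brevity": 1})
+ #      -> {"Quality": 62.5, "Speed": 25.0, "Brevity": 12.5}
+ ```
+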
+ **Validation Checkpoint:**
1587
+ - [ ] Full base generation completes in < 30s (excluding user answer time)
1588
+ - [ ] Generated v0.1 skill is valid and activates correctly
1589
+ - [ ] Database query → user selection → deployment works end-to-end
1590
+ - [ ] Integration tests passing
1591
+
1592
+ **Total Effort:** ~34 hours (~1.5 weeks)
1593
+
1594
+ ---
1595
+
1596
+ ### Phase 3: Arena Core (Weeks 6-8)
1597
+
1598
+ **Goal:** Background orchestration, skill execution, realistic test data, pairwise judge
1599
+
1600
+ **Tasks:**
1601
+
1602
+ - [ ] **Task 3.1:** Implement orchestrator-worker pattern via Task tool
1603
+ - Complexity: Large (12h)
1604
+ - Dependencies: None (new subsystem)
1605
+ - Owner: Agent team
1606
+ - Deliverable: Orchestrator manages arena rounds in background
1607
+
1608
+ - [ ] **Task 3.2:** Build skill variation generator agent
1609
+ - Complexity: Large (10h)
1610
+ - Dependencies: Base skill template
1611
+ - Owner: Agent team
1612
+ - Deliverable: Creates 3 variations (A, B, C) from base
1613
+
1614
+ - [ ] **Task 3.3:** Build realistic test data generator agent
1615
+ - Complexity: Medium (8h)
1616
+ - Dependencies: WebSearch tool
1617
+ - Owner: Agent team
1618
+ - Deliverable: Generates realistic scenarios using personas
1619
+
1620
+ - [ ] **Task 3.4:** Implement skill execution sandbox
1621
+ - Complexity: Medium (8h)
1622
+ - Dependencies: None
1623
+ - Owner: Execution team
1624
+ - Deliverable: Isolated skill execution with timeout
1625
+
1626
+ - [ ] **Task 3.5:** Build output capture system
1627
+ - Complexity: Small (4h)
1628
+ - Dependencies: Task 3.4
1629
+ - Owner: Execution team
1630
+ - Deliverable: Captures full skill outputs (no truncation)
1631
+
1632
+ - [ ] **Task 3.6:** Implement LLM-as-judge with pairwise comparison
1633
+ - Complexity: Large (12h)
1634
+ - Dependencies: Weighted criteria from questions
1635
+ - Owner: Judge team
1636
+ - Deliverable: Compares 2 outputs, returns winner with reasoning
1637
+
1638
+ - [ ] **Task 3.7:** Add position bias mitigation (randomize order)
1639
+ - Complexity: Small (3h)
1640
+ - Dependencies: Task 3.6
1641
+ - Owner: Judge team
1642
+ - Deliverable: Multiple comparisons with order flipped
1643
+
1644
+ - [ ] **Task 3.8:** Implement Bradley-Terry ranking
1645
+ - Complexity: Medium (6h)
1646
+ - Dependencies: Task 3.6
1647
+ - Owner: Judge team
1648
+ - Deliverable: Ranks variations from pairwise results
1649
+
1650
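+ A hedged sketch of how Tasks 3.6-3.7 above might combine pairwise judging with position-bias mitigation; `call_judge` stands in for the actual LLM-as-judge call and is an assumption, not a defined API:
+
+ ```python
+ def judge_pair(output_a: str, output_b: str, criteria: dict, call_judge) -> str:
+     """Compare two variation outputs with the presentation order flipped.
+
+     call_judge(first, second, criteria) is assumed to return "first" or
+     "second" (its chain-of-thought reasoning is omitted here).
+     Returns "A", "B", or "inconsistent" (flagged for review or extra calls).
+     """
+     verdict_ab = call_judge(output_a, output_b, criteria)  # A shown first
+     verdict_ba = call_judge(output_b, output_a, criteria)  # B shown first
+     if verdict_ab == "first" and verdict_ba == "second":
+         return "A"
+     if verdict_ab == "second" and verdict_ba == "first":
+         return "B"
+     return "inconsistent"
+ ```
+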
+ **Validation Checkpoint:**
1651
+ - [ ] Can generate 3 variations from base skill
1652
+ - [ ] Can execute all variations with same test input
1653
+ - [ ] Judge compares outputs and selects winner
1654
+ - [ ] Ranking produces consistent results
1655
+ - [ ] Background orchestration runs without blocking user
1656
+
1657
+ **Total Effort:** ~63 hours (~3 weeks)
1658
+
1659
+ ---
1660
+
1661
+ ### Phase 4: Convergence & Iteration (Weeks 9-10)
1662
+
1663
+ **Goal:** Multi-round tournaments, convergence detection, state persistence
1664
+
1665
+ **Tasks:**
1666
+
1667
+ - [ ] **Task 4.1:** Build tournament iteration loop
1668
+ - Complexity: Medium (8h)
1669
+ - Dependencies: Phase 3 complete
1670
+ - Owner: Orchestrator team
1671
+ - Deliverable: Winner advances to next round vs 2 new variations
1672
+
1673
+ - [ ] **Task 4.2:** Implement convergence detection (multi-criteria)
1674
+ - Complexity: Medium (6h)
1675
+ - Dependencies: Tournament loop
1676
+ - Owner: Orchestrator team
1677
+ - Deliverable: Stops when score plateau, time limit, or target met
1678
+
1679
+ - [ ] **Task 4.3:** Add adaptive tournament sizing
1680
+ - Complexity: Small (4h)
1681
+ - Dependencies: Complexity detection
1682
+ - Owner: Orchestrator team
1683
+ - Deliverable: Simple (3 rounds), Moderate (5), Complex (7)
1684
+
1685
+ - [ ] **Task 4.4:** Build state persistence system
1686
+ - Complexity: Medium (6h)
1687
+ - Dependencies: Tournament loop
1688
+ - Owner: Orchestrator team
1689
+ - Deliverable: Can resume arena if interrupted
1690
+
1691
+ - [ ] **Task 4.5:** Implement graceful interruption handling
1692
+ - Complexity: Small (4h)
1693
+ - Dependencies: State persistence
1694
+ - Owner: Orchestrator team
1695
+ - Deliverable: User can stop arena, resume later
1696
+
1697
+ - [ ] **Task 4.6:** Build arena completion notification
1698
+ - Complexity: Small (3h)
1699
+ - Dependencies: Convergence detection
1700
+ - Owner: UI team
1701
+ - Deliverable: User notified when v1.0 ready
1702
+
1703
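+ A hedged sketch of the multi-criteria stop check in Task 4.2 above; the thresholds mirror targets elsewhere in this PRD (2% plateau, 30-minute limit), but the function and its parameters are illustrative:
+
+ ```python
+ import time
+
+ def should_stop(score_history: list[float], started_at: float, *,
+                 plateau_pct: float = 2.0, plateau_rounds: int = 2,
+                 max_minutes: float = 30.0,
+                 target_score: float | None = None) -> tuple[bool, str]:
+     """Return (stop, reason) given the best score per completed round."""
+     if target_score is not None and score_history and score_history[-1] >= target_score:
+         return True, "target score reached"
+     if (time.time() - started_at) / 60.0 >= max_minutes:
+         return True, "time limit reached"
+     if len(score_history) > plateau_rounds:
+         window = score_history[-(plateau_rounds + 1):]
+         gain_pct = 100.0 * (window[-1] - window[0]) / max(window[0], 1e-9)
+         if gain_pct < plateau_pct:
+             return True, f"score plateau (<{plateau_pct}% over {plateau_rounds} rounds)"
+     return False, "continue"
+ ```
+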
+ **Validation Checkpoint:**
1704
+ - [ ] Arena runs multiple rounds until convergence
1705
+ - [ ] Convergence criteria working correctly (no infinite loops)
1706
+ - [ ] Can interrupt and resume arena
1707
+ - [ ] Notification appears when complete
1708
+ - [ ] Arena completes in < 30 min for moderate skills
1709
+
1710
+ **Total Effort:** ~31 hours (~1.5 weeks)
1711
+
1712
+ ---
1713
+
1714
+ ### Phase 5: User Validation & Feedback (Weeks 11-12)
1715
+
1716
+ **Goal:** Local storage, review UI, user scoring, feedback collection
1717
+
1718
+ **Tasks:**
1719
+
1720
+ - [ ] **Task 5.1:** Create local arena results storage
1721
+ - Complexity: Small (4h)
1722
+ - Dependencies: Output capture
1723
+ - Owner: Storage team
1724
+ - Deliverable: JSON files in `.claude/skills/.../arena_results/`
1725
+
1726
+ - [ ] **Task 5.2:** Build results browsing UI
1727
+ - Complexity: Medium (8h)
1728
+ - Dependencies: Local storage
1729
+ - Owner: UI team
1730
+ - Deliverable: User can view all round results
1731
+
1732
+ - [ ] **Task 5.3:** Implement side-by-side output comparison
1733
+ - Complexity: Medium (6h)
1734
+ - Dependencies: Results browsing
1735
+ - Owner: UI team
1736
+ - Deliverable: Compare outputs from different variations
1737
+
1738
+ - [ ] **Task 5.4:** Add user scoring interface (1-5 stars)
1739
+ - Complexity: Small (4h)
1740
+ - Dependencies: Comparison UI
1741
+ - Owner: UI team
1742
+ - Deliverable: User can rate each output
1743
+
1744
+ - [ ] **Task 5.5:** Build version comparison (v0.1 vs v1.0)
1745
+ - Complexity: Medium (6h)
1746
+ - Dependencies: Arena results
1747
+ - Owner: UI team
1748
+ - Deliverable: Shows improvement metrics (+15 points, etc.)
1749
+
1750
+ - [ ] **Task 5.6:** Implement feedback submission to database
1751
+ - Complexity: Small (3h)
1752
+ - Dependencies: User scoring, Pinecone API
1753
+ - Owner: API team
1754
+ - Deliverable: POST /feedback endpoint working
1755
+
1756
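+ A hedged sketch of the local storage in Task 5.1 above, writing one JSON file per round under the skill's `arena_results/` directory (field names are illustrative):
+
+ ```python
+ import json
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ def save_round_results(skill_name: str, round_number: int, results: dict) -> Path:
+     """Persist one arena round so the user can review it and the arena can resume."""
+     out_dir = Path(".claude/skills") / skill_name / "arena_results"
+     out_dir.mkdir(parents=True, exist_ok=True)
+     out_file = out_dir / f"round_{round_number:02d}.json"
+     out_file.write_text(json.dumps({
+         "round": round_number,
+         "saved_at": datetime.now(timezone.utc).isoformat(),
+         **results,  # e.g. variations, outputs, judge verdicts, scores
+     }, indent=2))
+     return out_file
+ ```
+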
+ **Validation Checkpoint:**
1757
+ - [ ] Arena results stored locally
1758
+ - [ ] User can review and compare outputs
1759
+ - [ ] User can score outputs and submit feedback
1760
+ - [ ] Feedback appears in database
1761
+ - [ ] UI is clear and usable
1762
+
1763
+ **Total Effort:** ~31 hours (~1.5 weeks)
1764
+
1765
+ ---
1766
+
1767
+ ### Phase 6: Collective Database Features (Weeks 13-14)
1768
+
1769
+ **Goal:** Skill submission, leaderboards, lineage tracking
1770
+
1771
+ **Tasks:**
1772
+
1773
+ - [ ] **Task 6.1:** Implement POST /submit API endpoint
1774
+ - Complexity: Medium (8h)
1775
+ - Dependencies: Pinecone database
1776
+ - Owner: API team
1777
+ - Deliverable: Can submit skills to collective
1778
+
1779
+ - [ ] **Task 6.2:** Build submission UI with privacy controls
1780
+ - Complexity: Medium (6h)
1781
+ - Dependencies: Arena completion
1782
+ - Owner: UI team
1783
+ - Deliverable: User prompted to submit, opt-in, privacy clear
1784
+
1785
+ - [ ] **Task 6.3:** Add champion comparison logic
1786
+ - Complexity: Small (3h)
1787
+ - Dependencies: Database query, arena scores
1788
+ - Owner: API team
1789
+ - Deliverable: Detect when user skill beats database champions
1790
+
1791
+ - [ ] **Task 6.4:** Implement GET /leaderboard endpoint
1792
+ - Complexity: Small (4h)
1793
+ - Dependencies: Pinecone database
1794
+ - Owner: API team
1795
+ - Deliverable: Returns top skills by domain
1796
+
1797
+ - [ ] **Task 6.5:** Build leaderboard display UI
1798
+ - Complexity: Medium (6h)
1799
+ - Dependencies: Leaderboard API
1800
+ - Owner: UI team
1801
+ - Deliverable: Shows top skills with scores, ratings
1802
+
1803
+ - [ ] **Task 6.6:** Add skill lineage tracking
1804
+ - Complexity: Small (4h)
1805
+ - Dependencies: Submission API
1806
+ - Owner: API team
1807
+ - Deliverable: parent_id, generation, improvement_pct stored
1808
+
1809
+ - [ ] **Task 6.7:** Implement ELO rating calculation
1810
+ - Complexity: Medium (5h)
1811
+ - Dependencies: Usage data, feedback
1812
+ - Owner: API team
1813
+ - Deliverable: Skills have ELO ratings that update
1814
+
1815
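+ Task 6.7 above keeps an ELO rating per skill. A standard Elo update, shown as a hedged sketch (the K-factor and 1500 baseline are conventional defaults, not values fixed by this PRD):
+
+ ```python
+ def update_elo(rating_a: float, rating_b: float, a_won: bool,
+                k: float = 32.0) -> tuple[float, float]:
+     """Return updated (rating_a, rating_b) after one head-to-head outcome."""
+     expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
+     score_a = 1.0 if a_won else 0.0
+     new_a = rating_a + k * (score_a - expected_a)
+     new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
+     return new_a, new_b
+
+ # e.g. update_elo(1500, 1500, a_won=True) -> (1516.0, 1484.0)
+ ```
+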
+ **Validation Checkpoint:**
1816
+ - [ ] Can submit skills to collective
1817
+ - [ ] Leaderboard shows top skills accurately
1818
+ - [ ] Lineage tracking works (can trace ancestry)
1819
+ - [ ] ELO ratings update based on usage
1820
+ - [ ] Privacy controls working (anonymized data)
1821
+
1822
+ **Total Effort:** ~36 hours (~1.5 weeks)
1823
+
1824
+ ---
1825
+
1826
+ ### Phase 7: Testing & Polish (Weeks 15-16)
1827
+
1828
+ **Goal:** Comprehensive testing, bug fixes, performance optimization, documentation
1829
+
1830
+ **Tasks:**
1831
+
1832
+ - [ ] **Task 7.1:** Write comprehensive unit tests
1833
+ - Complexity: Large (16h)
1834
+ - Dependencies: All features implemented
1835
+ - Owner: QA team
1836
+ - Deliverable: > 80% code coverage
1837
+
1838
+ - [ ] **Task 7.2:** Write integration tests
1839
+ - Complexity: Large (12h)
1840
+ - Dependencies: All features implemented
1841
+ - Owner: QA team
1842
+ - Deliverable: All workflows tested end-to-end
1843
+
1844
+ - [ ] **Task 7.3:** Write E2E tests (user journeys)
1845
+ - Complexity: Medium (10h)
1846
+ - Dependencies: All features implemented
1847
+ - Owner: QA team
1848
+ - Deliverable: Key user journeys automated
1849
+
1850
+ - [ ] **Task 7.4:** Performance testing and optimization
1851
+ - Complexity: Medium (8h)
1852
+ - Dependencies: All features implemented
1853
+ - Owner: Performance team
1854
+ - Deliverable: Meet all performance targets (< 30s base, < 30min arena, etc.)
1855
+
1856
+ - [ ] **Task 7.5:** Bug fixes from testing
1857
+ - Complexity: Variable (16h estimated)
1858
+ - Dependencies: Tests written
1859
+ - Owner: All teams
1860
+ - Deliverable: All critical bugs fixed, P1 bugs addressed
1861
+
1862
+ - [ ] **Task 7.6:** Write user documentation
1863
+ - Complexity: Medium (8h)
1864
+ - Dependencies: All features implemented
1865
+ - Owner: Docs team
1866
+ - Deliverable: README, usage guide, troubleshooting
1867
+
1868
+ - [ ] **Task 7.7:** Create example arena results
1869
+ - Complexity: Small (4h)
1870
+ - Dependencies: Arena working
1871
+ - Owner: Docs team
1872
+ - Deliverable: Example outputs for documentation
1873
+
1874
+ **Validation Checkpoint:**
1875
+ - [ ] All tests passing (unit, integration, E2E)
1876
+ - [ ] Performance benchmarks met
1877
+ - [ ] Zero critical bugs, minimal P1 bugs
1878
+ - [ ] Documentation complete and clear
1879
+ - [ ] System ready for production
1880
+
1881
+ **Total Effort:** ~74 hours (~3.5 weeks)
1882
+
1883
+ ---
1884
+
1885
+ ### Phase 8: Deployment & Rollout (Weeks 17-18)
1886
+
1887
+ **Goal:** Deploy to production, gradual rollout, monitoring, iteration
1888
+
1889
+ **Tasks:**
1890
+
1891
+ - [ ] **Task 8.1:** Deploy serverless API to production
1892
+ - Complexity: Small (4h)
1893
+ - Dependencies: API tested
1894
+ - Owner: DevOps team
1895
+ - Deliverable: API live at production URL
1896
+
1897
+ - [ ] **Task 8.2:** Set up Pinecone production database
1898
+ - Complexity: Small (2h)
1899
+ - Dependencies: Schema finalized
1900
+ - Owner: DevOps team
1901
+ - Deliverable: Production Pinecone index ready
1902
+
1903
+ - [ ] **Task 8.3:** Deploy enhanced skill_creating skill
1904
+ - Complexity: Small (2h)
1905
+ - Dependencies: All features tested
1906
+ - Owner: DevOps team
1907
+ - Deliverable: Updated SKILL.md deployed
1908
+
1909
+ - [ ] **Task 8.4:** Set up monitoring and alerting
1910
+ - Complexity: Small (4h)
1911
+ - Dependencies: API deployed
1912
+ - Owner: DevOps team
1913
+ - Deliverable: Track API uptime, error rates, arena completion rates
1914
+
1915
+ - [ ] **Task 8.5:** Gradual rollout (beta users first)
1916
+ - Complexity: Small (2h)
1917
+ - Dependencies: Monitoring set up
1918
+ - Owner: Product team
1919
+ - Deliverable: 10 beta users test system
1920
+
1921
+ - [ ] **Task 8.6:** Gather feedback from beta users
1922
+ - Complexity: Medium (8h)
1923
+ - Dependencies: Beta rollout
1924
+ - Owner: Product team
1925
+ - Deliverable: Feedback collected, prioritized
1926
+
1927
+ - [ ] **Task 8.7:** Iterate based on feedback
1928
+ - Complexity: Medium (8h)
1929
+ - Dependencies: Feedback gathered
1930
+ - Owner: All teams
1931
+ - Deliverable: Top 3-5 improvements implemented
1932
+
1933
+ - [ ] **Task 8.8:** Full public launch
1934
+ - Complexity: Small (2h)
1935
+ - Dependencies: Beta successful
1936
+ - Owner: Product team
1937
+ - Deliverable: Announcement, full availability
1938
+
1939
+ **Validation Checkpoint:**
1940
+ - [ ] API stable and responding correctly
1941
+ - [ ] Beta users successfully creating and optimizing skills
1942
+ - [ ] Monitoring shows healthy metrics
1943
+ - [ ] Feedback incorporated
1944
+ - [ ] Ready for public launch
1945
+
1946
+ **Total Effort:** ~32 hours (~1.5 weeks)
1947
+
1948
+ ---
1949
+
1950
+ ### Task Dependencies Visualization
1951
+
1952
+ ```
1953
+ Phase 1 (Foundation):
1954
+ 1.1 (Pinecone) → 1.2 (Embeddings) → 1.3 (Search API)
1955
+ 1.4 (Requirements) ──────────────────┐
1956
+ 1.5 (Pattern Research) ──────────────┤
1957
+ 1.6 (Domain Research) ───────────────┴─→ Phase 2
1958
+
1959
+ Phase 2 (Base Generation):
1960
+ [Phase 1] → 2.1 (Questions) → 2.2 (Weights) → 2.3 (Template) → 2.4 (Deploy)
1961
+ 1.3 (Search) ────────────────────────────────────────┐
1962
+ 2.4 (Deploy) ────────────────────────────────────────┴─→ 2.5 (Workflow)
1963
+ 2.4 (Deploy) → 2.6 (Quick Scoring)
1964
+
1965
+ Phase 3 (Arena Core):
1966
+ 3.1 (Orchestrator) ──────────────────────────────────┐
1967
+ 3.2 (Variation Gen) ─────────────────────────────────┤
1968
+ 3.3 (Test Data) ─────────────────────────────────────┤
1969
+ 3.4 (Execution) → 3.5 (Output Capture) ──────────────┼─→ 3.8 (Ranking)
1970
+ 3.6 (Judge) → 3.7 (Bias Mitigation) ─────────────────┘
1971
+
1972
+ Phase 4 (Convergence):
1973
+ [Phase 3] → 4.1 (Tournament Loop) → 4.2 (Convergence) → 4.6 (Notification)
1974
+ 4.1 → 4.3 (Adaptive Sizing)
1975
+ 4.1 → 4.4 (State Persist) → 4.5 (Interruption)
1976
+
1977
+ Phase 5 (User Validation):
1978
+ 3.5 (Output Capture) → 5.1 (Local Storage) → 5.2 (Browse UI)
1979
+ 5.2 → 5.3 (Comparison) → 5.4 (User Scoring) → 5.6 (Feedback API)
1980
+ 4.6 (Notification) → 5.5 (Version Compare)
1981
+
1982
+ Phase 6 (Collective):
1983
+ 1.1 (Pinecone) ──────────────────────────────────────┐
1984
+ 4.2 (Convergence) → 6.1 (Submit API) → 6.2 (Submit UI)│
1985
+ 6.1 → 6.3 (Champion Compare) ───────────────────────┤
1986
+ 6.1 → 6.4 (Leaderboard API) → 6.5 (Leaderboard UI) ─┤
1987
+ 6.1 → 6.6 (Lineage) ─────────────────────────────────┤
1988
+ 6.1 → 6.7 (ELO) ─────────────────────────────────────┘
1989
+
1990
+ Phase 7 (Testing):
1991
+ [All Phases] → 7.1 (Unit) ─┐
1992
+ 7.2 (Integration) ─┼→ 7.5 (Bug Fixes) → 7.6 (Docs)
1993
+ 7.3 (E2E) ─┘ → 7.4 (Performance) → 7.7 (Examples)
1994
+
1995
+ Phase 8 (Deployment):
1996
+ 7.5 (Bug Fixes) → 8.1 (API Deploy) ─┐
1997
+ 7.5 → 8.2 (Pinecone Prod) ──────────┼→ 8.4 (Monitoring) → 8.5 (Beta)
1998
+ 7.5 → 8.3 (Skill Deploy) ───────────┘
1999
+ 8.5 → 8.6 (Feedback) → 8.7 (Iterate) → 8.8 (Launch)
2000
+
2001
+ Critical Path:
2002
+ 1.1 → 1.2 → 1.3 → 2.5 → 3.1 → 3.6 → 4.1 → 4.2 → 5.1 → 6.1 → 7.5 → 8.8
2003
+ ```
2004
+
2005
+ ---
2006
+
2007
+ ### Effort Estimation
2008
+
2009
+ **Total Estimated Effort by Phase:**
2010
+ - Phase 1 (Foundation): 33 hours
2011
+ - Phase 2 (Base Generation): 34 hours
2012
+ - Phase 3 (Arena Core): 63 hours
2013
+ - Phase 4 (Convergence): 31 hours
2014
+ - Phase 5 (User Validation): 31 hours
2015
+ - Phase 6 (Collective): 36 hours
2016
+ - Phase 7 (Testing): 74 hours
2017
+ - Phase 8 (Deployment): 32 hours
2018
+
2019
+ **Total: ~334 hours**
2020
+
2021
+ **With 2-person team:**
2022
+ - ~167 hours per person
2023
+ - ~21 weeks at 8 hours/week
2024
+ - **~5 months calendar time**
2025
+
2026
+ **With 3-person team:**
2027
+ - ~111 hours per person
2028
+ - ~14 weeks at 8 hours/week
2029
+ - **~3.5 months calendar time**
2030
+
2031
+ **Risk Buffer:** +25% (84 hours) for unknowns, integration challenges, and iteration
2032
+
2033
+ **Final Estimate with Buffer:**
2034
+ - 2-person team: **~6-7 months**
2035
+ - 3-person team: **~4-5 months**
2036
+
2037
+ **MVP (Phases 1-5 only):**
2038
+ - Total: 192 hours
2039
+ - 2-person team: ~3 months
2040
+ - 3-person team: ~2 months
2041
+
2042
+ ---
2043
+
2044
+ ## Out of Scope
2045
+
2046
+ Explicitly NOT included in this release:
2047
+
2048
+ ### 1. Server Farm Optimization (Future Phase 2)
2049
+
2050
+ **What:** Centralized server farms running continuous global tournaments of all submitted skills to find absolute best across all users.
2051
+
2052
+ **Why Out of Scope:**
2053
+ - Requires significant infrastructure ($$)
2054
+ - MVP focuses on local execution (user pays own tokens)
2055
+ - Can be added later without changing architecture
2056
+ - Future enhancement: 24/7 deep research and optimization
2057
+
2058
+ **Future Consideration:** Document for Phase 2, 6-12 months post-launch
2059
+
2060
+ ---
2061
+
2062
+ ### 2. Multi-Language Skill Support
2063
+
2064
+ **What:** Skills that work across multiple programming languages or natural languages.
2065
+
2066
+ **Why Out of Scope:**
2067
+ - MVP focuses on English, code-agnostic skills
2068
+ - Adds complexity to question generation and judging
2069
+ - Limited user demand in initial research
2070
+
2071
+ **Future Consideration:** Add if international users request it
2072
+
2073
+ ---
2074
+
2075
+ ### 3. Skill Marketplace / Monetization
2076
+
2077
+ **What:** Paid skills, premium collective access, skill authors earning revenue.
2078
+
2079
+ **Why Out of Scope:**
2080
+ - Collective is free and open initially
2081
+ - Monetization complicates launch
2082
+ - Focus on quality and adoption first
2083
+
2084
+ **Future Consideration:** Evaluate after 100+ skills and 1000+ users
2085
+
2086
+ ---
2087
+
2088
+ ### 4. IDE Integration Beyond Claude Code
2089
+
2090
+ **What:** VS Code extension, JetBrains plugin, etc.
2091
+
2092
+ **Why Out of Scope:**
2093
+ - MVP is Claude Code only
2094
+ - Different architecture for each IDE
2095
+ - Limited resources
2096
+
2097
+ **Future Consideration:** Partner integrations post-launch
2098
+
2099
+ ---
2100
+
2101
+ ### 5. Advanced Judge Models (GPT-4, Gemini, etc.)
2102
+
2103
+ **What:** Support for non-Anthropic judge models.
2104
+
2105
+ **Why Out of Scope:**
2106
+ - Anthropic models (Sonnet, Opus) sufficient for MVP
2107
+ - Cross-platform adds complexity
2108
+ - Focus on one provider initially
2109
+
2110
+ **Future Consideration:** Add multi-model support if users request it
2111
+
2112
+ ---
2113
+
2114
+ ### 6. Real-Time Collaboration on Skills
2115
+
2116
+ **What:** Multiple users co-creating skills simultaneously.
2117
+
2118
+ **Why Out of Scope:**
2119
+ - MVP is single-user workflow
2120
+ - Requires collaborative editing infrastructure
2121
+ - Not in initial requirements
2122
+
2123
+ **Future Consideration:** If teams request it
2124
+
2125
+ ---
2126
+
2127
+ ### 7. Automated Skill Maintenance
2128
+
2129
+ **What:** System automatically updates skills when dependencies change or new best practices emerge.
2130
+
2131
+ **Why Out of Scope:**
2132
+ - Requires continuous monitoring
2133
+ - Risk of breaking working skills
2134
+ - Manual updates sufficient for MVP
2135
+
2136
+ **Future Consideration:** Auto-suggest updates, user approves
2137
+
2138
+ ---
2139
+
2140
+ ### 8. Skill Analytics Dashboard
2141
+
2142
+ **What:** Detailed analytics on skill usage, performance, user demographics.
2143
+
2144
+ **Why Out of Scope:**
2145
+ - Basic metrics sufficient for MVP (ratings, usage count)
2146
+ - Privacy concerns with detailed tracking
2147
+ - Focus on core functionality first
2148
+
2149
+ **Future Consideration:** Opt-in analytics for skill authors
2150
+
2151
+ ---
2152
+
2153
+ ## Open Questions & Risks
2154
+
2155
+ ### Open Questions
2156
+
2157
+ #### Q1: What judge model should we use?
2158
+
2159
+ **Current Status:** Considering Opus (highest quality) vs Sonnet/Haiku (faster, cheaper)
2160
+
2161
+ **Options:**
2162
+ - **A)** Always use Opus for judging (consistent, highest quality)
2163
+ - **B)** Haiku for early rounds, Opus for finals (optimized cost)
2164
+ - **C)** User chooses in questions (flexibility)
2165
+
2166
+ **Owner:** Technical lead
2167
+
2168
+ **Deadline:** End of Phase 2 (before arena implementation)
2169
+
2170
+ **Impact:** Medium (affects arena runtime and cost)
2171
+
2172
+ **Recommendation:** Option B (progressive) - balances quality and cost. Early rounds don't need Opus-level precision, finals do.
2173
+
2174
+ ---
2175
+
2176
+ #### Q2: How many test scenarios per round?
2177
+
2178
+ **Current Status:** Considering 1 vs 3 scenarios per round
2179
+
2180
+ **Options:**
2181
+ - **A)** 1 scenario per round (faster, less reliable)
2182
+ - **B)** 3 scenarios per round (slower, more reliable)
2183
+ - **C)** Adaptive based on skill complexity (1 for simple, 3 for complex)
2184
+
2185
+ **Owner:** Arena architect
2186
+
2187
+ **Deadline:** End of Phase 3 (before convergence testing)
2188
+
2189
+ **Impact:** High (affects arena reliability and runtime)
2190
+
2191
+ **Recommendation:** Option C (adaptive) - simple skills don't need extensive testing, complex skills do.
2192
+
2193
+ ---
2194
+
2195
+ #### Q3: Should we allow user override of convergence criteria?
2196
+
2197
+ **Current Status:** System auto-detects convergence
2198
+
2199
+ **Options:**
2200
+ - **A)** Fully automatic (no user control)
2201
+ - **B)** User can set max time/rounds (simple override)
2202
+ - **C)** User can configure all criteria (full control)
2203
+
2204
+ **Owner:** Product team
2205
+
2206
+ **Deadline:** End of Phase 4 (convergence implementation)
2207
+
2208
+ **Impact:** Low (nice-to-have, not critical)
2209
+
2210
+ **Recommendation:** Option B (simple override) - most users trust defaults, power users want time control.
2211
+
2212
+ ---
2213
+
2214
+ #### Q4: How should we handle skill activation conflicts?
2215
+
2216
+ **Current Status:** Multiple skills might activate on same trigger
2217
+
2218
+ **Options:**
2219
+ - **A)** First match wins (simple, may not be best)
2220
+ - **B)** Highest-scored skill wins (quality-focused)
2221
+ - **C)** User prompted to choose (manual, slower)
2222
+
2223
+ **Owner:** skill_creating maintainer
2224
+
2225
+ **Deadline:** End of Phase 2 (affects base generation)
2226
+
2227
+ **Impact:** Medium (affects user experience)
2228
+
2229
+ **Recommendation:** Option B (highest-scored) - leverages arena scores for automatic quality selection.
2230
+
2231
+ ---
2232
+
2233
+ #### Q5: Should database submissions be moderated?
2234
+
2235
+ **Current Status:** Considering auto-accept vs review queue
2236
+
2237
+ **Options:**
2238
+ - **A)** Auto-accept all submissions (fast, risk of spam)
2239
+ - **B)** Auto-scan for dangerous patterns, flag suspicious (balanced)
2240
+ - **C)** Manual review all submissions (slow, thorough)
2241
+
2242
+ **Owner:** Security team
2243
+
2244
+ **Deadline:** End of Phase 6 (before collective launch)
2245
+
2246
+ **Impact:** High (affects collective quality and safety)
2247
+
2248
+ **Recommendation:** Option B (auto-scan + flag) - prevents obvious abuse, human review for edge cases.
2249
+
2250
+ ---
2251
+
2252
+ ### Risks & Mitigation
2253
+
2254
+ | Risk | Likelihood | Impact | Severity | Mitigation | Contingency |
2255
+ |------|------------|--------|----------|------------|-------------|
2256
+ | Arena takes longer than 30 min target | Medium | Medium | **Medium** | Adaptive tournament sizing (simple skills fewer rounds), early convergence detection, user can interrupt | User can keep using v0.1 if arena takes too long, improve convergence criteria based on data |
2257
+ | Test data not realistic enough | Medium | High | **High** | Web search for current examples, persona-based generation, evolution refinement, user validation scoring | Cache proven realistic scenarios, allow users to provide examples, improve generation prompts |
2258
+ | Judge produces inconsistent results | Medium | High | **High** | Multiple comparisons with order randomization, chain-of-thought required, user review override, aggregate multiple calls | Flag inconsistent judgments for human review, use ensemble of judges, tune prompts |
2259
+ | User machine offline during arena | Low | Medium | **Low** | State persistence (save progress), resume capability, graceful degradation (partial results usable) | Allow arena restart from last checkpoint, warn users before starting |
2260
+ | Pinecone API down or rate limited | Low | High | **Medium** | Graceful degradation (create skills without search), retry logic with exponential backoff, cache common queries | Fall back to local-only mode, queue submissions for later |
2261
+ | Malicious skill submissions | Low | Critical | **High** | Automated scanning for dangerous patterns (rm -rf, curl to unknown domains), review queue for flagged skills, rate limiting per user, community flagging | Manual review, blocklist, user reputation system |
2262
+ | High API costs (OpenAI embeddings) | Medium | Low | **Low** | Cache embeddings aggressively, batch requests, use efficient models (text-embed-3-small), minimal re-embedding | User pays own API costs, optimize caching strategy |
2263
+ | Low skill submission rate | Medium | Medium | **Medium** | Encourage submissions (special notification if beats champion), gamification (leaderboards), showcase benefits (others using your skill) | Pre-populate database with high-quality seed skills, improve submission UX |
2264
+ | User confusion about v0.1 vs v1.0 | Medium | Low | **Low** | Clear labeling (v0.1 "Quick Base", v1.0 "Arena Optimized"), show improvement metrics (+15 points), user can test both | Improve UI copy, add tooltips, documentation |
2265
+
2266
+ ---
2267
+
2268
+ ## Validation Checkpoints
2269
+
2270
+ ### Checkpoint 1: End of Phase 1 (Foundation)
2271
+
2272
+ **Criteria:**
2273
+ - [ ] Pinecone database set up and queryable
2274
+ - [ ] Can generate embeddings and search semantically
2275
+ - [ ] Search returns relevant results in < 2s
2276
+ - [ ] Can extract requirements from user requests
2277
+ - [ ] Quick research (pattern + domain) completes in < 20s
2278
+ - [ ] Unit tests passing for all components
2279
+
2280
+ **If Failed:**
2281
+ - Debug Pinecone integration
2282
+ - Optimize search query performance
2283
+ - Improve requirement extraction accuracy
2284
+ - Don't proceed to Phase 2 until stable
2285
+
2286
+ ---
2287
+
2288
+ ### Checkpoint 2: End of Phase 2 (Base Generation)
2289
+
2290
+ **Criteria:**
2291
+ - [ ] Question generator produces relevant domain-specific questions
2292
+ - [ ] User answers convert to reasonable weights
2293
+ - [ ] Base skill (v0.1) generates in < 30s
2294
+ - [ ] v0.1 skill is valid SKILL.md and activates correctly
2295
+ - [ ] Database-first workflow works end-to-end (query → select → deploy)
2296
+ - [ ] Integration tests passing
2297
+
2298
+ **If Failed:**
2299
+ - Refine question templates by domain
2300
+ - Fix skill template generation bugs
2301
+ - Improve database query UI
2302
+ - Don't proceed to Phase 3 until v0.1 generation reliable
2303
+
2304
+ ---
2305
+
2306
+ ### Checkpoint 3: End of Phase 3 (Arena Core)
2307
+
2308
+ **Criteria:**
2309
+ - [ ] Can generate 3 variations from base skill
2310
+ - [ ] All variations execute with same test input
2311
+ - [ ] Test data is realistic (validated by spot-checking)
2312
+ - [ ] Judge compares outputs and selects winner with reasoning
2313
+ - [ ] Bradley-Terry ranking produces consistent results
2314
+ - [ ] Background orchestration runs without blocking user
2315
+ - [ ] Integration tests for arena passing
2316
+
2317
+ **If Failed:**
2318
+ - Improve variation generator diversity
2319
+ - Enhance test data realism validation
2320
+ - Tune judge prompts for consistency
2321
+ - Don't proceed to Phase 4 until core arena reliable
2322
+
2323
+ ---
2324
+
2325
+ ### Checkpoint 4: End of Phase 4 (Convergence)
2326
+
2327
+ **Criteria:**
2328
+ - [ ] Arena runs multiple rounds until convergence
2329
+ - [ ] Convergence detects score plateau correctly
2330
+ - [ ] Time and iteration limits prevent infinite loops
2331
+ - [ ] Can interrupt and resume arena from saved state
2332
+ - [ ] Arena completes in < 30 min for moderate skills (test with 5+ examples)
2333
+ - [ ] User receives notification when arena completes
2334
+
2335
+ **If Failed:**
2336
+ - Adjust convergence thresholds (may need < 2% → < 3%)
2337
+ - Optimize arena performance (reduce round time)
2338
+ - Fix state persistence bugs
2339
+ - Don't proceed to Phase 5 until convergence robust
2340
+
2341
+ ---
2342
+
2343
+ ### Checkpoint 5: End of Phase 5 (User Validation)
2344
+
2345
+ **Criteria:**
2346
+ - [ ] Arena results stored locally and readable
2347
+ - [ ] User can browse all round results
2348
+ - [ ] Side-by-side comparison shows outputs clearly
2349
+ - [ ] User can score outputs (1-5 stars)
2350
+ - [ ] Version comparison (v0.1 vs v1.0) shows improvement metrics
2351
+ - [ ] Feedback submission to database works
2352
+
2353
+ **If Failed:**
2354
+ - Improve UI clarity and usability
2355
+ - Fix local storage bugs
2356
+ - Ensure feedback API working
2357
+ - Don't proceed to Phase 6 until user can validate effectively
2358
+
2359
+ ---
2360
+
2361
+ ### Checkpoint 6: End of Phase 6 (Collective)
2362
+
2363
+ **Criteria:**
2364
+ - [ ] Can submit skills to collective database
2365
+ - [ ] Submissions include all required metadata (scores, weights, lineage)
2366
+ - [ ] Leaderboard shows top skills accurately
2367
+ - [ ] Lineage tracking works (can trace parent → child)
2368
+ - [ ] ELO ratings update based on usage
2369
+ - [ ] Privacy controls working (anonymized data, no PII)
2370
+ - [ ] Champion comparison detects when user skill beats database leaders
2371
+
2372
+ **If Failed:**
2373
+ - Fix submission API bugs
2374
+ - Improve leaderboard ranking algorithm
2375
+ - Ensure privacy compliance
2376
+ - Don't proceed to Phase 7 until collective reliable
2377
+
2378
+ ---
2379
+
2380
+ ### Checkpoint 7: End of Phase 7 (Testing)
2381
+
2382
+ **Criteria:**
2383
+ - [ ] Unit test coverage > 80%
2384
+ - [ ] All integration tests passing
2385
+ - [ ] E2E tests cover key user journeys
2386
+ - [ ] Performance benchmarks met:
2387
+ - [ ] Base generation < 30s (p95)
2388
+ - [ ] Database query < 2s (p95)
2389
+ - [ ] Arena completion < 30 min moderate skills (p95)
2390
+ - [ ] Zero critical bugs (P0)
2391
+ - [ ] P1 bugs addressed or accepted as known issues
2392
+ - [ ] Documentation complete (README, usage guide, troubleshooting)
2393
+
2394
+ **If Failed:**
2395
+ - Fix critical bugs before launch
2396
+ - Optimize performance to meet benchmarks
2397
+ - Complete documentation
2398
+ - Don't proceed to Phase 8 until quality gates met
2399
+
2400
+ ---
2401
+
2402
+ ### Checkpoint 8: Production Launch
2403
+
2404
+ **Criteria (at each rollout stage):**
2405
+
2406
+ **Beta (10 users):**
2407
+ - [ ] At least 5 skills created successfully
2408
+ - [ ] At least 2 arenas completed successfully
2409
+ - [ ] Error rate < 5% (acceptable for beta)
2410
+ - [ ] User feedback collected (survey or interviews)
2411
+
2412
+ **If Beta Failed:**
2413
+ - Fix bugs discovered
2414
+ - Improve UX based on feedback
2415
+ - Iterate before wider rollout
2416
+
2417
+ **Public Launch (all users):**
2418
+ - [ ] Beta successful with no critical issues
2419
+ - [ ] Monitoring shows healthy metrics (API uptime > 99%, error rate < 1%)
2420
+ - [ ] Database has at least 10 seed skills (pre-populated)
2421
+ - [ ] Documentation published and accessible
2422
+ - [ ] Support channel ready (GitHub issues, Discord, etc.)
2423
+
2424
+ **If Public Launch Failed:**
2425
+ - Rollback to beta (disable for most users)
2426
+ - Fix issues
2427
+ - Re-launch when stable
2428
+
2429
+ ---
2430
+
2431
+ ## Appendix: Task Breakdown Hints
2432
+
2433
+ ### Suggested Taskmaster Task Structure
2434
+
2435
+ **Phase 1: Foundation (6 tasks, ~33 hours)**
2436
+ 1. Set up Pinecone database and schema (6h) - Dependencies: None
2437
+ 2. Implement OpenAI embeddings generation (3h) - Dependencies: Task 1
2438
+ 3. Build POST /search API endpoint (8h) - Dependencies: Tasks 1, 2
2439
+ 4. Implement requirement extraction (6h) - Dependencies: None
2440
+ 5. Build quick pattern research (4h) - Dependencies: None
2441
+ 6. Build quick domain research via WebSearch (6h) - Dependencies: None
2442
+
2443
+ **Phase 2: Base Generation (6 tasks, ~34 hours)**
2444
+ 7. Build question generator agent (8h) - Dependencies: Tasks 5, 6
2445
+ 8. Implement answer-to-weight conversion (5h) - Dependencies: Task 7
2446
+ 9. Create base skill template engine (6h) - Dependencies: Tasks 5, 6, 7
2447
+ 10. Implement skill deployment automation (3h) - Dependencies: Task 9
2448
+ 11. Build database-first workflow integration (8h) - Dependencies: Tasks 3, 10
2449
+ 12. Add optional quick scoring of v0.1 (4h) - Dependencies: Task 10
2450
+
2451
+ **Phase 3: Arena Core (8 tasks, ~63 hours)**
2452
+ 13. Implement orchestrator-worker pattern via Task tool (12h) - Dependencies: None
2453
+ 14. Build skill variation generator agent (10h) - Dependencies: Task 9
2454
+ 15. Build realistic test data generator with web search (8h) - Dependencies: None
2455
+ 16. Implement skill execution sandbox (8h) - Dependencies: None
2456
+ 17. Build output capture system (4h) - Dependencies: Task 16
2457
+ 18. Implement LLM-as-judge with pairwise comparison (12h) - Dependencies: Task 8
2458
+ 19. Add position bias mitigation (3h) - Dependencies: Task 18
2459
+ 20. Implement Bradley-Terry ranking (6h) - Dependencies: Task 18
2460
+
2461
+ **Phase 4: Convergence (6 tasks, ~31 hours)**
2462
+ 21. Build tournament iteration loop (8h) - Dependencies: Phase 3 complete
2463
+ 22. Implement convergence detection (6h) - Dependencies: Task 21
2464
+ 23. Add adaptive tournament sizing (4h) - Dependencies: Task 21
2465
+ 24. Build state persistence system (6h) - Dependencies: Task 21
2466
+ 25. Implement graceful interruption handling (4h) - Dependencies: Task 24
2467
+ 26. Build arena completion notification (3h) - Dependencies: Task 22
2468
+
2469
+ **Phase 5: User Validation (6 tasks, ~31 hours)**
2470
+ 27. Create local arena results storage (4h) - Dependencies: Task 17
2471
+ 28. Build results browsing UI (8h) - Dependencies: Task 27
2472
+ 29. Implement side-by-side output comparison (6h) - Dependencies: Task 28
2473
+ 30. Add user scoring interface (4h) - Dependencies: Task 29
2474
+ 31. Build version comparison (v0.1 vs v1.0) (6h) - Dependencies: Task 27
2475
+ 32. Implement feedback submission to database (3h) - Dependencies: Task 30
2476
+
2477
+ **Phase 6: Collective (7 tasks, ~36 hours)**
2478
+ 33. Implement POST /submit API endpoint (8h) - Dependencies: Task 1
2479
+ 34. Build submission UI with privacy controls (6h) - Dependencies: Task 26
2480
+ 35. Add champion comparison logic (3h) - Dependencies: Tasks 3, 22
2481
+ 36. Implement GET /leaderboard endpoint (4h) - Dependencies: Task 1
2482
+ 37. Build leaderboard display UI (6h) - Dependencies: Task 36
2483
+ 38. Add skill lineage tracking (4h) - Dependencies: Task 33
2484
+ 39. Implement ELO rating calculation (5h) - Dependencies: Task 33
2485
+
2486
+ **Phase 7: Testing (7 tasks, ~74 hours)**
2487
+ 40. Write comprehensive unit tests (16h) - Dependencies: All features
2488
+ 41. Write integration tests (12h) - Dependencies: All features
2489
+ 42. Write E2E tests (10h) - Dependencies: All features
2490
+ 43. Performance testing and optimization (8h) - Dependencies: All features
2491
+ 44. Bug fixes from testing (16h) - Dependencies: Tasks 40-43
2492
+ 45. Write user documentation (8h) - Dependencies: Task 44
2493
+ 46. Create example arena results (4h) - Dependencies: Task 44
2494
+
2495
+ **Phase 8: Deployment (8 tasks, ~32 hours)**
2496
+ 47. Deploy serverless API to production (4h) - Dependencies: Task 44
2497
+ 48. Set up Pinecone production database (2h) - Dependencies: Task 44
2498
+ 49. Deploy enhanced skill_creating skill (2h) - Dependencies: Task 44
2499
+ 50. Set up monitoring and alerting (4h) - Dependencies: Task 47
2500
+ 51. Gradual rollout (beta users) (2h) - Dependencies: Task 50
2501
+ 52. Gather feedback from beta users (8h) - Dependencies: Task 51
2502
+ 53. Iterate based on feedback (8h) - Dependencies: Task 52
2503
+ 54. Full public launch (2h) - Dependencies: Task 53
2504
+
2505
+ **Total: 54 tasks, ~334 hours**
2506
+
2507
+ ---
2508
+
2509
+ ### Parallelizable Tasks
2510
+
2511
+ **Can work in parallel:**
2512
+
2513
+ **Phase 1:**
2514
+ - Tasks 1-3 (sequential), Tasks 4-6 (parallel)
2515
+
2516
+ **Phase 2:**
2517
+ - Task 7 (after 5-6), Tasks 8-10 (mostly sequential), Task 11 (integrates), Task 12 (parallel with 11)
2518
+
2519
+ **Phase 3:**
2520
+ - Tasks 13-15 (parallel), Tasks 16-17 (sequential), Tasks 18-20 (sequential)
2521
+ - Two sub-teams: Orchestration (13-15) and Execution+Judge (16-20)
2522
+
2523
+ **Phase 4:**
2524
+ - Tasks 21-26 mostly sequential (each depends on previous)
2525
+
2526
+ **Phase 5:**
2527
+ - Tasks 27-32 mostly sequential, but Task 31 parallel with 28-30
2528
+
2529
+ **Phase 6:**
2530
+ - Task 33 first; Tasks 34-39 can then run in parallel
2531
+
2532
+ **Phase 7:**
2533
+ - Tasks 40-43 parallel, Task 44 after all tests, Tasks 45-46 parallel
2534
+
2535
+ **Phase 8:**
2536
+ - Tasks 47-49 parallel, Task 50 after 47, Tasks 51-54 sequential
2537
+
2538
+ ---
2539
+
2540
+ ### Critical Path Tasks
2541
+
2542
+ **Critical path (longest dependency chain):**
2543
+ 1. Set up Pinecone (Task 1)
2544
+ 2. Build embeddings (Task 2)
2545
+ 3. Build search API (Task 3)
2546
+ 4. Build workflow integration (Task 11)
2547
+ 5. Implement orchestrator (Task 13)
2548
+ 6. Implement judge (Task 18)
2549
+ 7. Build tournament loop (Task 21)
2550
+ 8. Implement convergence (Task 22)
2551
+ 9. Create local storage (Task 27)
2552
+ 10. Implement submit API (Task 33)
2553
+ 11. Bug fixes (Task 44)
2554
+ 12. Public launch (Task 54)
2555
+
2556
+ **Critical path duration:** ~93 hours (~12 weeks at 8h/week with 1 person on critical path)
2557
+
2558
+ **With parallel teams (3 people):**
2559
+ - Critical path person: 93 hours
2560
+ - Arena team: 63 hours (Phase 3)
2561
+ - UI/Validation team: 31 hours (Phase 5)
2562
+ - **Total calendar time: ~14 weeks (3.5 months)**
2563
+
2564
+ ---
2565
+
2566
+ **End of PRD**
2567
+
2568
+ *This PRD is optimized for taskmaster AI task generation. All requirements include task breakdown hints, complexity estimates, and dependency mapping to enable effective automated task planning.*
2569
+
2570
+ **Ready for Implementation:** Yes
2571
+ **Next Steps:** Review PRD → Approve → taskmaster init → taskmaster generate → Begin Phase 1 (Foundation)