claude-code-pilot 3.0.0 → 3.1.0

Files changed (124)
  1. package/README.md +76 -97
  2. package/bin/install.js +13 -13
  3. package/manifest.json +1 -1
  4. package/package.json +1 -1
  5. package/src/agents/doc-updater.md +1 -1
  6. package/src/agents/gan-evaluator.md +209 -0
  7. package/src/agents/gan-generator.md +131 -0
  8. package/src/agents/gan-planner.md +99 -0
  9. package/src/agents/harness-optimizer.md +35 -0
  10. package/src/agents/loop-operator.md +36 -0
  11. package/src/agents/opensource-forker.md +198 -0
  12. package/src/agents/opensource-packager.md +249 -0
  13. package/src/agents/opensource-sanitizer.md +188 -0
  14. package/src/agents/performance-optimizer.md +446 -0
  15. package/src/available-rules/README.md +1 -1
  16. package/src/commands/{aside.md → ccp/aside.md} +14 -13
  17. package/src/commands/{build-fix.md → ccp/build-fix.md} +5 -0
  18. package/src/commands/{checkpoint.md → ccp/checkpoint.md} +12 -7
  19. package/src/commands/{code-review.md → ccp/code-review.md} +5 -0
  20. package/src/commands/{context-budget.md → ccp/context-budget.md} +2 -1
  21. package/src/commands/{cpp-build.md → ccp/cpp-build.md} +6 -5
  22. package/src/commands/{cpp-review.md → ccp/cpp-review.md} +7 -6
  23. package/src/commands/{cpp-test.md → ccp/cpp-test.md} +6 -5
  24. package/src/commands/ccp/docs-update.md +48 -0
  25. package/src/commands/{docs.md → ccp/docs.md} +4 -3
  26. package/src/commands/{e2e.md → ccp/e2e.md} +7 -6
  27. package/src/commands/{eval.md → ccp/eval.md} +10 -5
  28. package/src/commands/{evolve.md → ccp/evolve.md} +3 -3
  29. package/src/commands/{go-build.md → ccp/go-build.md} +6 -5
  30. package/src/commands/{go-review.md → ccp/go-review.md} +7 -6
  31. package/src/commands/{go-test.md → ccp/go-test.md} +6 -5
  32. package/src/commands/{gradle-build.md → ccp/gradle-build.md} +1 -0
  33. package/src/commands/{harness-audit.md → ccp/harness-audit.md} +6 -1
  34. package/src/commands/{kotlin-build.md → ccp/kotlin-build.md} +6 -5
  35. package/src/commands/{kotlin-review.md → ccp/kotlin-review.md} +7 -6
  36. package/src/commands/{kotlin-test.md → ccp/kotlin-test.md} +6 -5
  37. package/src/commands/{learn.md → ccp/learn.md} +7 -2
  38. package/src/commands/{model-route.md → ccp/model-route.md} +6 -1
  39. package/src/commands/{orchestrate.md → ccp/orchestrate.md} +4 -3
  40. package/src/commands/{plan.md → ccp/plan.md} +6 -5
  41. package/src/commands/ccp/profile-user.md +46 -0
  42. package/src/commands/{prompt-optimize.md → ccp/prompt-optimize.md} +3 -2
  43. package/src/commands/{prune.md → ccp/prune.md} +4 -4
  44. package/src/commands/{python-review.md → ccp/python-review.md} +7 -6
  45. package/src/commands/{quality-gate.md → ccp/quality-gate.md} +6 -1
  46. package/src/commands/{refactor-clean.md → ccp/refactor-clean.md} +5 -0
  47. package/src/commands/{resume-session.md → ccp/resume-session.md} +9 -8
  48. package/src/commands/ccp/review.md +37 -0
  49. package/src/commands/{rules-distill.md → ccp/rules-distill.md} +2 -1
  50. package/src/commands/{rust-build.md → ccp/rust-build.md} +6 -5
  51. package/src/commands/{rust-review.md → ccp/rust-review.md} +7 -6
  52. package/src/commands/{rust-test.md → ccp/rust-test.md} +6 -5
  53. package/src/commands/{save-session.md → ccp/save-session.md} +2 -1
  54. package/src/commands/ccp/secure-phase.md +35 -0
  55. package/src/commands/{sessions.md → ccp/sessions.md} +29 -24
  56. package/src/commands/{setup-pm.md → ccp/setup-pm.md} +1 -0
  57. package/src/commands/{setup-refresh.md → ccp/setup-refresh.md} +4 -3
  58. package/src/commands/{setup.md → ccp/setup.md} +24 -23
  59. package/src/commands/{skill-create.md → ccp/skill-create.md} +8 -8
  60. package/src/commands/{skill-health.md → ccp/skill-health.md} +5 -5
  61. package/src/commands/{tdd.md → ccp/tdd.md} +9 -8
  62. package/src/commands/{test-coverage.md → ccp/test-coverage.md} +5 -0
  63. package/src/commands/{tool-guide.md → ccp/tool-guide.md} +2 -1
  64. package/src/commands/{update-codemaps.md → ccp/update-codemaps.md} +5 -0
  65. package/src/commands/{update-docs.md → ccp/update-docs.md} +5 -0
  66. package/src/commands/{verify.md → ccp/verify.md} +5 -0
  67. package/src/commands/ccp/workstreams.md +68 -0
  68. package/src/examples/CLAUDE.md +4 -4
  69. package/src/examples/django-api-CLAUDE.md +5 -5
  70. package/src/examples/go-microservice-CLAUDE.md +6 -6
  71. package/src/examples/rust-api-CLAUDE.md +4 -4
  72. package/src/examples/saas-nextjs-CLAUDE.md +8 -8
  73. package/src/hooks/session-start.js +1 -1
  74. package/src/pilot/references/mcp-servers.json +1 -1
  75. package/src/pilot/workflows/docs-update.md +1165 -0
  76. package/src/pilot/workflows/help.md +48 -56
  77. package/src/pilot/workflows/profile-user.md +452 -0
  78. package/src/pilot/workflows/review.md +244 -0
  79. package/src/pilot/workflows/secure-phase.md +164 -0
  80. package/src/rules/common/code-review.md +124 -0
  81. package/src/rules/zh/README.md +108 -0
  82. package/src/rules/zh/agents.md +50 -0
  83. package/src/rules/zh/code-review.md +124 -0
  84. package/src/rules/zh/coding-style.md +48 -0
  85. package/src/rules/zh/development-workflow.md +44 -0
  86. package/src/rules/zh/git-workflow.md +24 -0
  87. package/src/rules/zh/hooks.md +30 -0
  88. package/src/rules/zh/patterns.md +31 -0
  89. package/src/rules/zh/performance.md +55 -0
  90. package/src/rules/zh/security.md +29 -0
  91. package/src/rules/zh/testing.md +29 -0
  92. package/src/skills/autonomous-agent-harness/SKILL.md +267 -0
  93. package/src/skills/autonomous-loops/SKILL.md +610 -0
  94. package/src/skills/bun-runtime/SKILL.md +84 -0
  95. package/src/skills/content-hash-cache-pattern/SKILL.md +161 -0
  96. package/src/skills/context-budget/SKILL.md +3 -3
  97. package/src/skills/continuous-learning-v2/SKILL.md +4 -4
  98. package/src/skills/continuous-learning-v2/agents/observer.md +1 -1
  99. package/src/skills/cost-aware-llm-pipeline/SKILL.md +183 -0
  100. package/src/skills/design-system/SKILL.md +82 -0
  101. package/src/skills/eval-harness/SKILL.md +270 -0
  102. package/src/skills/flutter-dart-code-review/SKILL.md +435 -0
  103. package/src/skills/gan-style-harness/SKILL.md +278 -0
  104. package/src/skills/git-workflow/SKILL.md +715 -0
  105. package/src/skills/hexagonal-architecture/SKILL.md +276 -0
  106. package/src/skills/iterative-retrieval/SKILL.md +211 -0
  107. package/src/skills/laravel-plugin-discovery/SKILL.md +229 -0
  108. package/src/skills/nextjs-turbopack/SKILL.md +44 -0
  109. package/src/skills/nuxt4-patterns/SKILL.md +100 -0
  110. package/src/skills/opensource-pipeline/SKILL.md +255 -0
  111. package/src/skills/perl-security/SKILL.md +503 -0
  112. package/src/skills/project-flow-ops/SKILL.md +111 -0
  113. package/src/skills/project-guidelines-example/SKILL.md +349 -0
  114. package/src/skills/prompt-optimizer/SKILL.md +38 -38
  115. package/src/skills/pytorch-patterns/SKILL.md +396 -0
  116. package/src/skills/regex-vs-llm-structured-text/SKILL.md +220 -0
  117. package/src/skills/repo-scan/SKILL.md +78 -0
  118. package/src/skills/rules-distill/SKILL.md +264 -0
  119. package/src/skills/rules-distill/scripts/scan-rules.sh +58 -0
  120. package/src/skills/rules-distill/scripts/scan-skills.sh +129 -0
  121. package/src/skills/swift-concurrency-6-2/SKILL.md +216 -0
  122. package/src/skills/token-budget-advisor/SKILL.md +133 -0
  123. package/src/skills/verification-loop/SKILL.md +1 -1
  124. package/src/skills/workspace-surface-audit/SKILL.md +125 -0
@@ -0,0 +1,270 @@
1
+ ---
2
+ name: eval-harness
3
+ description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
4
+ origin: ECC
5
+ tools: Read, Write, Edit, Bash, Grep, Glob
6
+ ---
7
+
8
+ # Eval Harness Skill
9
+
10
+ A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
11
+
12
+ ## When to Activate
13
+
14
+ - Setting up eval-driven development (EDD) for AI-assisted workflows
15
+ - Defining pass/fail criteria for Claude Code task completion
16
+ - Measuring agent reliability with pass@k metrics
17
+ - Creating regression test suites for prompt or agent changes
18
+ - Benchmarking agent performance across model versions
19
+
20
+ ## Philosophy
21
+
22
+ Eval-Driven Development treats evals as the "unit tests of AI development":
23
+ - Define expected behavior BEFORE implementation
24
+ - Run evals continuously during development
25
+ - Track regressions with each change
26
+ - Use pass@k metrics for reliability measurement
27
+
28
+ ## Eval Types
29
+
30
+ ### Capability Evals
31
+ Test if Claude can do something it couldn't before:
32
+ ```markdown
33
+ [CAPABILITY EVAL: feature-name]
34
+ Task: Description of what Claude should accomplish
35
+ Success Criteria:
36
+ - [ ] Criterion 1
37
+ - [ ] Criterion 2
38
+ - [ ] Criterion 3
39
+ Expected Output: Description of expected result
40
+ ```
41
+
42
+ ### Regression Evals
43
+ Ensure changes don't break existing functionality:
44
+ ```markdown
45
+ [REGRESSION EVAL: feature-name]
46
+ Baseline: SHA or checkpoint name
47
+ Tests:
48
+ - existing-test-1: PASS/FAIL
49
+ - existing-test-2: PASS/FAIL
50
+ - existing-test-3: PASS/FAIL
51
+ Result: X/Y passed (previously Y/Y)
52
+ ```
53
+
54
+ ## Grader Types
55
+
56
+ ### 1. Code-Based Grader
57
+ Deterministic checks using code:
58
+ ```bash
59
+ # Check if file contains expected pattern
60
+ grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
61
+
62
+ # Check if tests pass
63
+ npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
64
+
65
+ # Check if build succeeds
66
+ npm run build && echo "PASS" || echo "FAIL"
67
+ ```
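The shell one-liners above can also be wrapped in a small script when results need to be collected programmatically. The following is a minimal sketch, not part of the skill: the helper names `file_contains`, `command_succeeds`, and `verdict` are hypothetical, and the demo uses a throwaway file and a trivial Python command in place of a real `src/auth.ts` or `npm` invocation.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def file_contains(path: Path, pattern: str) -> bool:
    """Return True if the file is readable and its text matches the regex."""
    try:
        text = path.read_text(encoding="utf-8")
    except OSError:
        return False
    return re.search(pattern, text) is not None


def command_succeeds(argv: list[str], timeout: float = 300.0) -> bool:
    """Return True if the command exits with status 0 within the timeout."""
    try:
        completed = subprocess.run(argv, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0


def verdict(ok: bool) -> str:
    """Map a boolean check result to the PASS/FAIL labels used by the evals."""
    return "PASS" if ok else "FAIL"


# Demo against a temporary file and a command that always exits 0.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "auth.ts"
    sample.write_text("export function handleAuth() {}\n", encoding="utf-8")
    print(verdict(file_contains(sample, r"export function handleAuth")))
print(verdict(command_succeeds([sys.executable, "-c", "raise SystemExit(0)"])))
```

In a real project the second check would be given the project's own test or build command as an argument list, for example `["npm", "run", "build"]`.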
68
+
69
+ ### 2. Model-Based Grader
70
+ Use Claude to evaluate open-ended outputs:
71
+ ```markdown
72
+ [MODEL GRADER PROMPT]
73
+ Evaluate the following code change:
74
+ 1. Does it solve the stated problem?
75
+ 2. Is it well-structured?
76
+ 3. Are edge cases handled?
77
+ 4. Is error handling appropriate?
78
+
79
+ Score: 1-5 (1=poor, 5=excellent)
80
+ Reasoning: [explanation]
81
+ ```
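A model grader's reply still has to be turned into a pass/fail signal. The sketch below assumes the reply follows the `Score:` / `Reasoning:` layout of the prompt above; the `Grade` type, the `parse_grade` and `passes` helpers, and the minimum passing score of 4 are all illustrative assumptions rather than anything the skill defines.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Grade:
    score: int
    reasoning: str


def parse_grade(response: str) -> Grade:
    """Extract the 1-5 score and the free-text reasoning from a grader reply."""
    score_match = re.search(r"^Score:\s*([1-5])\s*$", response, re.MULTILINE)
    if score_match is None:
        raise ValueError("reply has no 'Score: N' line with N between 1 and 5")
    reason_match = re.search(
        r"^Reasoning:\s*(.+)", response, re.MULTILINE | re.DOTALL
    )
    reasoning = reason_match.group(1).strip() if reason_match else ""
    return Grade(int(score_match.group(1)), reasoning)


def passes(grade: Grade, minimum: int = 4) -> bool:
    """Treat scores at or above the (assumed) minimum as a PASS."""
    return grade.score >= minimum


reply = "Score: 4\nReasoning: Handles empty input.\nNo tests for timeouts."
grade = parse_grade(reply)
print(grade.score, "PASS" if passes(grade) else "FAIL")
```

Rejecting malformed replies with an exception, instead of defaulting to a score, keeps a misbehaving grader from silently passing an eval.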
82
+
83
+ ### 3. Human Grader
84
+ Flag for manual review:
85
+ ```markdown
86
+ [HUMAN REVIEW REQUIRED]
87
+ Change: Description of what changed
88
+ Reason: Why human review is needed
89
+ Risk Level: LOW/MEDIUM/HIGH
90
+ ```
91
+
92
+ ## Metrics
93
+
94
+ ### pass@k
95
+ "At least one success in k attempts"
96
+ - pass@1: First attempt success rate
97
+ - pass@3: Success within 3 attempts
98
+ - Typical target: pass@3 > 90%
99
+
100
+ ### pass^k
101
+ "All k trials succeed"
102
+ - Higher bar for reliability
103
+ - pass^3: 3 consecutive successes
104
+ - Use for critical paths
105
+
106
+ ## Eval Workflow
107
+
108
+ ### 1. Define (Before Coding)
109
+ ```markdown
110
+ ## EVAL DEFINITION: feature-xyz
111
+
112
+ ### Capability Evals
113
+ 1. Can create new user account
114
+ 2. Can validate email format
115
+ 3. Can hash password securely
116
+
117
+ ### Regression Evals
118
+ 1. Existing login still works
119
+ 2. Session management unchanged
120
+ 3. Logout flow intact
121
+
122
+ ### Success Metrics
123
+ - pass@3 > 90% for capability evals
124
+ - pass^3 = 100% for regression evals
125
+ ```
126
+
127
+ ### 2. Implement
128
+ Write code to pass the defined evals.
129
+
130
+ ### 3. Evaluate
131
+ ```bash
132
+ # Run capability evals
133
+ # (manual step: run each capability eval and record PASS/FAIL)
134
+
135
+ # Run regression evals
136
+ npm test -- --testPathPattern="existing"
137
+
138
+ # Generate report
139
+ ```
140
+
141
+ ### 4. Report
142
+ ```markdown
143
+ EVAL REPORT: feature-xyz
144
+ ========================
145
+
146
+ Capability Evals:
147
+ create-user: PASS (pass@1)
148
+ validate-email: PASS (pass@2)
149
+ hash-password: PASS (pass@1)
150
+ Overall: 3/3 passed
151
+
152
+ Regression Evals:
153
+ login-flow: PASS
154
+ session-mgmt: PASS
155
+ logout-flow: PASS
156
+ Overall: 3/3 passed
157
+
158
+ Metrics:
159
+ pass@1: 67% (2/3)
160
+ pass@3: 100% (3/3)
161
+
162
+ Status: READY FOR REVIEW
163
+ ```
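The `Metrics` lines in a report like the one above follow directly from the attempt on which each capability eval first passed. A rough sketch, assuming that number is recorded per eval (with `None` for an eval that never passed); the `pass_at` and `metric_line` helpers are hypothetical:

```python
from typing import Mapping, Optional


def pass_at(first_pass: Mapping[str, Optional[int]], k: int) -> tuple[int, int]:
    """Count evals whose first passing attempt was attempt k or earlier."""
    if not first_pass:
        raise ValueError("no evals recorded")
    passed = sum(
        1 for attempt in first_pass.values()
        if attempt is not None and attempt <= k
    )
    return passed, len(first_pass)


def metric_line(first_pass: Mapping[str, Optional[int]], k: int) -> str:
    """Format one 'pass@k: NN% (x/y)' line for the report."""
    passed, total = pass_at(first_pass, k)
    return f"pass@{k}: {passed / total:.0%} ({passed}/{total})"


# First passing attempt per capability eval, taken from the report above.
first_pass = {"create-user": 1, "validate-email": 2, "hash-password": 1}
print(metric_line(first_pass, 1))  # pass@1: 67% (2/3)
print(metric_line(first_pass, 3))  # pass@3: 100% (3/3)
```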
164
+
165
+ ## Integration Patterns
166
+
167
+ ### Pre-Implementation
168
+ ```
169
+ /eval define feature-name
170
+ ```
171
+ Creates eval definition file at `.claude/evals/feature-name.md`
172
+
173
+ ### During Implementation
174
+ ```
175
+ /eval check feature-name
176
+ ```
177
+ Runs current evals and reports status
178
+
179
+ ### Post-Implementation
180
+ ```
181
+ /eval report feature-name
182
+ ```
183
+ Generates full eval report
184
+
185
+ ## Eval Storage
186
+
187
+ Store evals in project:
188
+ ```
189
+ .claude/
190
+ evals/
191
+ feature-xyz.md # Eval definition
192
+ feature-xyz.log # Eval run history
193
+ baseline.json # Regression baselines
194
+ ```
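The skill does not specify a format for the `.log` run history. One possible approach, shown here purely as an assumption, is to append one JSON object per run so the history stays easy to diff and to parse; `record_run` and the entry fields are hypothetical.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_run(project_root: Path, feature: str, results: dict[str, bool]) -> Path:
    """Append one eval run to .claude/evals/<feature>.log as a JSON line."""
    evals_dir = project_root / ".claude" / "evals"
    evals_dir.mkdir(parents=True, exist_ok=True)
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    log_path = evals_dir / f"{feature}.log"
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
    return log_path


# Demo in a temporary directory standing in for the project root.
with tempfile.TemporaryDirectory() as tmp:
    path = record_run(Path(tmp), "feature-xyz", {"login-flow": True})
    print(path.relative_to(tmp).as_posix())
```

Because the log lives next to the eval definition, committing both keeps the run history versioned with the code it describes.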
195
+
196
+ ## Best Practices
197
+
198
+ 1. **Define evals BEFORE coding** - Forces clear thinking about success criteria
199
+ 2. **Run evals frequently** - Catch regressions early
200
+ 3. **Track pass@k over time** - Monitor reliability trends
201
+ 4. **Use code graders when possible** - Deterministic > probabilistic
202
+ 5. **Human review for security** - Never fully automate security checks
203
+ 6. **Keep evals fast** - Slow evals don't get run
204
+ 7. **Version evals with code** - Evals are first-class artifacts
205
+
206
+ ## Example: Adding Authentication
207
+
208
+ ```markdown
209
+ ## EVAL: add-authentication
210
+
211
+ ### Phase 1: Define (10 min)
212
+ Capability Evals:
213
+ - [ ] User can register with email/password
214
+ - [ ] User can login with valid credentials
215
+ - [ ] Invalid credentials rejected with proper error
216
+ - [ ] Sessions persist across page reloads
217
+ - [ ] Logout clears session
218
+
219
+ Regression Evals:
220
+ - [ ] Public routes still accessible
221
+ - [ ] API responses unchanged
222
+ - [ ] Database schema compatible
223
+
224
+ ### Phase 2: Implement (varies)
225
+ [Write code]
226
+
227
+ ### Phase 3: Evaluate
228
+ Run: /eval check add-authentication
229
+
230
+ ### Phase 4: Report
231
+ EVAL REPORT: add-authentication
232
+ ==============================
233
+ Capability: 5/5 passed (pass@3: 100%)
234
+ Regression: 3/3 passed (pass^3: 100%)
235
+ Status: SHIP IT
236
+ ```
237
+
238
+ ## Product Evals (v1.8)
239
+
240
+ Use product evals when behavior quality cannot be captured by unit tests alone.
241
+
242
+ ### Grader Types
243
+
244
+ 1. Code grader (deterministic assertions)
245
+ 2. Rule grader (regex/schema constraints)
246
+ 3. Model grader (LLM-as-judge rubric)
247
+ 4. Human grader (manual adjudication for ambiguous outputs)
248
+
249
+ ### pass@k Guidance
250
+
251
+ - `pass@1`: direct reliability
252
+ - `pass@3`: practical reliability under controlled retries
253
+ - `pass^3`: stability test (all 3 runs must pass)
254
+
255
+ Recommended thresholds:
256
+ - Capability evals: pass@3 >= 0.90
257
+ - Regression evals: pass^3 = 1.00 for release-critical paths
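The recommended thresholds translate into a simple release gate. The sketch below hard-codes the two values from the list above; the `release_gate` function and its message wording are illustrative only.

```python
CAPABILITY_PASS_AT_3_MIN = 0.90  # capability evals: pass@3 >= 0.90
REGRESSION_PASS_HAT_3_MIN = 1.00  # regression evals: pass^3 = 1.00


def release_gate(capability_pass_at_3: float, regression_pass_hat_3: float) -> list[str]:
    """Return the list of threshold violations; an empty list means the gate passes."""
    failures = []
    if capability_pass_at_3 < CAPABILITY_PASS_AT_3_MIN:
        failures.append(
            f"capability pass@3 {capability_pass_at_3:.2f} is below "
            f"{CAPABILITY_PASS_AT_3_MIN:.2f}"
        )
    if regression_pass_hat_3 < REGRESSION_PASS_HAT_3_MIN:
        failures.append(
            f"regression pass^3 {regression_pass_hat_3:.2f} is below "
            f"{REGRESSION_PASS_HAT_3_MIN:.2f}"
        )
    return failures


problems = release_gate(capability_pass_at_3=0.93, regression_pass_hat_3=1.00)
print("GATE PASSED" if not problems else "GATE FAILED: " + "; ".join(problems))
```

Returning the violations instead of a bare boolean lets the report state which threshold blocked the release.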
258
+
259
+ ### Eval Anti-Patterns
260
+
261
+ - Overfitting prompts to known eval examples
262
+ - Measuring only happy-path outputs
263
+ - Ignoring cost and latency drift while chasing pass rates
264
+ - Allowing flaky graders in release gates
265
+
266
+ ### Minimal Eval Artifact Layout
267
+
268
+ - `.claude/evals/<feature>.md` definition
269
+ - `.claude/evals/<feature>.log` run history
270
+ - `docs/releases/<version>/eval-summary.md` release snapshot