@boshu2/vibe-check 1.0.1 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (128)
  1. package/.agents/bundles/ml-learning-loop-complete-plan-2025-11-28.md +908 -0
  2. package/.agents/bundles/unified-vibe-system-plan-phase1-2025-11-28.md +962 -0
  3. package/.agents/bundles/unified-vibe-system-research-2025-11-28.md +1003 -0
  4. package/.agents/bundles/vibe-check-ecosystem-plan-2025-11-29.md +635 -0
  5. package/.agents/bundles/vibe-check-gamification-complete-2025-11-29.md +132 -0
  6. package/.agents/bundles/vibe-score-scientific-framework-2025-11-28.md +602 -0
  7. package/.vibe-check/calibration.json +38 -0
  8. package/.vibe-check/latest.json +114 -0
  9. package/CHANGELOG.md +46 -0
  10. package/CLAUDE.md +178 -0
  11. package/README.md +265 -63
  12. package/action.yml +270 -0
  13. package/dashboard/app.js +494 -0
  14. package/dashboard/index.html +235 -0
  15. package/dashboard/styles.css +647 -0
  16. package/dist/calibration/ece.d.ts +26 -0
  17. package/dist/calibration/ece.d.ts.map +1 -0
  18. package/dist/calibration/ece.js +93 -0
  19. package/dist/calibration/ece.js.map +1 -0
  20. package/dist/calibration/index.d.ts +3 -0
  21. package/dist/calibration/index.d.ts.map +1 -0
  22. package/dist/calibration/index.js +15 -0
  23. package/dist/calibration/index.js.map +1 -0
  24. package/dist/calibration/storage.d.ts +34 -0
  25. package/dist/calibration/storage.d.ts.map +1 -0
  26. package/dist/calibration/storage.js +188 -0
  27. package/dist/calibration/storage.js.map +1 -0
  28. package/dist/cli.js +30 -76
  29. package/dist/cli.js.map +1 -1
  30. package/dist/commands/analyze.d.ts +16 -0
  31. package/dist/commands/analyze.d.ts.map +1 -0
  32. package/dist/commands/analyze.js +256 -0
  33. package/dist/commands/analyze.js.map +1 -0
  34. package/dist/commands/index.d.ts +4 -0
  35. package/dist/commands/index.d.ts.map +1 -0
  36. package/dist/commands/index.js +11 -0
  37. package/dist/commands/index.js.map +1 -0
  38. package/dist/commands/level.d.ts +3 -0
  39. package/dist/commands/level.d.ts.map +1 -0
  40. package/dist/commands/level.js +277 -0
  41. package/dist/commands/level.js.map +1 -0
  42. package/dist/commands/profile.d.ts +4 -0
  43. package/dist/commands/profile.d.ts.map +1 -0
  44. package/dist/commands/profile.js +143 -0
  45. package/dist/commands/profile.js.map +1 -0
  46. package/dist/gamification/achievements.d.ts +15 -0
  47. package/dist/gamification/achievements.d.ts.map +1 -0
  48. package/dist/gamification/achievements.js +273 -0
  49. package/dist/gamification/achievements.js.map +1 -0
  50. package/dist/gamification/index.d.ts +8 -0
  51. package/dist/gamification/index.d.ts.map +1 -0
  52. package/dist/gamification/index.js +30 -0
  53. package/dist/gamification/index.js.map +1 -0
  54. package/dist/gamification/profile.d.ts +46 -0
  55. package/dist/gamification/profile.d.ts.map +1 -0
  56. package/dist/gamification/profile.js +272 -0
  57. package/dist/gamification/profile.js.map +1 -0
  58. package/dist/gamification/streaks.d.ts +26 -0
  59. package/dist/gamification/streaks.d.ts.map +1 -0
  60. package/dist/gamification/streaks.js +132 -0
  61. package/dist/gamification/streaks.js.map +1 -0
  62. package/dist/gamification/types.d.ts +111 -0
  63. package/dist/gamification/types.d.ts.map +1 -0
  64. package/dist/gamification/types.js +26 -0
  65. package/dist/gamification/types.js.map +1 -0
  66. package/dist/gamification/xp.d.ts +37 -0
  67. package/dist/gamification/xp.d.ts.map +1 -0
  68. package/dist/gamification/xp.js +115 -0
  69. package/dist/gamification/xp.js.map +1 -0
  70. package/dist/git.d.ts +11 -0
  71. package/dist/git.d.ts.map +1 -1
  72. package/dist/git.js +52 -0
  73. package/dist/git.js.map +1 -1
  74. package/dist/metrics/code-stability.d.ts +13 -0
  75. package/dist/metrics/code-stability.d.ts.map +1 -0
  76. package/dist/metrics/code-stability.js +74 -0
  77. package/dist/metrics/code-stability.js.map +1 -0
  78. package/dist/metrics/file-churn.d.ts +8 -0
  79. package/dist/metrics/file-churn.d.ts.map +1 -0
  80. package/dist/metrics/file-churn.js +75 -0
  81. package/dist/metrics/file-churn.js.map +1 -0
  82. package/dist/metrics/time-spiral.d.ts +8 -0
  83. package/dist/metrics/time-spiral.d.ts.map +1 -0
  84. package/dist/metrics/time-spiral.js +69 -0
  85. package/dist/metrics/time-spiral.js.map +1 -0
  86. package/dist/metrics/velocity-anomaly.d.ts +13 -0
  87. package/dist/metrics/velocity-anomaly.d.ts.map +1 -0
  88. package/dist/metrics/velocity-anomaly.js +67 -0
  89. package/dist/metrics/velocity-anomaly.js.map +1 -0
  90. package/dist/output/index.d.ts +6 -3
  91. package/dist/output/index.d.ts.map +1 -1
  92. package/dist/output/index.js +4 -3
  93. package/dist/output/index.js.map +1 -1
  94. package/dist/output/json.d.ts +2 -2
  95. package/dist/output/json.d.ts.map +1 -1
  96. package/dist/output/json.js +54 -0
  97. package/dist/output/json.js.map +1 -1
  98. package/dist/output/markdown.d.ts +2 -2
  99. package/dist/output/markdown.d.ts.map +1 -1
  100. package/dist/output/markdown.js +34 -1
  101. package/dist/output/markdown.js.map +1 -1
  102. package/dist/output/terminal.d.ts +6 -2
  103. package/dist/output/terminal.d.ts.map +1 -1
  104. package/dist/output/terminal.js +131 -3
  105. package/dist/output/terminal.js.map +1 -1
  106. package/dist/recommend/index.d.ts +3 -0
  107. package/dist/recommend/index.d.ts.map +1 -0
  108. package/dist/recommend/index.js +14 -0
  109. package/dist/recommend/index.js.map +1 -0
  110. package/dist/recommend/ordered-logistic.d.ts +49 -0
  111. package/dist/recommend/ordered-logistic.d.ts.map +1 -0
  112. package/dist/recommend/ordered-logistic.js +153 -0
  113. package/dist/recommend/ordered-logistic.js.map +1 -0
  114. package/dist/recommend/questions.d.ts +19 -0
  115. package/dist/recommend/questions.d.ts.map +1 -0
  116. package/dist/recommend/questions.js +73 -0
  117. package/dist/recommend/questions.js.map +1 -0
  118. package/dist/score/index.d.ts +21 -0
  119. package/dist/score/index.d.ts.map +1 -0
  120. package/dist/score/index.js +48 -0
  121. package/dist/score/index.js.map +1 -0
  122. package/dist/score/weights.d.ts +16 -0
  123. package/dist/score/weights.d.ts.map +1 -0
  124. package/dist/score/weights.js +28 -0
  125. package/dist/score/weights.js.map +1 -0
  126. package/dist/types.d.ts +83 -0
  127. package/dist/types.d.ts.map +1 -1
  128. package/package.json +10 -9
@@ -0,0 +1,1003 @@
1
+ # Unified Vibe System: Scientific Framework
2
+
3
+ **Type:** Research (Merged)
4
+ **Created:** 2025-11-28
5
+ **Loop:** Outer (foundational architecture)
6
+ **Tags:** vibe-coding, vibe-levels, scoring-algorithm, machine-learning, scientific-validation, feedback-loop
7
+ **Sources:**
8
+ - `vibe-level-scoring-algorithm-research-2025-11-28.md`
9
+ - `vibe-score-scientific-framework-2025-11-28.md`
10
+
11
+ ---
12
+
13
+ ## Executive Summary
14
+
15
+ A complete, scientifically rigorous system for vibe-coding that:
16
+
17
+ 1. **Recommends** an appropriate trust level (0-5) before work starts
18
+ 2. **Measures** actual coding health from git history (0.0-1.0 score)
19
+ 3. **Calibrates** both models via feedback loop
20
+ 4. **Works universally** - no semantic commits required
21
+
22
+ **Key innovation:** Two complementary algorithms that feed each other, creating a self-improving system.
23
+
24
+ ---
25
+
26
+ ## System Architecture
27
+
28
+ ```
29
+ ┌─────────────────────────────────────────────────────────────────────────┐
30
+ │ UNIFIED VIBE SYSTEM │
31
+ ├─────────────────────────────────────────────────────────────────────────┤
32
+ │ │
33
+ │ ┌───────────────────────────────────────────────────────────────────┐ │
34
+ │ │ PHASE 1: PRE-SESSION │ │
35
+ │ │ │ │
36
+ │ │ User Input Level Recommender │ │
37
+ │ │ ────────── ───────────────── │ │
38
+ │ │ 5 Questions: Algorithm: Ordered Logistic │ │
39
+ │ │ • Reversibility Inputs: Questions + History │ │
40
+ │ │ • Blast radius Output: Level 0-5 + Confidence │ │
41
+ │ │ • Verification cost │ │
42
+ │ │ • Domain complexity "Recommend Level 3 (82% conf)" │ │
43
+ │ │ • AI track record │ │
44
+ │ │ │ │
45
+ │ └───────────────────────────────────────────────────────────────────┘ │
46
+ │ │ │
47
+ │ ▼ │
48
+ │ ┌───────────────────────────────────────────────────────────────────┐ │
49
+ │ │ PHASE 2: WORK SESSION │ │
50
+ │ │ │ │
51
+ │ │ Developer/AI works at recommended trust level │ │
52
+ │ │ Git commits accumulate with natural patterns │ │
53
+ │ │ │ │
54
+ │ └───────────────────────────────────────────────────────────────────┘ │
55
+ │ │ │
56
+ │ ▼ │
57
+ │ ┌───────────────────────────────────────────────────────────────────┐ │
58
+ │ │ PHASE 3: POST-SESSION │ │
59
+ │ │ │ │
60
+ │ │ Git History Vibe Score Calculator │ │
61
+ │ │ ─────────── ───────────────────── │ │
62
+ │ │ • File changes Algorithm: Weighted Composite │ │
63
+ │ │ • Timestamps Inputs: Pure git signals │ │
64
+ │ │ • Patterns Output: Score 0.0-1.0 │ │
65
+ │ │ │ │
66
+ │ │ NO SEMANTIC COMMITS NEEDED "Session score: 0.72" │ │
67
+ │ │ │ │
68
+ │ └───────────────────────────────────────────────────────────────────┘ │
69
+ │ │ │
70
+ │ ▼ │
71
+ │ ┌───────────────────────────────────────────────────────────────────┐ │
72
+ │ │ PHASE 4: CALIBRATION │ │
73
+ │ │ │ │
74
+ │ │ Calibration Engine │ │
75
+ │ │ ────────────────── │ │
76
+ │ │ Compare: Score vs Expected for Level │ │
77
+ │ │ • Level 3 expected: 0.65-0.80 │ │
78
+ │ │ • Actual score: 0.72 │ │
79
+ │ │ • Assessment: CORRECT ✓ │ │
80
+ │ │ │ │
81
+ │ │ Update both models: │ │
82
+ │ │ • Level Recommender weights (Bayesian posterior) │ │
83
+ │ │ • Vibe Score weights (ECE optimization) │ │
84
+ │ │ │ │
85
+ │ └───────────────────────────────────────────────────────────────────┘ │
86
+ │ │ │
87
+ │ ▼ │
88
+ │ IMPROVED MODELS FOR NEXT SESSION │
89
+ │ │
90
+ └─────────────────────────────────────────────────────────────────────────┘
91
+ ```
92
+
93
+ ---
94
+
95
+ ## Component 1: Level Recommender
96
+
97
+ ### Purpose
98
+ Recommend an appropriate trust level (0-5) **before** work begins.
99
+
100
+ ### Algorithm: Ordered Logistic Regression
101
+
102
+ **Why ordered (not multinomial):**
103
+ - Respects ordinal structure: 0 < 1 < 2 < 3 < 4 < 5
104
+ - Level 2 is "between" 1 and 3 (not just a different category)
105
+ - Fewer parameters, more stable with small data
106
+ - Gold standard in statistics for rating scales
107
+
108
+ **Mathematical Model:**
109
+ ```
110
+ P(Y ≤ k) = σ(θₖ - Xβ)
111
+
112
+ Where:
113
+ σ = sigmoid function
114
+ θₖ = threshold for level k (5 thresholds learned)
115
+ X = input features (14 total)
116
+ β = feature weights (14 weights learned)
117
+ ```
118
+
119
+ ### Input Features (14 total)
120
+
121
+ **5 Questions (user-provided):**
122
+ | Question | Range | Meaning |
123
+ |----------|-------|---------|
124
+ | Q1: Reversibility | -2 to +1 | Can we undo mistakes? |
125
+ | Q2: Blast radius | -2 to +1 | How much breaks if wrong? |
126
+ | Q3: Verification cost | -2 to +1 | How hard to check correctness? |
127
+ | Q4: Domain complexity | -2 to +1 | How novel is this domain? |
128
+ | Q5: AI track record | -2 to +1 | Historical success in this area? |
129
+
130
+ **5 Current Metrics (from existing vibe-check):**
131
+ | Metric | Range | Source |
132
+ |--------|-------|--------|
133
+ | Trust Pass Rate | 0-100% | Semantic commits |
134
+ | Rework Ratio | 0-100% | Semantic commits |
135
+ | Debug Spiral Duration | 0-∞ min | Semantic commits |
136
+ | Flow Efficiency | 0-100% | Derived |
137
+ | Iteration Velocity | 0-∞ commits/hr | Timestamps |
138
+
139
+ **4 New Metrics (semantic-commit-free):**
140
+ | Metric | Range | Source |
141
+ |--------|-------|--------|
142
+ | File Churn Score | 0-1 | File touch patterns |
143
+ | Time Spiral Score | 0-1 | Commit timing clusters |
144
+ | Velocity Anomaly Score | 0-1 | Z-score vs baseline |
145
+ | Code Stability Score | 0-1 | Line survival analysis |
146
+
147
+ ### Output
148
+ ```typescript
149
+ interface LevelRecommendation {
150
+ level: 0 | 1 | 2 | 3 | 4 | 5;
151
+ confidence: number; // 0-1, max probability
152
+ probabilities: number[]; // [p0, p1, p2, p3, p4, p5]
153
+ credibleInterval: [number, number]; // 95% CI
154
+ }
155
+
156
+ // Example output:
157
+ {
158
+ level: 3,
159
+ confidence: 0.82,
160
+ probabilities: [0.01, 0.03, 0.08, 0.82, 0.05, 0.01],
161
+ credibleInterval: [2.4, 3.6]
162
+ }
163
+ ```
164
+
165
+ ### Implementation
166
+
167
+ ```typescript
168
+ import { Matrix } from 'ml-matrix';
169
+
170
+ interface OrderedLogisticModel {
171
+ weights: number[]; // 14 feature weights
172
+ thresholds: number[]; // 5 level thresholds
173
+ priorMeans: number[]; // Bayesian prior means
174
+ priorVariances: number[]; // Bayesian prior variances
175
+ }
176
+
177
+ function recommendLevel(
178
+ questions: number[], // 5 values
179
+ currentMetrics: number[], // 5 values
180
+ newMetrics: number[], // 4 values
181
+ model: OrderedLogisticModel
182
+ ): LevelRecommendation {
183
+ // Combine all features
184
+ const features = [...questions, ...currentMetrics, ...newMetrics];
185
+
186
+ // Compute linear predictor
187
+ const eta = dotProduct(features, model.weights);
188
+
189
+ // Compute cumulative probabilities
190
+ const cumProbs = model.thresholds.map(t => sigmoid(t - eta));
191
+
192
+ // Convert to level probabilities
193
+ const probs = [
194
+ cumProbs[0],
195
+ cumProbs[1] - cumProbs[0],
196
+ cumProbs[2] - cumProbs[1],
197
+ cumProbs[3] - cumProbs[2],
198
+ cumProbs[4] - cumProbs[3],
199
+ 1 - cumProbs[4]
200
+ ];
201
+
202
+ // Find most likely level
203
+ const level = probs.indexOf(Math.max(...probs)) as 0|1|2|3|4|5;
204
+ const confidence = Math.max(...probs);
205
+
206
+ // Compute credible interval from probability distribution
207
+ const mean = probs.reduce((sum, p, i) => sum + p * i, 0);
208
+ const variance = probs.reduce((sum, p, i) => sum + p * (i - mean) ** 2, 0);
209
+ const ci: [number, number] = [
210
+ Math.max(0, mean - 1.96 * Math.sqrt(variance)),
211
+ Math.min(5, mean + 1.96 * Math.sqrt(variance))
212
+ ];
213
+
214
+ return { level, confidence, probabilities: probs, credibleInterval: ci };
215
+ }
216
+ ```
217
+
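The implementation above assumes `sigmoid` and `dotProduct` helpers that are not shown. Minimal versions might look like this (illustrative sketches, not necessarily the package's own definitions):

```typescript
// Hypothetical helpers assumed by recommendLevel.
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("length mismatch");
  // Sum of pairwise products: a · b
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}
```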
218
+ ---
219
+
220
+ ## Component 2: Vibe Score Calculator
221
+
222
+ ### Purpose
223
+ Measure actual coding health **after** work session, using pure git signals (no semantic commits needed).
224
+
225
+ ### Algorithm: Weighted Composite Score
226
+
227
+ **Core Formula:**
228
+ ```
229
+ VibeScore = w₁×FileChurn + w₂×TimeSpiral + w₃×VelocityAnomaly + w₄×CodeStability
230
+
231
+ Where all component scores are 0-1 (higher = better)
232
+ Initial weights: w₁=0.30, w₂=0.25, w₃=0.20, w₄=0.25
233
+ Weights update via ECE optimization
234
+ ```
235
+
236
+ ### Component Metrics
237
+
238
+ #### 1. File Churn Score (0-1)
239
+
240
+ **What it measures:** Did code stick on first touch?
241
+
242
+ **Algorithm:**
243
+ ```typescript
244
+ function calculateFileChurnScore(commits: CommitWithFiles[]): number {
245
+ const fileTimestamps = new Map<string, Date[]>();
246
+
247
+ // Collect all touch timestamps per file
248
+ for (const commit of commits) {
249
+ for (const file of commit.files) {
250
+ const times = fileTimestamps.get(file) || [];
251
+ times.push(commit.date);
252
+ fileTimestamps.set(file, times);
253
+ }
254
+ }
255
+
256
+ let churnedFiles = 0;
257
+ const ONE_HOUR = 60 * 60 * 1000;
258
+
259
+ for (const [file, times] of fileTimestamps) {
260
+ const sorted = times.sort((a, b) => a.getTime() - b.getTime());
261
+
262
+ // Detect 3+ touches within 1 hour (spiral indicator)
263
+ for (let i = 0; i < sorted.length - 2; i++) {
264
+ const span = sorted[i + 2].getTime() - sorted[i].getTime();
265
+ if (span < ONE_HOUR) {
266
+ churnedFiles++;
267
+ break;
268
+ }
269
+ }
270
+ }
271
+
272
+ const churnRatio = fileTimestamps.size > 0
273
+ ? churnedFiles / fileTimestamps.size
274
+ : 0;
275
+
276
+ return 1 - churnRatio; // Invert: high score = low churn = good
277
+ }
278
+ ```
279
+
280
+ **Thresholds:**
281
+ | Churn Ratio | Score | Interpretation |
282
+ |-------------|-------|----------------|
283
+ | <10% | 0.90-1.0 | Elite - code sticks |
284
+ | 10-25% | 0.75-0.90 | High - minor iteration |
285
+ | 25-40% | 0.60-0.75 | Medium - notable thrashing |
286
+ | >40% | <0.60 | Low - significant spiral |
287
+
288
+ #### 2. Time Spiral Score (0-1)
289
+
290
+ **What it measures:** Are commits clustered in frustrated bursts?
291
+
292
+ **Algorithm:**
293
+ ```typescript
294
+ function calculateTimeSpiralScore(commits: Commit[]): number {
295
+ if (commits.length < 2) return 1.0;
296
+
297
+ const sorted = [...commits].sort((a, b) =>
298
+ a.date.getTime() - b.date.getTime()
299
+ );
300
+
301
+ let spiralCommits = 0;
302
+ const FIVE_MINUTES = 5 * 60 * 1000;
303
+
304
+ for (let i = 1; i < sorted.length; i++) {
305
+ const gap = sorted[i].date.getTime() - sorted[i - 1].date.getTime();
306
+ if (gap < FIVE_MINUTES) {
307
+ spiralCommits++;
308
+ }
309
+ }
310
+
311
+ const spiralRatio = spiralCommits / (commits.length - 1);
312
+ return 1 - spiralRatio; // Invert: high score = few spirals = good
313
+ }
314
+ ```
315
+
316
+ **Why 5 minutes?** Research shows productive commits average 15-30 min apart. <5 min typically indicates trial-and-error debugging.
317
+
318
+ #### 3. Velocity Anomaly Score (0-1)
319
+
320
+ **What it measures:** Is this pattern abnormal for this developer?
321
+
322
+ **Algorithm:**
323
+ ```typescript
324
+ interface Baseline {
325
+ mean: number; // commits/hour historical average
326
+ stdDev: number; // historical standard deviation
327
+ }
328
+
329
+ function calculateVelocityAnomalyScore(
330
+ commits: Commit[],
331
+ baseline: Baseline
332
+ ): number {
333
+ const hours = calculateActiveHours(commits);
334
+ const currentVelocity = hours > 0 ? commits.length / hours : 0;
335
+
336
+ // Z-score: how many std devs from personal mean
337
+ const zScore = baseline.stdDev > 0
338
+ ? Math.abs((currentVelocity - baseline.mean) / baseline.stdDev)
339
+ : 0;
340
+
341
+ // Sigmoid transform: z=0 → 0.82, z=2 → 0.38, z=3 → 0.18
342
+ return 1 / (1 + Math.exp(zScore - 1.5));
343
+ }
344
+ ```
345
+
346
+ **Why personal baseline?** Developers have different natural velocities. Anomaly detection catches *relative* changes, not arbitrary thresholds.
347
+
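`calculateActiveHours` is referenced above but never defined. One plausible sketch: sum inter-commit gaps, treating any gap over two hours as a session break (both the contract and the 2-hour cutoff are assumptions, not the package's actual implementation):

```typescript
interface Commit {
  date: Date;
}

// Hypothetical calculateActiveHours: total active time is the sum of
// inter-commit gaps shorter than a session-break threshold.
function calculateActiveHours(
  commits: Commit[],
  breakMs: number = 2 * 60 * 60 * 1000 // gaps >= 2h count as breaks
): number {
  if (commits.length < 2) return 0;
  const sorted = [...commits].sort((a, b) => a.date.getTime() - b.date.getTime());
  let activeMs = 0;
  for (let i = 1; i < sorted.length; i++) {
    const gap = sorted[i].date.getTime() - sorted[i - 1].date.getTime();
    if (gap < breakMs) activeMs += gap;
  }
  return activeMs / (60 * 60 * 1000);
}
```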
348
+ #### 4. Code Stability Score (0-1)
349
+
350
+ **What it measures:** How long do added lines survive?
351
+
352
+ **Algorithm:**
353
+ ```typescript
354
+ async function calculateCodeStabilityScore(
355
+ commits: Commit[],
356
+ repo: Repository
357
+ ): Promise<number> {
358
+ // Only analyze commits at least 24 hours old (newer lines haven't had time to survive or die)
359
+ const cutoff = Date.now() - 24 * 60 * 60 * 1000;
360
+ const eligibleCommits = commits.filter(c =>
361
+ c.date.getTime() < cutoff
362
+ );
363
+
364
+ if (eligibleCommits.length === 0) return 1.0;
365
+
366
+ let totalAdded = 0;
367
+ let totalSurviving = 0;
368
+
369
+ for (const commit of eligibleCommits) {
370
+ const stats = await getCommitStats(repo, commit.hash);
371
+ totalAdded += stats.additions;
372
+
373
+ // Check how many lines from this commit still exist in HEAD
374
+ const surviving = await countSurvivingLines(repo, commit.hash);
375
+ totalSurviving += surviving;
376
+ }
377
+
378
+ return totalAdded > 0 ? totalSurviving / totalAdded : 1.0;
379
+ }
380
+ ```
381
+
382
+ **Note:** This metric requires git blame analysis, which is more compute-intensive. It can be made optional or async.
383
+
384
+ ### Composite Score
385
+
386
+ ```typescript
387
+ interface VibeScoreResult {
388
+ score: number; // 0.0-1.0 composite
389
+ components: {
390
+ fileChurn: number;
391
+ timeSpiral: number;
392
+ velocityAnomaly: number;
393
+ codeStability: number;
394
+ };
395
+ weights: number[]; // Current weights
396
+ interpretation: 'elite' | 'high' | 'medium' | 'low';
397
+ }
398
+
399
+ async function calculateVibeScore(
400
+ commits: CommitWithFiles[],
401
+ baseline: Baseline,
402
+ repo: Repository,
403
+ weights: number[] = [0.30, 0.25, 0.20, 0.25]
404
+ ): Promise<VibeScoreResult> {
405
+ const components = {
406
+ fileChurn: calculateFileChurnScore(commits),
407
+ timeSpiral: calculateTimeSpiralScore(commits),
408
+ velocityAnomaly: calculateVelocityAnomalyScore(commits, baseline),
409
+ codeStability: await calculateCodeStabilityScore(commits, repo)
410
+ };
411
+
412
+ const score =
413
+ weights[0] * components.fileChurn +
414
+ weights[1] * components.timeSpiral +
415
+ weights[2] * components.velocityAnomaly +
416
+ weights[3] * components.codeStability;
417
+
418
+ const interpretation =
419
+ score >= 0.85 ? 'elite' :
420
+ score >= 0.70 ? 'high' :
421
+ score >= 0.50 ? 'medium' : 'low';
422
+
423
+ return { score, components, weights, interpretation };
424
+ }
425
+ ```
426
+
427
+ ---
428
+
429
+ ## Component 3: Calibration Engine
430
+
431
+ ### Purpose
432
+ Compare predicted level against actual score, update both models.
433
+
434
+ ### Expected Score Ranges by Level
435
+
436
+ | Vibe Level | Trust | Expected Score | Interpretation |
437
+ |------------|-------|----------------|----------------|
438
+ | 5 | 95% | 0.90-1.00 | Near-perfect flow |
439
+ | 4 | 80% | 0.80-0.92 | Occasional minor fixes |
440
+ | 3 | 60% | 0.65-0.82 | Some iteration normal |
441
+ | 2 | 40% | 0.50-0.70 | Expect rework cycles |
442
+ | 1 | 20% | 0.30-0.55 | Heavy iteration expected |
443
+ | 0 | 0% | 0.00-0.40 | Exploration/research mode |
444
+
445
+ ### Outcome Assessment
446
+
447
+ ```typescript
448
+ type Outcome = 'correct' | 'too_high' | 'too_low';
449
+
450
+ interface ExpectedRange {
451
+ min: number;
452
+ max: number;
453
+ }
454
+
455
+ const EXPECTED_RANGES: Record<number, ExpectedRange> = {
456
+ 5: { min: 0.90, max: 1.00 },
457
+ 4: { min: 0.80, max: 0.92 },
458
+ 3: { min: 0.65, max: 0.82 },
459
+ 2: { min: 0.50, max: 0.70 },
460
+ 1: { min: 0.30, max: 0.55 },
461
+ 0: { min: 0.00, max: 0.40 },
462
+ };
463
+
464
+ function assessOutcome(
465
+ recommendedLevel: number,
466
+ actualScore: number
467
+ ): Outcome {
468
+ const expected = EXPECTED_RANGES[recommendedLevel];
469
+
470
+ if (actualScore >= expected.min && actualScore <= expected.max) {
471
+ return 'correct';
472
+ } else if (actualScore > expected.max) {
473
+ // Score higher than expected = level was too conservative
474
+ return 'too_low';
475
+ } else {
476
+ // Score lower than expected = level was too aggressive
477
+ return 'too_high';
478
+ }
479
+ }
480
+ ```
481
+
482
+ ### Model Updates
483
+
484
+ #### Level Recommender Update (Bayesian)
485
+
486
+ ```typescript
487
+ function updateLevelRecommender(
488
+ model: OrderedLogisticModel,
489
+ features: number[],
490
+ recommendedLevel: number,
491
+ outcome: Outcome,
492
+ learningRate: number = 0.05
493
+ ): OrderedLogisticModel {
494
+ // Determine "true" level based on outcome
495
+ const trueLevel =
496
+ outcome === 'correct' ? recommendedLevel :
497
+ outcome === 'too_high' ? recommendedLevel - 1 :
498
+ recommendedLevel + 1;
499
+
500
+ // Clamp to valid range
501
+ const clampedTrue = Math.max(0, Math.min(5, trueLevel));
502
+
503
+ // Compute gradient of ordinal cross-entropy loss
504
+ const predicted = predictProbabilities(model, features);
505
+ const gradient = computeOrdinalGradient(predicted, clampedTrue, features);
506
+
507
+ // Update weights with Bayesian regularization toward prior
508
+ const newWeights = model.weights.map((w, i) => {
509
+ const priorPull = (model.priorMeans[i] - w) * 0.01; // Regularization
510
+ return w - learningRate * gradient.weights[i] + priorPull;
511
+ });
512
+
513
+ // Update thresholds
514
+ const newThresholds = model.thresholds.map((t, i) =>
515
+ t - learningRate * gradient.thresholds[i]
516
+ );
517
+
518
+ return {
519
+ ...model,
520
+ weights: newWeights,
521
+ thresholds: newThresholds
522
+ };
523
+ }
524
+ ```
525
+
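The update above relies on `predictProbabilities` and `computeOrdinalGradient`, neither of which is shown. A minimal sketch under the ordered-logistic model defined earlier, using a numerical (central-difference) gradient of the ordinal cross-entropy loss; note the signature here takes the model rather than precomputed probabilities, a deliberate simplification of the call above:

```typescript
interface Model {
  weights: number[];
  thresholds: number[]; // must be increasing
}

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

// P(Y = k) from the cumulative model P(Y <= k) = sigmoid(theta_k - eta).
function predictProbabilities(model: Model, features: number[]): number[] {
  const eta = features.reduce((s, f, i) => s + f * model.weights[i], 0);
  const cum = model.thresholds.map(t => sigmoid(t - eta));
  const probs: number[] = [cum[0]];
  for (let i = 1; i < cum.length; i++) probs.push(cum[i] - cum[i - 1]);
  probs.push(1 - cum[cum.length - 1]);
  return probs;
}

// Ordinal cross-entropy: negative log-probability of the true level.
function ordinalLoss(model: Model, features: number[], trueLevel: number): number {
  const p = predictProbabilities(model, features)[trueLevel];
  return -Math.log(Math.max(p, 1e-12));
}

// Numerical-gradient variant of computeOrdinalGradient (hypothetical fallback).
function computeOrdinalGradient(
  model: Model,
  features: number[],
  trueLevel: number,
  h: number = 1e-5
): { weights: number[]; thresholds: number[] } {
  const dW = model.weights.map((_, i) => {
    const up = { ...model, weights: model.weights.map((w, j) => (j === i ? w + h : w)) };
    const dn = { ...model, weights: model.weights.map((w, j) => (j === i ? w - h : w)) };
    return (ordinalLoss(up, features, trueLevel) - ordinalLoss(dn, features, trueLevel)) / (2 * h);
  });
  const dT = model.thresholds.map((_, i) => {
    const up = { ...model, thresholds: model.thresholds.map((t, j) => (j === i ? t + h : t)) };
    const dn = { ...model, thresholds: model.thresholds.map((t, j) => (j === i ? t - h : t)) };
    return (ordinalLoss(up, features, trueLevel) - ordinalLoss(dn, features, trueLevel)) / (2 * h);
  });
  return { weights: dW, thresholds: dT };
}
```

An analytic gradient is cheaper and standard for production, but the numerical version is easy to verify against and good enough at this data scale.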
526
+ #### Vibe Score Update (ECE Optimization)
527
+
528
+ ```typescript
529
+ interface CalibrationSample {
530
+ level: number;
531
+ score: number;
532
+ timestamp: Date;
533
+ }
534
+
535
+ function updateVibeScoreWeights(
536
+ currentWeights: number[],
537
+ history: CalibrationSample[],
538
+ learningRate: number = 0.1
539
+ ): number[] {
540
+ // Group by level
541
+ const bins = new Map<number, number[]>();
542
+ for (const sample of history) {
543
+ const scores = bins.get(sample.level) || [];
544
+ scores.push(sample.score);
545
+ bins.set(sample.level, scores);
546
+ }
547
+
548
+ // Calculate Expected Calibration Error
549
+ let ece = 0;
550
+ let totalSamples = 0;
551
+
552
+ for (const [level, scores] of bins) {
553
+ const expected = EXPECTED_RANGES[level];
554
+ const expectedCenter = (expected.min + expected.max) / 2;
555
+ const actualMean = scores.reduce((a, b) => a + b, 0) / scores.length;
556
+
557
+ ece += scores.length * Math.abs(actualMean - expectedCenter);
558
+ totalSamples += scores.length;
559
+ }
560
+ ece /= totalSamples;
561
+
562
+ // If ECE > 0.10, adjust weights
563
+ if (ece > 0.10) {
564
+ // Simplified: increase weight of most predictive component
565
+ // Full version: compute gradient of ECE w.r.t. weights
566
+ const adjustments = computeECEGradient(history, currentWeights);
567
+
568
+ const clamped = currentWeights.map((w, i) =>
+ Math.max(0.1, Math.min(0.5, w - learningRate * adjustments[i])) // Clamp to reasonable range
+ );
+ // Renormalize so the weights still sum to 1 (keeps the composite on a 0-1 scale)
+ const sum = clamped.reduce((a, b) => a + b, 0);
+ return clamped.map(w => w / sum);
572
+ }
573
+
574
+ return currentWeights;
575
+ }
576
+ ```
577
+
578
+ ### Calibration Monitoring
579
+
580
+ **Target:** Expected Calibration Error (ECE) < 0.10
581
+
582
+ ```typescript
583
+ function calculateECE(history: CalibrationSample[]): number {
584
+ const bins = groupByLevel(history);
585
+ let ece = 0;
586
+ let total = 0;
587
+
588
+ for (const [level, scores] of bins) {
589
+ const expected = (EXPECTED_RANGES[level].min + EXPECTED_RANGES[level].max) / 2;
590
+ const actual = mean(scores);
591
+ ece += scores.length * Math.abs(actual - expected);
592
+ total += scores.length;
593
+ }
594
+
595
+ return total > 0 ? ece / total : 0;
596
+ }
597
+
598
+ function generateReliabilityDiagram(history: CalibrationSample[]): void {
599
+ // For each level, plot expected vs actual score
600
+ // Perfect calibration = diagonal line
601
+ console.log('Level | Expected | Actual | Gap');
602
+ console.log('------|----------|--------|----');
603
+
604
+ for (let level = 0; level <= 5; level++) {
605
+ const samples = history.filter(s => s.level === level);
606
+ if (samples.length === 0) continue;
607
+
608
+ const expected = (EXPECTED_RANGES[level].min + EXPECTED_RANGES[level].max) / 2;
609
+ const actual = mean(samples.map(s => s.score));
610
+ const gap = Math.abs(expected - actual);
611
+
612
+ console.log(` ${level} | ${expected.toFixed(2)} | ${actual.toFixed(2)} | ${gap.toFixed(2)}`);
613
+ }
614
+ }
615
+ ```
616
+
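`groupByLevel` and `mean` are assumed helpers in the two functions above. Minimal versions consistent with that code (illustrative; the package may define them elsewhere):

```typescript
interface CalibrationSample {
  level: number;
  score: number;
  timestamp: Date;
}

// Bucket calibration samples' scores by recommended level.
function groupByLevel(history: CalibrationSample[]): Map<number, number[]> {
  const bins = new Map<number, number[]>();
  for (const s of history) {
    const scores = bins.get(s.level) ?? [];
    scores.push(s.score);
    bins.set(s.level, scores);
  }
  return bins;
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}
```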
617
+ ---
618
+
619
+ ## Data Schema
620
+
621
+ ### Session Record
622
+
623
+ ```typescript
624
+ interface SessionRecord {
625
+ // Identity
626
+ sessionId: string;
627
+ timestamp: Date;
628
+ developerId: string; // Anonymized hash
629
+
630
+ // Pre-session input
631
+ questions: {
632
+ reversibility: number; // -2 to +1
633
+ blastRadius: number; // -2 to +1
634
+ verificationCost: number; // -2 to +1
635
+ domainComplexity: number; // -2 to +1
636
+ aiTrackRecord: number; // -2 to +1
637
+ };
638
+
639
+ // Recommendation
640
+ recommendedLevel: number;
641
+ levelConfidence: number;
642
+ actualLevelUsed: number;
643
+
644
+ // Post-session metrics (semantic-based)
645
+ semanticMetrics: {
646
+ trustPassRate: number;
647
+ reworkRatio: number;
648
+ debugSpiralDuration: number;
649
+ flowEfficiency: number;
650
+ iterationVelocity: number;
651
+ };
652
+
653
+ // Post-session metrics (semantic-free)
654
+ gitMetrics: {
655
+ fileChurnScore: number;
656
+ timeSpiralScore: number;
657
+ velocityAnomalyScore: number;
658
+ codeStabilityScore: number;
659
+ };
660
+
661
+ // Computed outcomes
662
+ vibeScore: number;
663
+ outcome: 'correct' | 'too_high' | 'too_low';
664
+
665
+ // Model state
666
+ modelVersion: string;
667
+ recommenderWeights: number[];
668
+ scoreWeights: number[];
669
+ }
670
+ ```
671
+
672
+ ### Storage
673
+
674
+ ```typescript
675
+ // Option A: JSONL file (simple, portable)
676
+ // .vibe-check/sessions.jsonl
677
+
678
+ // Option B: SQLite (queryable, efficient)
679
+ // .vibe-check/vibe.db
680
+
681
+ // Schema for SQLite:
682
+ const SCHEMA = `
683
+ CREATE TABLE sessions (
684
+ id TEXT PRIMARY KEY,
685
+ timestamp TEXT,
686
+ developer_id TEXT,
687
+ recommended_level INTEGER,
688
+ vibe_score REAL,
689
+ outcome TEXT,
690
+ model_version TEXT,
691
+ data JSON
692
+ );
693
+
694
+ CREATE INDEX idx_developer ON sessions(developer_id);
695
+ CREATE INDEX idx_timestamp ON sessions(timestamp);
696
+ CREATE INDEX idx_outcome ON sessions(outcome);
697
+ `;
698
+ ```
699
+
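For Option A, a minimal JSONL layer might look like the following sketch (assumes a Node.js runtime; the path and helper names are illustrative, not the package's API):

```typescript
import { appendFileSync, readFileSync, existsSync } from "node:fs";

// Hypothetical append-only JSONL session log: one JSON record per line.
const LOG_PATH = ".vibe-check/sessions.jsonl";

function appendSession(record: object, path: string = LOG_PATH): void {
  appendFileSync(path, JSON.stringify(record) + "\n", "utf8");
}

function loadSessions(path: string = LOG_PATH): object[] {
  if (!existsSync(path)) return [];
  return readFileSync(path, "utf8")
    .split("\n")
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line));
}
```

JSONL keeps writes atomic-enough for a single-user CLI and diffs cleanly in git; SQLite becomes worthwhile once queries over hundreds of sessions are needed.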
700
+ ---
701
+
702
+ ## Validation Protocol
703
+
704
+ ### Anthropic-Grade Standards
705
+
706
+ 1. **Report SEM (Standard Error of Mean)**
707
+ ```
708
+ Accuracy: 82% ± 3.2% (n=64)
709
+ ```
710
+
711
+ 2. **Use Paired-Difference Analysis**
712
+ - Compare model vs baseline on SAME sessions
713
+ - Reduces variance in estimates
714
+
715
+ 3. **Clustered Standard Errors**
716
+ - Sessions from same developer are correlated
717
+ - Naive SEs can understate true uncertainty substantially (~3x)
718
+
719
+ 4. **Power Analysis Before Experiment**
720
+ - For d=0.5 effect, α=0.05, power=0.80: n≈64 per group
721
+
722
+ 5. **95% Confidence Intervals**
723
+ - `CI = mean ± 1.96 × SEM`
724
+
725
+ ### Validation Metrics
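The reporting rules above can be sketched directly. The sample-size function uses the standard two-sample z-approximation, which reproduces the n≈64 figure quoted for d=0.5 (a sketch for illustration, not the package's code):

```typescript
// SEM and 95% CI for a set of accuracy samples.
function meanSemCI(samples: number[]): { mean: number; sem: number; ci: [number, number] } {
  const n = samples.length;
  const m = samples.reduce((a, b) => a + b, 0) / n;
  const variance = samples.reduce((a, x) => a + (x - m) ** 2, 0) / (n - 1); // sample variance
  const sem = Math.sqrt(variance / n);
  return { mean: m, sem, ci: [m - 1.96 * sem, m + 1.96 * sem] };
}

// Per-group n for effect size d at alpha=0.05 (two-sided), power=0.80:
// n ≈ 2 * ((z_{α/2} + z_β) / d)^2  →  ~63-64 per group for d = 0.5
function sampleSizePerGroup(d: number, zAlpha = 1.96, zBeta = 0.8416): number {
  return Math.ceil(2 * ((zAlpha + zBeta) / d) ** 2);
}
```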
726
+
727
+ | Metric | Target | How to Measure |
728
+ |--------|--------|----------------|
729
+ | Level Accuracy | >80% | Correct level / total |
730
+ | Cohen's Kappa | >0.60 | Agreement with expert |
731
+ | Mean Absolute Error | <0.8 levels | Avg distance from true |
732
+ | Calibration (ECE) | <0.10 | Predicted vs actual |
733
+ | Brier Score | <0.15 | Probability calibration |
734
+ | Domain Consistency | <10% gap | Performance across domains |
735
+
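The Brier score row above refers to probability calibration of the level recommender. For a multi-class recommender it can be computed as the mean squared distance between each predicted probability vector and the one-hot true level (a minimal sketch, assuming this data shape):

```typescript
// Multi-class Brier score: lower is better; 0 means perfectly confident
// and correct predictions.
function brierScore(probs: number[][], trueLevels: number[]): number {
  let total = 0;
  probs.forEach((p, i) => {
    p.forEach((pk, k) => {
      const y = k === trueLevels[i] ? 1 : 0;
      total += (pk - y) ** 2;
    });
  });
  return total / probs.length;
}
```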
736
+ ### Study Phases
737
+
738
+ ```
739
+ PHASE 1: Bootstrap (Weeks 1-4)
740
+ ├── Collect 50+ sessions without ML recommendations
741
+ ├── Use simple formula for level (current approach)
742
+ ├── Compute all metrics for each session
743
+ └── Establish baseline accuracy
744
+
745
+ PHASE 2: Learned Weights (Weeks 5-8)
746
+ ├── Train Ordered Logistic on collected data
747
+ ├── Split 70/20/10 (train/val/test)
748
+ ├── Measure improvement over baseline
749
+ └── Report accuracy ± SEM
750
+
751
+ PHASE 3: A/B Test (Weeks 9-16)
752
+ ├── Control: Simple formula recommendations
753
+ ├── Treatment: ML-based recommendations
754
+ ├── Measure: Does ML improve Trust Pass Rate?
755
+ └── Statistical test: Paired t-test, report Cohen's d
756
+
757
+ PHASE 4: Feedback Loop (Weeks 17-24)
758
+ ├── Enable continuous learning
759
+ ├── Monitor ECE weekly
760
+ ├── Detect concept drift (>5% accuracy drop)
761
+ └── Publish findings
762
+ ```
763
+
764
+ ### Sample Size Requirements
765
+
766
+ | Phase | Samples | Purpose |
767
+ |-------|---------|---------|
768
+ | Bootstrap | 50 | Establish baseline weights |
769
+ | Validation | 100 | Compute accuracy with CI |
770
+ | A/B Test | 64 per group | Detect medium effect (d=0.5) |
771
+ | Production | 200+ | Enable full ensemble |

---

## Ground Truth Sources

### Three-Source Triangulation

To validate the score scientifically, we need independent sources of ground truth:

#### Source 1: DORA Metrics (Objective)

| DORA Metric | Vibe Score Correlation |
|-------------|------------------------|
| Deployment Frequency | Higher → Higher Score |
| Lead Time for Changes | Shorter → Higher Score |
| Change Failure Rate | Lower → Higher Score |
| Mean Time to Restore | Shorter → Higher Score |
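The predicted relationships above are monotone, not necessarily linear, so Spearman rank correlation is a natural check. A self-contained sketch (hypothetical helper, not part of vibe-check):

```typescript
// Assign 1-based ranks, averaging over ties.
function ranks(xs: number[]): number[] {
  const sorted = xs.map((v, i) => [v, i] as const).sort((a, b) => a[0] - b[0]);
  const r: number[] = new Array(xs.length).fill(0);
  let i = 0;
  while (i < sorted.length) {
    let j = i;
    while (j + 1 < sorted.length && sorted[j + 1][0] === sorted[i][0]) j++;
    const avg = (i + j + 2) / 2; // average of ranks i+1 .. j+1
    for (let k = i; k <= j; k++) r[sorted[k][1]] = avg;
    i = j + 1;
  }
  return r;
}

// Spearman rho = Pearson correlation of the rank vectors.
function spearman(x: number[], y: number[]): number {
  const rx = ranks(x), ry = ranks(y);
  const n = x.length;
  const mx = rx.reduce((s, v) => s + v, 0) / n;
  const my = ry.reduce((s, v) => s + v, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let k = 0; k < n; k++) {
    num += (rx[k] - mx) * (ry[k] - my);
    dx += (rx[k] - mx) ** 2;
    dy += (ry[k] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}
```

For example, `spearman(deployFrequencies, vibeScores)` should come out positive if the first row of the table holds.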

#### Source 2: Developer Self-Report (Subjective)

```typescript
interface WeeklySurvey {
  // NASA-TLX (validated scale), each rated 1-10
  mentalDemand: number;
  frustration: number;
  effort: number;

  // Flow State, each rated 1-5
  concentration: number;
  timeAwareness: number;

  // Custom, rated 1-5
  codeStickiness: number;
}
```

#### Source 3: Expert Coding (Behavioral)

Train 2-3 human coders to rate git history:
- Spiral severity (0-3)
- Frustration signals
- Code quality indicators

**Inter-rater reliability target:** Cohen's κ ≥ 0.70
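The κ ≥ 0.70 check can be computed directly from two coders' labels. A minimal sketch (hypothetical helper; the input arrays hold categorical ratings such as spiral severity 0-3):

```typescript
// Cohen's kappa: chance-corrected agreement between two raters.
// kappa = (p_observed - p_expected) / (1 - p_expected)
function cohensKappa(a: number[], b: number[]): number {
  const n = a.length;
  const cats = [...new Set([...a, ...b])];
  let agree = 0;
  for (let i = 0; i < n; i++) if (a[i] === b[i]) agree++;
  const po = agree / n;
  let pe = 0; // expected agreement from each rater's marginal frequencies
  for (const c of cats) {
    const pa = a.filter((v) => v === c).length / n;
    const pb = b.filter((v) => v === c).length / n;
    pe += pa * pb;
  }
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}
```

κ = 1 means perfect agreement; κ = 0 means agreement no better than chance, which is why the 0.70 threshold is stricter than raw percent agreement.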

### Triangulation Matrix

| | DORA (obj) | Survey (subj) | Expert (behav) | Vibe Score (computed) |
|---|---|---|---|---|
| High Score | Fast deploys | Low stress | Clean history | 0.85+ |
| Low Score | Slow / failures | High stress | Spiral detected | <0.50 |

---

## Implementation Roadmap

### Phase 1: Core Metrics (Week 1-2)

```
[ ] Implement FileChurnScore
    - Parse git log --name-only --format
    - Track file touch timestamps
    - Detect 3+ touches in 1 hour

[ ] Implement TimeSpiralScore
    - Parse commit timestamps
    - Detect <5min clusters
    - Calculate spiral ratio

[ ] Implement VelocityAnomalyScore
    - Calculate personal baseline (last 30 days)
    - Z-score transform
    - Sigmoid normalization

[ ] Implement composite VibeScore
    - Weighted combination
    - Normalize to 0-1
    - Add to CLI output
```
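The VelocityAnomalyScore steps (personal baseline → z-score → sigmoid) can be sketched as follows; the population standard deviation and the plain logistic squash are assumptions, not the package's confirmed implementation:

```typescript
// Z-score a session's commit velocity against a personal baseline
// (e.g. daily velocities from the last 30 days), then squash through a
// sigmoid so the result lands in (0, 1), with 0.5 meaning "at baseline".
function velocityAnomalyScore(sessionVelocity: number, baseline: number[]): number {
  const n = baseline.length;
  const mean = baseline.reduce((s, v) => s + v, 0) / n;
  const sd = Math.sqrt(baseline.reduce((s, v) => s + (v - mean) ** 2, 0) / n);
  const z = sd === 0 ? 0 : (sessionVelocity - mean) / sd;
  return 1 / (1 + Math.exp(-z)); // sigmoid normalization
}
```

A velocity far above the personal baseline pushes the score toward 1, flagging an anomaly regardless of how the developer compares to anyone else.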

### Phase 2: Level Recommender (Week 3-4)

```
[ ] Add --recommend flag to CLI
    vibe-check --recommend

[ ] Implement question prompts
    - Interactive mode for 5 questions
    - Or accept as flags

[ ] Implement simple formula (baseline)
    Level = 3 + Q1 + Q2 + Q3 + Q4 + Q5

[ ] Store session data for future training
    .vibe-check/sessions.jsonl
```
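A minimal sketch of the baseline formula, with two labeled assumptions: each question's answer is encoded as -1, 0, or +1, and levels are clamped to a 1-5 range (neither encoding nor range is confirmed by the source):

```typescript
// Baseline recommender: Level = 3 + Q1 + Q2 + Q3 + Q4 + Q5, clamped.
// ASSUMPTION: each answer is encoded -1 / 0 / +1 and levels run 1-5.
function recommendLevel(answers: [number, number, number, number, number]): number {
  const raw = 3 + answers.reduce((s, q) => s + q, 0);
  return Math.min(5, Math.max(1, raw));
}
```

This is deliberately the dumbest defensible baseline; Phase 4's learned model only earns its keep if it beats this formula on held-out sessions.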

### Phase 3: Calibration Loop (Week 5-6)

```
[ ] Implement outcome assessment
    - Compare score vs expected range
    - Classify as correct/too_high/too_low

[ ] Implement ECE calculation
    - Group by level
    - Calculate mean score per level
    - Compare to expected centers

[ ] Add calibration report to CLI
    vibe-check --calibration-report
```
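The outcome assessment step above reduces to a three-way classification of the session's VibeScore against the range expected for the recommended level. A sketch, where the per-level ranges are placeholders rather than calibrated values from the tool:

```typescript
type Outcome = "correct" | "too_high" | "too_low";

// Classify a session's VibeScore against the expected range for its
// recommended level. The {lo, hi} range would come from calibration data.
function assessOutcome(score: number, expected: { lo: number; hi: number }): Outcome {
  if (score < expected.lo) return "too_low";
  if (score > expected.hi) return "too_high";
  return "correct";
}
```

These labels are what the level-grouped ECE calculation and the weight updates in Phase 4 would consume.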

### Phase 4: ML Models (Week 7-8)

```
[ ] Implement Ordered Logistic Regression
    - Use mord or custom implementation
    - Train on collected sessions
    - Compare to simple formula

[ ] Implement online updates
    - partial_fit after each session
    - Bayesian regularization

[ ] Implement weight updates for VibeScore
    - ECE optimization
    - Gradient descent on weights
```
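To make the "gradient descent on weights" item concrete, here is one possible online update: a single SGD step on squared error between the weighted VibeScore and an observed target (e.g. an expert rating), followed by renormalization so the weights stay a convex combination. The learning rate, target source, and non-negativity constraint are all assumptions, not the package's actual update rule:

```typescript
// One online SGD step on the VibeScore component weights.
// features: per-metric scores in [0, 1] for one session.
// target: ground-truth score for that session.
function updateWeights(
  weights: number[],
  features: number[],
  target: number,
  lr = 0.05,
): number[] {
  const pred = weights.reduce((s, w, i) => s + w * features[i], 0);
  const err = target - pred;
  // Gradient step, clipped at zero to keep weights non-negative.
  const updated = weights.map((w, i) => Math.max(0, w + lr * err * features[i]));
  const total = updated.reduce((s, w) => s + w, 0);
  return updated.map((w) => w / total); // renormalize to sum to 1
}
```

Renormalizing keeps the composite score interpretable on the same 0-1 scale across updates, at the cost of making the step only approximately a gradient step.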

### Phase 5: Validation (Week 9-12)

```
[ ] Collect 100+ sessions
[ ] Run A/B test (formula vs ML)
[ ] Calculate all validation metrics
[ ] Generate reliability diagrams
[ ] Write validation report
```

---

## CLI Interface

```bash
# Current functionality (enhanced)
vibe-check --since "1 week ago"
# Output now includes VibeScore (0-1) in addition to existing metrics

# New: Get level recommendation before starting work
vibe-check --recommend
# Interactive prompts for 5 questions
# Output: Recommended Level 3 (82% confidence)

# New: Record session with recommendation
vibe-check --start-session --level 3
# Records session start time and recommended level

# New: End session and record outcome
vibe-check --end-session
# Computes all metrics, VibeScore, assesses outcome, updates models

# New: View calibration status
vibe-check --calibration
# Shows ECE, reliability diagram, model health

# New: Export training data
vibe-check --export-sessions > sessions.json
```

---

## Academic References

### Algorithm Sources
- **Ordered Logistic:** Agresti (2010), *Analysis of Ordinal Categorical Data*
- **Bayesian Methods:** Gelman et al., *Bayesian Data Analysis*
- **Online Learning:** Shalev-Shwartz, *Online Learning and Online Convex Optimization*
- **ECE:** Nixon et al. (2019), "Measuring Calibration in Deep Learning", CVPR Workshops

### Validation Methodology
- **DORA Metrics:** Forsgren, Humble, Kim, *Accelerate* (2018)
- **SPACE Framework:** Forsgren et al. (2021), ACM Queue
- **Cohen's Kappa:** Cohen (1960), *Educational and Psychological Measurement*
- **Power Analysis:** Cohen (1988), *Statistical Power Analysis*

### Anthropic Standards
- [Statistical Approach to Model Evals](https://www.anthropic.com/research/statistical-approach-to-model-evals)
- [Challenges in Evaluating AI Systems](https://www.anthropic.com/research/evaluating-ai-systems)

---

## Open Questions

1. **Cold start:** How many commits needed for reliable VibeScore?
   - Hypothesis: 10+ commits minimum
   - Need validation

2. **Team vs individual:** Should metrics be per-developer or per-repo?
   - DORA warns against individual metrics
   - Consider: aggregate for team, awareness for individual

3. **Code stability performance:** `git blame` is slow for large repos
   - Option A: Make optional
   - Option B: Sample recent commits only
   - Option C: Async background calculation

4. **Semantic vs non-semantic:** What if both are available?
   - Current plan: Use both as features
   - Non-semantic provides fallback when semantic unavailable

---

## Bundle Stats

- Combined research tokens: ~90k
- Unified bundle tokens: ~12k
- Compression ratio: ~7.5:1

---

## Next Steps

Ready for `/plan` to create implementation tasks?

**Recommended starting point:** Phase 1 (Core Metrics) - adds VibeScore to existing tool without breaking changes.