dev-playbooks 1.4.0 → 1.5.0
package/package.json
CHANGED
@@ -0,0 +1,394 @@
---
name: devbooks-convergence-audit
description: 'devbooks-convergence-audit: Evaluates DevBooks workflow convergence using the "evidence first, distrust declarations" principle; detects "Sisyphus anti-patterns" and "fake completion". Actively verifies claims rather than trusting documents. Use when the user says "evaluate convergence/check upgrade health/Sisyphus detection/workflow audit" etc.'
allowed-tools:
  - Glob
  - Grep
  - Read
  - Bash
---

# DevBooks: Convergence Audit

## Core Principle: Anti-Deception Design

> **Golden Rule**: **Evidence > Declarations**. Never trust an assertion in a document; confirm it through verifiable evidence.

### Scenarios Where AI Gets Deceived (Must Prevent)

| Deception Scenario | Wrong AI Behavior | Correct Behavior |
|--------------------|-------------------|------------------|
| Document says `Status: Done` | Believes it's complete | Verify: Are all tests actually green? Does evidence exist? |
| AC matrix all `[x]` | Believes full coverage | Verify: Does a test file for each AC exist and pass? |
| Document says "tests passed" | Believes they passed | Verify: Actually run the tests or check CI log timestamps |
| `evidence/` directory exists | Believes evidence exists | Verify: Is the directory non-empty? Is the content a valid test log? |
| tasks.md all `[x]` | Believes implemented | Verify: Do the corresponding code files exist with substance? |
| Commit message says "fixed" | Believes fixed | Verify: Did the related tests change from red to green? |

### Three Anti-Deception Principles

```
1. Distrust Declarations
   - Any "complete/passed/covered" claim in a document is a hypothesis to verify
   - Default stance: claims may be wrong, outdated, or optimistic

2. Evidence First
   - Code and test results are the only source of truth
   - Log timestamps must be later than the last code modification
   - Empty directory/file = no evidence

3. Cross Validation
   - Declaration vs evidence: check consistency
   - Code vs tests: check that they match
   - Multiple documents: check for contradictions
```

---

## Verification Checklist (Execute Each)

### Check 1: Status Field Truthfulness Verification

**Document Claim**: `verification.md` contains `Status: Done` or `Status: Verified`

**Verification Steps**:
```bash
# 1. Check if verification.md exists
[[ -f "verification.md" ]] || echo "❌ verification.md does not exist"

# 2. Check if evidence/green-final/ has content
if [[ -z "$(ls -A evidence/green-final/ 2>/dev/null)" ]]; then
  echo "❌ Status claims complete, but evidence/green-final/ is empty"
fi

# 3. Check if evidence timestamp is later than last code modification
# Try GNU stat first, then BSD/macOS stat (the reverse order prints garbage on Linux)
mtime() { stat -c %Y "$@" 2>/dev/null || stat -f %m "$@" 2>/dev/null; }
# Use the newest file under src/: a directory's own mtime misses edits in subdirectories
code_mtime=$(find src -type f | while IFS= read -r f; do mtime "$f"; done | sort -n | tail -1)
evidence_mtime=$(mtime evidence/green-final/* | sort -n | tail -1)
if [[ -n "$code_mtime" && -n "$evidence_mtime" && "$evidence_mtime" -lt "$code_mtime" ]]; then
  echo "❌ Evidence time is earlier than code modification, evidence may be stale"
fi
```

**Deception Detection**:
- ⚠️ Status=Done but evidence/ empty → **Fake Completion**
- ⚠️ Status=Done but evidence timestamp too old → **Stale Evidence**
- ⚠️ Status=Done but tests actually fail → **False Status**
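
Once the shell steps have gathered their inputs, the three labels above can be assigned mechanically. The following is a minimal sketch under stated assumptions: `StatusEvidence` and `classify_status` are hypothetical names that do not come from DevBooks, and all timestamps are assumed to be Unix epoch seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StatusEvidence:
    """Inputs gathered by the shell steps (all field names are hypothetical)."""
    claimed_status: str           # e.g. "Done", "Verified", "In Progress"
    evidence_mtimes: List[float]  # mtimes of files under evidence/green-final/
    code_mtime: float             # time of the last code modification
    tests_pass: Optional[bool]    # live test result, or None if tests were not run


def classify_status(e: StatusEvidence) -> List[str]:
    """Return the deception labels that apply to a claimed-complete status."""
    if e.claimed_status not in ("Done", "Verified"):
        return []  # nothing is claimed, so there is nothing to refute
    findings = []
    if not e.evidence_mtimes:
        findings.append("Fake Completion")
    elif max(e.evidence_mtimes) < e.code_mtime:
        findings.append("Stale Evidence")
    if e.tests_pass is False:
        findings.append("False Status")
    return findings
```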

---

### Check 2: AC Coverage Matrix Truthfulness Verification

**Document Claim**: `[x]` in AC matrix means covered

**Verification Steps**:
```bash
# 1. Extract all ACs claimed as covered
grep -E '^\| AC-[0-9]+.*\[x\]' verification.md | while IFS= read -r line; do
  ac_id=$(echo "$line" | grep -oE 'AC-[0-9]+' | head -1)
  test_id=$(echo "$line" | grep -oE 'T-[0-9]+' | head -1)

  # 2. Verify corresponding test exists
  # An empty test_id must not become an empty alternative, which would match every file
  pattern="$ac_id"
  [[ -n "$test_id" ]] && pattern="$test_id|$ac_id"
  if ! grep -rqE "$pattern" tests/; then
    echo "❌ $ac_id claims covered, but no corresponding test found"
  fi
done

# 3. Actually run tests to verify (most reliable)
# Judge by the exit status of npm test: grepping for "failed" also matches "0 failed"
npm test 2>&1 | tee /tmp/test-output.log
if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
  echo "❌ ACs claim full coverage, but tests actually have failures"
fi
```

**Deception Detection**:
- ⚠️ AC checked but corresponding test file doesn't exist → **False Coverage**
- ⚠️ AC checked but test actually fails → **Fake Green**
- ⚠️ AC checked but test content is empty/placeholder → **Placeholder Test**
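
The False Coverage check can also be expressed over in-memory text, which makes the matching rule explicit. This is a sketch only: `uncovered_acs` and `mentions` are hypothetical helpers, the `| AC-001 | ... | T-001 | [x] |` row layout is assumed from the grep pattern above, and an AC counts as covered when its AC ID or test ID appears as a whole token in any test source.

```python
import re
from typing import Dict, List

# Same row shape the grep step looks for: "| AC-001 | ... | [x] |"
CHECKED_ROW = re.compile(r"^\|\s*AC-\d+.*\[x\]")


def mentions(identifier: str, text: str) -> bool:
    """True if the identifier appears as a whole token (AC-1 must not match AC-10)."""
    return re.search(rf"\b{re.escape(identifier)}\b", text) is not None


def uncovered_acs(matrix_md: str, test_sources: Dict[str, str]) -> List[str]:
    """Return checked AC IDs that no test source mentions by AC ID or test ID."""
    uncovered = []
    for line in matrix_md.splitlines():
        if not CHECKED_ROW.match(line):
            continue  # unchecked rows make no claim
        ids = re.findall(r"\b(?:AC|T)-\d+\b", line)  # the AC ID comes first in the row
        covered = any(mentions(i, src) for i in ids for src in test_sources.values())
        if not covered:
            uncovered.append(ids[0])
    return uncovered
```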

---

### Check 3: tasks.md Completion Truthfulness Verification

**Document Claim**: `[x]` in tasks.md means completed

**Verification Steps**:
```bash
# 1. Extract all tasks claimed as complete
grep -E '^- \[x\]' tasks.md | while IFS= read -r line; do
  # 2. Extract keywords from task description (function name/file name/feature)
  # Strip the "- [x]" prefix first so the checkbox "x" is not treated as a keyword
  keywords=$(echo "$line" | sed 's/^- \[x\] *//' | grep -oE '[A-Za-z]+[A-Za-z0-9]*' | head -5)

  # 3. Verify code has corresponding implementation
  for kw in $keywords; do
    if ! grep -rq "$kw" src/; then
      echo "⚠️ Task claims complete, but keyword not found in code: $kw"
    fi
  done
done

# 4. Check for "skeleton code" (only function signatures without implementation)
# Anchor "pass" to a whole line so identifiers such as "bypass" do not match
grep -rE 'throw new Error\(.*not implemented|TODO|FIXME|^[[:space:]]*pass$|\.\.\.}' src/ && \
  echo "⚠️ Found unimplemented placeholder code"
```

**Deception Detection**:
- ⚠️ Task checked but code doesn't exist → **False Completion**
- ⚠️ Task checked but code is placeholder → **Skeleton Code**
- ⚠️ Task checked but feature not callable → **Dead Code**
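
For Skeleton Code, reporting the line numbers of placeholder markers makes a finding easier to act on than a bare grep hit. The sketch below is illustrative: `find_placeholders` is a hypothetical helper, and its pattern covers only a subset of the markers used in the shell step (the "not implemented" stub, `TODO`/`FIXME`, and a line consisting only of `pass`).

```python
import re
from typing import List, Tuple

PLACEHOLDER = re.compile(
    r"throw new Error\(.*(?i:not implemented)"  # JS/TS stub, any letter case
    r"|\b(?:TODO|FIXME)\b"                      # unfinished-work markers
    r"|^\s*pass\s*$"                            # a line consisting only of `pass`
)


def find_placeholders(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every line with a placeholder marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]
```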

---

### Check 4: Evidence Validity Verification

**Document Claim**: `evidence/` directory contains test evidence

**Verification Steps**:
```bash
# 1. Check if directory exists and is non-empty
if [[ ! -d "evidence" ]] || [[ -z "$(ls -A evidence/)" ]]; then
  echo "❌ evidence/ does not exist or is empty"
  exit 1
fi

# 2. Check if evidence files have substantial content
shopt -s globstar nullglob  # without globstar, ** matches only one directory level
for f in evidence/**/*; do
  if [[ -f "$f" ]]; then
    lines=$(wc -l < "$f")
    if [[ $lines -lt 5 ]]; then
      echo "⚠️ Evidence file has too little content: $f ($lines lines)"
    fi

    # 3. Check if it's a valid test log (contains test framework output characteristics)
    if ! grep -qE 'PASS|FAIL|✓|✗|passed|failed|test|spec' "$f"; then
      echo "⚠️ Evidence file doesn't look like test log: $f"
    fi
  fi
done

# 4. Check if red-baseline evidence really is red (has failures)
if [[ -d "evidence/red-baseline" ]]; then
  if ! grep -rqE 'FAIL|Error|✗|failed' evidence/red-baseline/; then
    echo "❌ red-baseline claims to be red, but no failure records"
  fi
fi

# 5. Check if green-final evidence really is green (all pass)
# Note: this pattern also matches summary lines such as "0 failed"; adapt it to your test runner
if [[ -d "evidence/green-final" ]]; then
  if grep -rqE 'FAIL|Error|✗|failed' evidence/green-final/; then
    echo "❌ green-final claims to be green, but contains failure records"
  fi
fi
```

**Deception Detection**:
- ⚠️ evidence/ exists but content is empty → **Empty Evidence**
- ⚠️ Evidence file too small (< 5 lines) → **Placeholder Evidence**
- ⚠️ red-baseline has no failure records → **Fake Red**
- ⚠️ green-final contains failure records → **Fake Green**
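
A plain search for `failed` also flags a healthy summary such as `12 passed, 0 failed`, so a green log can be misreported as Fake Green. The sketch below shows one stricter way to apply the four labels above. It is an assumption-laden illustration: `has_failures` and `classify_evidence` are hypothetical names, and the failure pattern assumes Jest-style output (`FAIL` at line start, `✗`, or a non-zero count before `failed`/`failing`), so it would need adjusting for other test runners.

```python
import re
from typing import List

# Assumed failure markers: "FAIL" at line start, "✗", or a non-zero "N failed"/"N failing"
FAILURE = re.compile(r"^\s*FAIL\b|✗|\b[1-9]\d*\s+(?:failed|failing)\b", re.MULTILINE)


def has_failures(log_text: str) -> bool:
    """True if the log records at least one failing test."""
    return FAILURE.search(log_text) is not None


def classify_evidence(kind: str, log_text: str, min_lines: int = 5) -> List[str]:
    """Label one evidence log; `kind` is "red-baseline" or "green-final"."""
    if not log_text.strip():
        return ["Empty Evidence"]
    findings = []
    if len(log_text.splitlines()) < min_lines:
        findings.append("Placeholder Evidence")
    failed = has_failures(log_text)
    if kind == "red-baseline" and not failed:
        findings.append("Fake Red")
    if kind == "green-final" and failed:
        findings.append("Fake Green")
    return findings
```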

---

### Check 5: Git History Cross Validation

**Principle**: Git history doesn't lie; use it to verify document claims

**Verification Steps**:
```bash
# 1. Check if claimed-complete change has corresponding code commits
change_id="xxx"
commits=$(git log --oneline --all --grep="$change_id" | wc -l)
if [[ $commits -eq 0 ]]; then
  echo "❌ Change $change_id claims complete, but no related commits in git history"
fi

# 2. Check if test files were added after code (TDD violation detection)
shopt -s globstar nullglob  # make ** recurse; skip the loop when nothing matches
for test_file in tests/**/*.test.*; do
  test_added=$(git log --format=%at --follow -- "$test_file" | tail -1)
  # Find corresponding source file (tests/foo.test.ts -> src/foo.ts)
  src_file=$(echo "$test_file" | sed -e 's#^tests/#src/#' -e 's#\.test\.#.#')
  if [[ -f "$src_file" ]]; then
    src_added=$(git log --format=%at --follow -- "$src_file" | tail -1)
    if [[ $test_added -gt $src_added ]]; then
      echo "⚠️ Test added after code (non-TDD): $test_file"
    fi
  fi
done

# 3. Check for "one-time big commits" (may be bypassing process)
git log --format=%h -20 | while IFS= read -r commit; do
  # Count changed paths directly; parsing --stat text can pick up numbers from the commit message
  files_changed=$(git show --name-only --format= "$commit" | wc -l)
  if [[ $files_changed -gt 20 ]]; then
    echo "⚠️ Big commit detected: $commit modified $files_changed files, may bypass incremental verification"
  fi
done
```

**Deception Detection**:
- ⚠️ Claims complete but no git commits → **Fake Change**
- ⚠️ Tests added after code → **Retroactive Testing**
- ⚠️ Many files in one commit → **Bypassing Incremental Verification**
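
The Retroactive Testing rule reduces to a path mapping plus a timestamp comparison. The sketch below assumes the `tests/<name>.test.<ext>` to `src/<name>.<ext>` layout implied by the shell step; `source_for_test` and `retroactive_tests` are hypothetical helpers, and the first-commit timestamps are assumed to have been collected already (for example with `git log --format=%at --follow`).

```python
import re
from typing import Dict, List, Optional


def source_for_test(test_path: str) -> Optional[str]:
    """Map tests/foo.test.ts to src/foo.ts; return None for paths that do not fit."""
    match = re.fullmatch(r"tests/(.+)\.test\.([^./]+)", test_path)
    return f"src/{match.group(1)}.{match.group(2)}" if match else None


def retroactive_tests(first_commit: Dict[str, int]) -> List[str]:
    """Return test files whose first commit is later than their source file's.

    `first_commit` maps a repository path to the epoch time of its first commit.
    """
    late = []
    for path, added in sorted(first_commit.items()):
        src = source_for_test(path)
        if src in first_commit and added > first_commit[src]:
            late.append(path)
    return late
```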

---

### Check 6: Live Test Run Verification (Most Reliable)

**Principle**: Don't trust any logs; actually run tests

**Verification Steps**:
```bash
# 1. Run full tests
echo "=== Live Test Verification ==="
npm test 2>&1 | tee /tmp/live-test.log
test_status=${PIPESTATUS[0]}  # exit status of npm test, not of tee

# 2. Check results (by exit status: grepping for "failed" also matches "0 failed")
if [[ $test_status -ne 0 ]]; then
  echo "❌ Live tests failed, document claims not trustworthy"
  grep -E 'FAIL|Error|failed' /tmp/live-test.log
else
  echo "✅ Live tests passed"
fi

# 3. Compare live results with evidence files
if [[ -f "evidence/green-final/latest.log" ]]; then
  live_pass=$(grep -cE 'PASS|✓|passed' /tmp/live-test.log)
  evidence_pass=$(grep -cE 'PASS|✓|passed' evidence/green-final/latest.log)
  if [[ $live_pass -ne $evidence_pass ]]; then
    echo "⚠️ Live pass count ($live_pass) ≠ evidence pass count ($evidence_pass)"
  fi
fi
```

**Deception Detection**:
- ⚠️ Evidence says green but live run fails → **Stale Evidence/Fake Green**
- ⚠️ Live pass count differs from evidence → **Evidence Fabrication/Environment Difference**
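
The same live check can be driven from a script that judges the run by the process exit code and then compares pass-marker counts against the stored log. This is a sketch under assumptions: `run_live_tests`, `pass_lines`, and `audit_live_run` are hypothetical names, and `DEMO_CMD` is a stand-in for `["npm", "test"]` so the example runs without a Node project.

```python
import re
import subprocess
import sys
from typing import List, Tuple

PASS_MARK = re.compile(r"PASS|✓|passed")


def run_live_tests(cmd: List[str]) -> Tuple[bool, str]:
    """Run the test command; return (passed, combined stdout and stderr)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    return proc.returncode == 0, proc.stdout


def pass_lines(log_text: str) -> int:
    """Count lines containing a pass marker, like `grep -c` in the shell step."""
    return sum(1 for line in log_text.splitlines() if PASS_MARK.search(line))


def audit_live_run(cmd: List[str], evidence_log: str) -> List[str]:
    """Compare a live run with the stored green-final log and label any mismatch."""
    ok, output = run_live_tests(cmd)
    findings = []
    if not ok:
        findings.append("Stale Evidence/Fake Green")
    if pass_lines(output) != pass_lines(evidence_log):
        findings.append("Evidence Fabrication/Environment Difference")
    return findings


# Stand-in for ["npm", "test"]: prints one pass and one failure, then exits non-zero.
DEMO_CMD = [sys.executable, "-c", "print('PASS a'); print('FAIL b'); raise SystemExit(1)"]
```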

---

## Composite Scoring Algorithm

### Trustworthiness Score (0-100)

```python
# Catalog of issue types per severity; `checks` carries the issues actually detected.

# Critical issues (each -20 points)
CRITICAL = [
    "Evidence empty",
    "Live tests failed",
    "Status claims complete but tests fail",
    "green-final contains failure records",
]

# Warning issues (each -10 points)
WARNINGS = [
    "Evidence timestamp too old",
    "AC corresponding test doesn't exist",
    "Placeholder code",
    "Big commit detected",
]

# Minor issues (each -5 points)
MINOR = [
    "Tests added after code",
    "Evidence file too small",
]


def calculate_trustworthiness(checks):
    """Return a 0-100 score; `checks` exposes critical_issues, warnings, and minor_issues lists."""
    score = 100
    score -= 20 * len(checks.critical_issues)
    score -= 10 * len(checks.warnings)
    score -= 5 * len(checks.minor_issues)
    return max(0, score)
```

### Convergence Determination

| Trustworthiness | Determination | Recommendation |
|-----------------|---------------|----------------|
| 90-100 | ✅ Trustworthy Convergence | Continue current process |
| 70-89 | ⚠️ Partially Trustworthy | Needs supplementary verification |
| 50-69 | 🟠 Questionable | Needs rework of some steps |
| < 50 | 🔴 Untrustworthy | Sisyphus trap; needs comprehensive review |
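
The table above maps directly onto a threshold function. This is a minimal sketch; `determine_convergence` is a hypothetical name, and rejecting scores outside 0-100 is an added assumption rather than something the table specifies.

```python
def determine_convergence(score: int) -> str:
    """Map a 0-100 trustworthiness score to the determination in the table above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "Trustworthy Convergence"
    if score >= 70:
        return "Partially Trustworthy"
    if score >= 50:
        return "Questionable"
    return "Untrustworthy"
```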

---

## Output Format

```markdown
# DevBooks Convergence Audit Report (Anti-Deception Edition)

## Audit Principle
This report applies the "evidence first, distrust declarations" principle. All conclusions are based on verifiable evidence, not document claims.

## Declaration vs Evidence Comparison

| Check Item | Document Claim | Actual Verification | Conclusion |
|------------|----------------|---------------------|------------|
| Status | Done | Tests actually fail | ❌ Fake Completion |
| AC Coverage | 5/5 checked | 2 ACs have no corresponding tests | ❌ False Coverage |
| Test Status | All green | Live run has 3 failures | ❌ Stale Evidence |
| tasks.md | 10/10 complete | 3 tasks have no code | ❌ False Completion |
| evidence/ | Exists | Non-empty, content valid | ✅ Valid |

## Trustworthiness Score

**Total Score**: 45/100 🔴 Untrustworthy

**Deduction Details**:
- -20: Status=Done but live tests fail
- -20: ACs claim full coverage but 2 have no tests
- -10: tasks.md 3 tasks have no code
- -5: Evidence timestamp earlier than code modification

## Deception Detection Results

### 🔴 Detected Fake Completions
1. `change-auth`: Status=Done, but `npm test` reports 3 failures
2. `fix-cache`: AC-003 checked, but `tests/cache.test.ts` doesn't exist

### 🟡 Suspicious Items
1. `refactor-api`: evidence/green-final/ timestamp is 2 days earlier than the last code commit
2. `feature-login`: tasks.md all checked, but `src/login.ts` contains TODO

## True Status Determination

| Change Package | Claimed Status | True Status | Gap |
|----------------|----------------|-------------|-----|
| change-auth | Done | Tests failing | 🔴 Severe |
| fix-cache | Verified | Coverage incomplete | 🟠 Medium |
| refactor-api | Ready | Evidence stale | 🟡 Minor |

## Recommended Actions

### Immediate Action
1. Revert `change-auth` status to `In Progress`
2. Add tests for `fix-cache` AC-003

### Short-term Improvement
1. Establish an evidence timeliness check (evidence must be later than code)
2. Require running the corresponding tests before an AC is checked off

### Process Improvement
1. Prohibit manual Status modification; only allow auto-update after script verification
2. Integrate the convergence check into CI to block fake completions from merging
```

---

## Completion Status

**Status**: ✅ AUDIT_COMPLETED

**Core Findings**:
- Document claim trustworthiness: X%
- Detected fake completions: N
- Changes needing rework: M

**Next Step**:
- Fake completion → Immediately revert status, re-verify
- Suspicious items → Supplement evidence or re-run tests
- Trustworthy items → Continue current process