universal-dev-standards 5.4.0 → 5.5.0
- package/bundled/ai/standards/adversarial-test.ai.yaml +277 -0
- package/bundled/ai/standards/audit-trail.ai.yaml +113 -0
- package/bundled/ai/standards/chaos-injection-tests.ai.yaml +91 -0
- package/bundled/ai/standards/container-image-standards.ai.yaml +88 -0
- package/bundled/ai/standards/container-security.ai.yaml +331 -0
- package/bundled/ai/standards/cost-budget-test.ai.yaml +96 -0
- package/bundled/ai/standards/data-contract.ai.yaml +110 -0
- package/bundled/ai/standards/data-migration-testing.ai.yaml +96 -0
- package/bundled/ai/standards/data-pipeline.ai.yaml +113 -0
- package/bundled/ai/standards/disaster-recovery-drill.ai.yaml +89 -0
- package/bundled/ai/standards/flaky-test-management.ai.yaml +89 -0
- package/bundled/ai/standards/flow-based-testing.ai.yaml +240 -0
- package/bundled/ai/standards/iac-design-principles.ai.yaml +83 -0
- package/bundled/ai/standards/incident-response.ai.yaml +107 -0
- package/bundled/ai/standards/license-compliance.ai.yaml +106 -0
- package/bundled/ai/standards/llm-output-validation.ai.yaml +269 -0
- package/bundled/ai/standards/mock-boundary.ai.yaml +250 -0
- package/bundled/ai/standards/mutation-testing.ai.yaml +192 -0
- package/bundled/ai/standards/pii-classification.ai.yaml +109 -0
- package/bundled/ai/standards/policy-as-code-testing.ai.yaml +227 -0
- package/bundled/ai/standards/prd-standards.ai.yaml +88 -0
- package/bundled/ai/standards/product-metrics-standards.ai.yaml +111 -0
- package/bundled/ai/standards/prompt-regression.ai.yaml +94 -0
- package/bundled/ai/standards/property-based-testing.ai.yaml +105 -0
- package/bundled/ai/standards/release-quality-manifest.ai.yaml +135 -0
- package/bundled/ai/standards/replay-test.ai.yaml +111 -0
- package/bundled/ai/standards/runbook.ai.yaml +104 -0
- package/bundled/ai/standards/sast-advanced.ai.yaml +135 -0
- package/bundled/ai/standards/schema-evolution.ai.yaml +111 -0
- package/bundled/ai/standards/secret-management-standards.ai.yaml +105 -0
- package/bundled/ai/standards/secure-op.ai.yaml +365 -0
- package/bundled/ai/standards/security-testing.ai.yaml +171 -0
- package/bundled/ai/standards/server-ops-security.ai.yaml +274 -0
- package/bundled/ai/standards/slo-sli.ai.yaml +97 -0
- package/bundled/ai/standards/smoke-test.ai.yaml +87 -0
- package/bundled/ai/standards/supply-chain-attestation.ai.yaml +109 -0
- package/bundled/ai/standards/test-completeness-dimensions.ai.yaml +52 -5
- package/bundled/ai/standards/user-story-mapping.ai.yaml +108 -0
- package/bundled/core/adversarial-test.md +212 -0
- package/bundled/core/chaos-injection-tests.md +116 -0
- package/bundled/core/container-security.md +521 -0
- package/bundled/core/cost-budget-test.md +69 -0
- package/bundled/core/data-migration-testing.md +110 -0
- package/bundled/core/disaster-recovery-drill.md +73 -0
- package/bundled/core/flaky-test-management.md +73 -0
- package/bundled/core/flow-based-testing.md +142 -0
- package/bundled/core/llm-output-validation.md +178 -0
- package/bundled/core/mock-boundary.md +100 -0
- package/bundled/core/mutation-testing.md +97 -0
- package/bundled/core/policy-as-code-testing.md +188 -0
- package/bundled/core/prompt-regression.md +72 -0
- package/bundled/core/property-based-testing.md +73 -0
- package/bundled/core/release-quality-manifest.md +147 -0
- package/bundled/core/replay-test.md +86 -0
- package/bundled/core/sast-advanced.md +300 -0
- package/bundled/core/secure-op.md +314 -0
- package/bundled/core/security-testing.md +87 -0
- package/bundled/core/server-ops-security.md +493 -0
- package/bundled/core/smoke-test.md +65 -0
- package/bundled/core/supply-chain-attestation.md +117 -0
- package/bundled/locales/zh-CN/CHANGELOG.md +3 -3
- package/bundled/locales/zh-CN/README.md +1 -1
- package/bundled/locales/zh-CN/skills/ai-instruction-standards/SKILL.md +5 -5
- package/bundled/locales/zh-TW/CHANGELOG.md +3 -3
- package/bundled/locales/zh-TW/README.md +1 -1
- package/bundled/locales/zh-TW/skills/ai-instruction-standards/SKILL.md +183 -79
- package/bundled/skills/README.md +4 -3
- package/bundled/skills/SKILL_NAMING.md +94 -0
- package/bundled/skills/ai-instruction-standards/SKILL.md +181 -88
- package/bundled/skills/atdd-assistant/SKILL.md +8 -0
- package/bundled/skills/bdd-assistant/SKILL.md +7 -0
- package/bundled/skills/checkin-assistant/SKILL.md +8 -0
- package/bundled/skills/code-review-assistant/SKILL.md +7 -0
- package/bundled/skills/journey-test-assistant/SKILL.md +203 -0
- package/bundled/skills/orchestrate/SKILL.md +167 -0
- package/bundled/skills/plan/SKILL.md +234 -0
- package/bundled/skills/pr-automation-assistant/SKILL.md +8 -0
- package/bundled/skills/push/SKILL.md +49 -2
- package/bundled/skills/{process-automation → skill-builder}/SKILL.md +1 -1
- package/bundled/skills/{forward-derivation → spec-derivation}/SKILL.md +1 -1
- package/bundled/skills/spec-driven-dev/SKILL.md +7 -0
- package/bundled/skills/sweep/SKILL.md +145 -0
- package/bundled/skills/tdd-assistant/SKILL.md +7 -0
- package/package.json +1 -1
- package/src/commands/flow.js +8 -0
- package/src/commands/start.js +14 -0
- package/src/commands/sweep.js +8 -0
- package/src/commands/workflow.js +8 -0
- package/standards-registry.json +426 -4
- package/bundled/locales/zh-CN/skills/ac-coverage-assistant/SKILL.md +0 -190
- package/bundled/locales/zh-CN/skills/forward-derivation/SKILL.md +0 -71
- package/bundled/locales/zh-CN/skills/forward-derivation/guide.md +0 -130
- package/bundled/locales/zh-CN/skills/methodology-system/SKILL.md +0 -88
- package/bundled/locales/zh-CN/skills/methodology-system/create-methodology.md +0 -350
- package/bundled/locales/zh-CN/skills/methodology-system/guide.md +0 -131
- package/bundled/locales/zh-CN/skills/methodology-system/runtime.md +0 -279
- package/bundled/locales/zh-CN/skills/process-automation/SKILL.md +0 -143
- package/bundled/locales/zh-TW/skills/ac-coverage-assistant/SKILL.md +0 -195
- package/bundled/locales/zh-TW/skills/deploy-assistant/SKILL.md +0 -178
- package/bundled/locales/zh-TW/skills/forward-derivation/SKILL.md +0 -69
- package/bundled/locales/zh-TW/skills/forward-derivation/guide.md +0 -415
- package/bundled/locales/zh-TW/skills/methodology-system/SKILL.md +0 -86
- package/bundled/locales/zh-TW/skills/methodology-system/create-methodology.md +0 -350
- package/bundled/locales/zh-TW/skills/methodology-system/guide.md +0 -131
- package/bundled/locales/zh-TW/skills/methodology-system/runtime.md +0 -279
- package/bundled/locales/zh-TW/skills/process-automation/SKILL.md +0 -144
- /package/bundled/skills/{ac-coverage-assistant → ac-coverage}/SKILL.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/SKILL.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/create-methodology.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/guide.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/integrated-flow.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/prerequisite-check.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/runtime.md +0 -0
- /package/bundled/skills/{forward-derivation → spec-derivation}/guide.md +0 -0
|
@@ -0,0 +1,73 @@
|
|
|
1
|
+
# Disaster Recovery Drill Standards
|
|
2
|
+
|
|
3
|
+
## Overview
|
|
4
|
+
|
|
5
|
+
An untested DR plan is a false sense of security. Teams that have never executed their recovery runbook under pressure will discover gaps at the worst possible time. DR drills expose these gaps safely.
|
|
6
|
+
|
|
7
|
+
## RTO/RPO Targets
|
|
8
|
+
|
|
9
|
+
Define these before writing the runbook:
|
|
10
|
+
|
|
11
|
+
| Metric | Definition | VibeOps Commercial Target |
|--------|------------|---------------------------|
| RTO (Recovery Time Objective) | Max acceptable downtime | < 1 hour |
| RPO (Recovery Point Objective) | Max acceptable data loss | < 24 hours (daily backup) |
|
|
15
|
+
|
|
16
|
+
## Backup Restore Script
|
|
17
|
+
|
|
18
|
+
```bash
#!/usr/bin/env bash
# scripts/backup-restore.sh: DR drill backup restore verification
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-/var/backups/vibeops}"
RESTORE_DIR="${RESTORE_DIR:-/tmp/dr-restore}"
DB_FILE="${DB_FILE:-vibeops.db}"
RPO_HOURS="${RPO_HOURS:-24}"

echo "=== DR Drill: Backup Restore Verification ==="

# Find the latest backup. "|| true" keeps pipefail from aborting the script
# before the explicit FAIL message when no backup matches.
LATEST=$(ls -t "${BACKUP_DIR}"/*.backup 2>/dev/null | head -n 1 || true)
if [[ -z "$LATEST" ]]; then
  echo "FAIL: No backup found in ${BACKUP_DIR}"
  exit 1
fi

echo "Source: ${LATEST}"
echo "Target: ${RESTORE_DIR}/${DB_FILE}"

# Restore
mkdir -p "$RESTORE_DIR"
cp "$LATEST" "${RESTORE_DIR}/${DB_FILE}"

# Verify integrity (SQLite); skipped when sqlite3 is not installed
if command -v sqlite3 >/dev/null 2>&1; then
  if sqlite3 "${RESTORE_DIR}/${DB_FILE}" "PRAGMA integrity_check;" | grep -qx "ok"; then
    echo "OK: Database integrity check passed"
  else
    echo "FAIL: Integrity check failed"
    exit 1
  fi
fi

# Backup age in hours (GNU stat first, BSD/macOS stat as fallback)
MTIME=$(stat -c %Y "$LATEST" 2>/dev/null || stat -f %m "$LATEST")
BACKUP_AGE=$(( ($(date +%s) - MTIME) / 3600 ))
if (( BACKUP_AGE >= RPO_HOURS )); then
  echo "FAIL: Backup age ${BACKUP_AGE}h exceeds RPO target ${RPO_HOURS}h"
  exit 1
fi
echo "OK: Backup age: ${BACKUP_AGE} hours (RPO target: ${RPO_HOURS}h)"

echo "=== PASS: Restore complete ==="
```
|
|
55
|
+
|
|
56
|
+
## Game Day Protocol
|
|
57
|
+
|
|
58
|
+
1. **Announce**: Notify team 1 week in advance, define scope
2. **Baseline**: Document current system state
3. **Inject**: Simulate failure (rename/delete DB, kill process, etc.)
4. **Execute**: Team follows runbook from scratch, with no shortcuts
5. **Measure**: Record RTO, RPO, issues encountered
6. **Retrospective**: What was unclear? What was missing?
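
The Measure step can be made concrete with a small calculation. The following is a minimal sketch, assuming the drill is logged as three timestamps; the `DrillRecord` shape and its field names are hypothetical and not part of the standard. The default targets are taken from the RTO/RPO table.

```typescript
// Hypothetical record of one drill; field names are illustrative assumptions.
interface DrillRecord {
  failureInjectedAt: Date; // step 3: when the failure was simulated
  serviceRestoredAt: Date; // end of step 4: when the service was usable again
  lastBackupAt: Date;      // timestamp of the backup that was restored
}

interface DrillResult {
  rtoMinutes: number; // measured downtime
  rpoHours: number;   // measured data-loss window
  rtoMet: boolean;
  rpoMet: boolean;
}

const MS_PER_MINUTE = 60 * 1000;
const MS_PER_HOUR = 60 * MS_PER_MINUTE;

// Defaults mirror the targets in the table: RTO < 1 hour, RPO < 24 hours.
function evaluateDrill(
  record: DrillRecord,
  rtoTargetMinutes = 60,
  rpoTargetHours = 24,
): DrillResult {
  const rtoMinutes =
    (record.serviceRestoredAt.getTime() - record.failureInjectedAt.getTime()) / MS_PER_MINUTE;
  const rpoHours =
    (record.failureInjectedAt.getTime() - record.lastBackupAt.getTime()) / MS_PER_HOUR;
  return {
    rtoMinutes,
    rpoHours,
    rtoMet: rtoMinutes < rtoTargetMinutes,
    rpoMet: rpoHours < rpoTargetHours,
  };
}

const result = evaluateDrill({
  failureInjectedAt: new Date("2026-05-05T10:00:00Z"),
  serviceRestoredAt: new Date("2026-05-05T10:42:00Z"),
  lastBackupAt: new Date("2026-05-05T02:00:00Z"),
});
console.log(result);
```

Recording the raw timestamps rather than only the pass/fail outcome keeps the retrospective grounded in what was actually measured.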
|
|
64
|
+
|
|
65
|
+
## Runbook Template
|
|
66
|
+
|
|
67
|
+
See `docs/DR-RUNBOOK.md` for the full runbook template.
|
|
68
|
+
|
|
69
|
+
## Related Standards
|
|
70
|
+
|
|
71
|
+
- [Deployment Standards](deployment-standards.md) — deployment pipeline
|
|
72
|
+
- [Chaos Engineering Standards](chaos-engineering-standards.md) — failure injection
|
|
73
|
+
- [Verification Evidence Standards](verification-evidence.md) — drill records
|
|
@@ -0,0 +1,73 @@
|
|
|
1
|
+
# Flaky Test Management Standards
|
|
2
|
+
|
|
3
|
+
## Overview
|
|
4
|
+
|
|
5
|
+
A single flaky test in a 3000-test suite can erode CI confidence enough that developers start ignoring failures. Once developers learn to "just re-run CI", real bugs slip through. The cost of eliminating flaky tests is always lower than the cost of the false sense of security they create.
|
|
6
|
+
|
|
7
|
+
## Definition
|
|
8
|
+
|
|
9
|
+
A test is **flaky** if it produces different results (pass/fail) on consecutive runs with the same code. The 2% threshold: if a test fails ≥ 2% of runs on `main` without code changes, it is flaky.
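
The 2% rule can be expressed as a small check. This is a sketch under assumptions: the run history is available as an array of outcomes on `main` without code changes, and the names `RunOutcome`, `failureRate`, `isFlaky`, and `history` are illustrative only.

```typescript
type RunOutcome = "pass" | "fail";

// Fraction of runs that failed; 0 for an empty history.
function failureRate(outcomes: RunOutcome[]): number {
  if (outcomes.length === 0) return 0;
  const failures = outcomes.filter((o) => o === "fail").length;
  return failures / outcomes.length;
}

// Flaky means: fails at or above the threshold, but not on every run.
// A test that fails every time gives consistent results, so it is broken rather than flaky.
function isFlaky(outcomes: RunOutcome[], threshold = 0.02): boolean {
  const rate = failureRate(outcomes);
  return rate >= threshold && rate < 1;
}

// Builds a synthetic run history for the example.
function history(total: number, failures: number): RunOutcome[] {
  const outcomes: RunOutcome[] = [];
  for (let i = 0; i < total; i++) {
    outcomes.push(i < failures ? "fail" : "pass");
  }
  return outcomes;
}

console.log(isFlaky(history(100, 3)));   // 3% failure rate
console.log(isFlaky(history(100, 1)));   // 1% failure rate
console.log(isFlaky(history(100, 100))); // always fails
```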
|
|
10
|
+
|
|
11
|
+
## Detection
|
|
12
|
+
|
|
13
|
+
Most CI systems can detect flakiness automatically:
|
|
14
|
+
|
|
15
|
+
- **GitHub Actions**: Look for `Flaky tests detected` annotations
|
|
16
|
+
- **Manual**: Run `npx vitest run --reporter=verbose` 5 times, look for non-deterministic results
|
|
17
|
+
- **Vitest**: `vitest run --repeat=5` (runs each test 5 times)
|
|
18
|
+
|
|
19
|
+
## Quarantine Workflow
|
|
20
|
+
|
|
21
|
+
```
|
|
22
|
+
Detected → Quarantine (< 48h) → Track → Fix or Delete (< 30 days)
|
|
23
|
+
```
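
The two deadlines in the workflow can be tracked mechanically. The sketch below assumes the 30-day window is counted from the quarantine date, which the workflow line does not state explicitly; the `FlakyTestRecord` shape and the state names are hypothetical.

```typescript
type QuarantineState =
  | "needs-quarantine"   // detected, still inside the 48h window
  | "quarantine-overdue" // detected more than 48h ago and not yet quarantined
  | "quarantined"        // quarantined, still inside the 30-day window
  | "fix-overdue";       // quarantined for 30 days or more without fix or deletion

interface FlakyTestRecord {
  detectedAt: Date;
  quarantinedAt?: Date; // unset until the test is skipped and tracked
}

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

function quarantineState(record: FlakyTestRecord, now: Date): QuarantineState {
  if (!record.quarantinedAt) {
    const waited = now.getTime() - record.detectedAt.getTime();
    return waited < 48 * HOUR_MS ? "needs-quarantine" : "quarantine-overdue";
  }
  const inQuarantine = now.getTime() - record.quarantinedAt.getTime();
  return inQuarantine < 30 * DAY_MS ? "quarantined" : "fix-overdue";
}

const fresh = quarantineState(
  { detectedAt: new Date("2026-05-05T00:00:00Z") },
  new Date("2026-05-06T00:00:00Z"),
);
const stale = quarantineState(
  {
    detectedAt: new Date("2026-05-05T00:00:00Z"),
    quarantinedAt: new Date("2026-05-05T12:00:00Z"),
  },
  new Date("2026-06-10T00:00:00Z"),
);
console.log(fresh, stale);
```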
|
|
24
|
+
|
|
25
|
+
### Quarantine Annotation
|
|
26
|
+
|
|
27
|
+
```typescript
// TODO: quarantined 2026-05-05: flaky race condition, see issue #42
it.skip("reconnects after WebSocket disconnect", async () => {
  // ... test body preserved for reference
})
```
|
|
33
|
+
|
|
34
|
+
### Tracking Issue Template
|
|
35
|
+
|
|
36
|
+
```markdown
**Flaky Test**: `describe > test name`
**File**: `src/path/to/test.ts`
**Quarantined**: 2026-05-05
**Failure rate**: ~5% on main
**Known failure mode**: `Cannot read property 'socket' of undefined`
**Root cause hypothesis**: Race condition in WebSocket teardown
**Deadline**: 2026-06-05
```
|
|
45
|
+
|
|
46
|
+
## Common Root Causes
|
|
47
|
+
|
|
48
|
+
| Root Cause | Fix |
|------------|-----|
| Race condition | Use `waitFor()`, `vi.waitFor()`, proper async coordination |
| Shared state | Reset state in `beforeEach`/`afterEach` |
| External service | Mock the dependency |
| File system ordering | Use deterministic sort |
| Random without seed | Set fixed seed in test |
| Timing-dependent | Fake timers (`vi.useFakeTimers()`) |
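
Two of the fixes in the table, a fixed seed and a deterministic sort, can be illustrated without any test framework. This is a sketch only: `seededRandom` is a hypothetical helper built on a linear congruential generator (constants from Numerical Recipes), standing in for whatever seeding mechanism a project's random library actually offers.

```typescript
// Deterministic pseudo-random generator: the same seed always yields the same sequence.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0; // force an unsigned 32-bit integer
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Directory listings come back in no guaranteed order; sort a copy before asserting on them.
function sortedNames(names: string[]): string[] {
  return names.slice().sort();
}

const a = seededRandom(42);
const b = seededRandom(42);
const sequenceA = [a(), a(), a()];
const sequenceB = [b(), b(), b()];
console.log(sequenceA, sequenceB); // two identical sequences

const listing = sortedNames(["b.json", "c.json", "a.json"]);
console.log(listing);
```

A test that draws from `seededRandom(42)` instead of `Math.random()` fails the same way on every run, which turns an intermittent failure into a reproducible one.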
|
|
56
|
+
|
|
57
|
+
## Vitest Configuration
|
|
58
|
+
|
|
59
|
+
```typescript
// vitest.config.ts
import { defineConfig } from "vitest/config"

export default defineConfig({
  test: {
    retry: 2,           // retry failed tests up to 2 times
    testTimeout: 10000, // 10s timeout prevents infinite hangs
    hookTimeout: 5000,  // 5s hook timeout
  },
})
```
|
|
69
|
+
|
|
70
|
+
## Related Standards
|
|
71
|
+
|
|
72
|
+
- [Testing Standards](testing.md) — overall test pyramid
|
|
73
|
+
- [Test Governance Standards](test-governance.md) — CI policies
|
|
@@ -0,0 +1,142 @@
|
|
|
1
|
+
# Flow-Based Testing
|
|
2
|
+
|
|
3
|
+
**Version**: 1.0.0
|
|
4
|
+
**Last Updated**: 2026-05-04
|
|
5
|
+
**Applicability**: All software projects with multi-step workflows
|
|
6
|
+
**Scope**: universal
|
|
7
|
+
**Industry Standards**: ISO/IEC/IEEE 29119-4 (Test Techniques), ISTQB Foundation Syllabus
|
|
8
|
+
**References**: Decision Table Testing (ISTQB), Pairwise Testing, State Transition Testing
|
|
9
|
+
|
|
10
|
+
[English](.) | [繁體中文](../locales/zh-TW/core/flow-based-testing.md)
|
|
11
|
+
|
|
12
|
+
---
|
|
13
|
+
|
|
14
|
+
## Purpose
|
|
15
|
+
|
|
16
|
+
This document defines a systematic methodology for testing multi-step processes. It addresses the gap between AC-centric tests (which verify individual behaviors in isolation) and flow-level tests (which verify sequential behavior with accumulated state and branch coverage).
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## The Core Problem: AC-Centric vs. Flow-Centric Testing
|
|
21
|
+
|
|
22
|
+
AC-centric tests verify that each acceptance criterion works in isolation. However, they miss two critical categories of bugs:
|
|
23
|
+
|
|
24
|
+
1. **Step interaction bugs**: A bug that only manifests when Step 1's output becomes Step 2's input
|
|
25
|
+
2. **Branch coverage gaps**: Decision points that are never exercised with all possible values
|
|
26
|
+
|
|
27
|
+
**Example**: A pipeline has 8 steps. Each AC passes independently. But when the quota check in Step 3 depends on state accumulated in Steps 1 and 2, the interaction is never tested.
|
|
28
|
+
|
|
29
|
+
---
|
|
30
|
+
|
|
31
|
+
## Three-Step Flow Decomposition
|
|
32
|
+
|
|
33
|
+
### Step 1: Flow Identification
|
|
34
|
+
|
|
35
|
+
Before writing any test code, document:
|
|
36
|
+
|
|
37
|
+
- **Preconditions**: The system's initial state
|
|
38
|
+
- **Step sequence**: The ordered list of actions (Step 1 → Step N)
|
|
39
|
+
- **Decision points**: Every if/else/condition in the flow
|
|
40
|
+
- **Terminal states**: All possible end states (success + each distinct failure)
|
|
41
|
+
|
|
42
|
+
### Step 2: Decision Table Expansion
|
|
43
|
+
|
|
44
|
+
For each decision point, list all possible values. Then apply a coverage strategy:
|
|
45
|
+
|
|
46
|
+
| Strategy | When to Use | Scenario Count |
|----------|-------------|----------------|
| **Each-Choice** (minimum) | Low-risk flows, fast feedback | Sum of unique values |
| **Pairwise** | Medium-risk flows | ~N × max_values |
| **All-Combinations** | Auth, payment, security | Product of value counts |
|
|
51
|
+
|
|
52
|
+
**Decision Table Example**:
|
|
53
|
+
|
|
54
|
+
| Decision Point | Values |
|----------------|--------|
| Authorization | valid / expired / missing |
| Quota | sufficient / exceeded |
| External Service | available / timeout / error |
|
|
59
|
+
|
|
60
|
+
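Each-Choice minimum, counting one scenario per value: 3 + 2 + 3 = 8 scenarios (vs. the 1-2 that teams typically write). Because a single scenario exercises one value of every decision point, values can be combined so that as few as 3 scenarios still cover every value at least once.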
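
The scenario counts can be checked by expanding the example decision table in code. This is a minimal sketch: the `DecisionTable` and `Scenario` types and both function names are illustrative assumptions. Counting one scenario per value gives the 8 above; the sketch also shows a combined Each-Choice set of 3 scenarios that still touches every value, and the 3 × 2 × 3 = 18 scenarios of All-Combinations.

```typescript
type DecisionTable = Record<string, string[]>;
type Scenario = Record<string, string>;

const table: DecisionTable = {
  authorization: ["valid", "expired", "missing"],
  quota: ["sufficient", "exceeded"],
  externalService: ["available", "timeout", "error"],
};

// All-Combinations: cartesian product of every decision point's values.
function allCombinations(t: DecisionTable): Scenario[] {
  let scenarios: Scenario[] = [{}];
  for (const point of Object.keys(t)) {
    const next: Scenario[] = [];
    for (const partial of scenarios) {
      for (const value of t[point]) {
        next.push({ ...partial, [point]: value });
      }
    }
    scenarios = next;
  }
  return scenarios;
}

// Each-Choice: every value appears in at least one scenario.
// Row i takes the i-th value of each point, wrapping around for shorter lists.
function eachChoice(t: DecisionTable): Scenario[] {
  const points = Object.keys(t);
  let rows = 0;
  for (const point of points) rows = Math.max(rows, t[point].length);
  const scenarios: Scenario[] = [];
  for (let i = 0; i < rows; i++) {
    const scenario: Scenario = {};
    for (const point of points) scenario[point] = t[point][i % t[point].length];
    scenarios.push(scenario);
  }
  return scenarios;
}

const exhaustive = allCombinations(table);
const minimal = eachChoice(table);
console.log(exhaustive.length, minimal.length);
console.log(minimal);
```

Each generated scenario can then drive one branch test, so the decision table and the test list stay in sync.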
|
|
61
|
+
|
|
62
|
+
### Step 3: Journey Test Structure
|
|
63
|
+
|
|
64
|
+
Write tests with shared state threading — a `ctx` object accumulates state across steps:
|
|
65
|
+
|
|
66
|
+
```typescript
import { describe, it, expect } from "vitest"
// login, createOrder, getOrder, getOrders, exhaustQuota, attemptCreateOrder and the
// credentials / orderData / testUser / testToken fixtures are project-specific (not shown).

describe("Flow: Create Order", () => {
  // Shared context: each step reads what earlier steps stored here
  const ctx: { token?: string; orderId?: string } = {}

  it("Step 1: Login", async () => {
    ctx.token = await login(credentials)
    expect(ctx.token).toBeTruthy()
  })

  it("Step 2: Create order (uses Step 1 token)", async () => {
    ctx.orderId = await createOrder(ctx.token!, orderData)
    expect(ctx.orderId).toMatch(/^ord-/)
  })

  it("Step 3: Verify order state (uses Step 2 orderId)", async () => {
    const order = await getOrder(ctx.token!, ctx.orderId!)
    expect(order.status).toBe("pending")
  })
})

describe("Flow Branch: Quota exceeded path", () => {
  it("should return 429 and NOT create order when quota is exhausted", async () => {
    await exhaustQuota(testUser)
    const response = await attemptCreateOrder(testToken, orderData)
    expect(response.status).toBe(429)
    expect(response.body.code).toBe("QUOTA_EXCEEDED")
    // Verify side effects: no order was created
    const orders = await getOrders(testUser)
    expect(orders.length).toBe(0)
  })
})
```
|
|
98
|
+
|
|
99
|
+
---
|
|
100
|
+
|
|
101
|
+
## Anti-Patterns
|
|
102
|
+
|
|
103
|
+
- Testing only the happy path flow (missing failure terminal states)
|
|
104
|
+
- Resetting shared state between steps (breaks state threading)
|
|
105
|
+
- Testing each step in isolation without verifying accumulated state
|
|
106
|
+
- Using a single test for a flow with multiple decision points
|
|
107
|
+
- Applying All-Combinations to every flow (reserve for critical paths only)
|
|
108
|
+
- Not verifying side effects (or absence thereof) in branch tests
|
|
109
|
+
|
|
110
|
+
---
|
|
111
|
+
|
|
112
|
+
## Relationship to Other Standards
|
|
113
|
+
|
|
114
|
+
- **test-completeness-dimensions**: Dimensions 9 (Flow Completeness) and 10 (Branch Coverage) are defined here
|
|
115
|
+
- **behavior-driven-development**: BDD Scenario Outline tables map to decision table expansion
|
|
116
|
+
- **mock-boundary**: Flow tests must respect mock boundary rules (no mocking own module logic)
|
|
117
|
+
- **e2e-testing**: Journey tests run at ST or E2E level; flow tests can run at IT level with real DB
|
|
118
|
+
|
|
119
|
+
---
|
|
120
|
+
|
|
121
|
+
## Quick Reference Checklist
|
|
122
|
+
|
|
123
|
+
```
Flow: ___________________

□ Step 1: Flow Identification
  □ Preconditions documented
  □ Ordered step sequence listed
  □ All decision points extracted
  □ All terminal states defined

□ Step 2: Decision Table
  □ Decision table created
  □ Coverage strategy chosen (Each-Choice / Pairwise / All-Combinations)
  □ Critical flows (auth/payment/security) → All-Combinations

□ Step 3: Journey Test Structure
  □ Happy path journey test (shared ctx, sequential steps)
  □ Each branch outcome has its own describe block
  □ Branch tests verify both response AND absence of side effects
  □ No beforeEach resetting ctx between steps
```
|
|
@@ -0,0 +1,178 @@
|
|
|
1
|
+
# LLM Output Validation Standards

> Standard ID: `llm-output-validation`
> Version: v1.0.0
> Last updated: 2026-05-05
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
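## Why Validate LLM Output?

LLM output is **non-deterministic**: the same prompt can produce differently formatted output at different times or under different model versions. Left unvalidated, such output can cause silent failures in downstream pipelines: no error is raised, and a wrong default value or `undefined` is used instead.

LLM output validation has three layers:

| Layer | Question | Tools |
|-------|----------|-------|
| Structural validation | Is the output format correct? | JSON Schema, Zod, Pydantic |
| Semantic validation | Are the claimed facts grounded? | NLI probe, grounding check |
| Behavioral validation | Does the agent correctly refuse out-of-bounds requests? | Red-team corpus, refusal evaluation |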
|
|
20
|
+
|
|
21
|
+
---
|
|
22
|
+
|
|
23
|
+
## 1. Schema Contract Test (Structural Validation)

### Core Concept

Every AI agent should declare an `output-schema.json` (in JSON Schema format) and provide a matching contract test.

**A contract test confirms that**:
- the schema itself is valid JSON Schema
- valid fixtures pass validation
- invalid fixtures (missing required field, wrong type, enum violation) are rejected
|
|
33
|
+
|
|
34
|
+
### Recommended Directory Structure

```
agents/<agent-name>/
  output-schema.json          ← JSON Schema definition
  __tests__/
    contract.test.ts          ← Contract test suite
  __fixtures__/
    valid.json                ← golden fixture from real LLM output
    invalid-missing-id.json   ← fixture missing a required field
```
|
|
45
|
+
|
|
46
|
+
### TypeScript Example (Using Ajv)

```typescript
import Ajv from "ajv"
import { it, expect } from "vitest"
// JSON imports require "resolveJsonModule": true in tsconfig.json
import schema from "../output-schema.json"
import validFixture from "../__fixtures__/valid.json"

const ajv = new Ajv({ strict: false })
const validate = ajv.compile(schema)

// Test 1: the schema itself is valid JSON Schema
it("schema is valid JSON Schema", () => {
  expect(ajv.validateSchema(schema)).toBe(true)
})

// Test 2: the valid fixture passes validation
it("valid fixture passes schema", () => {
  expect(validate(validFixture)).toBe(true)
})

// Test 3: an empty object is rejected
it("empty object is rejected", () => {
  expect(validate({})).toBe(false)
})

// Test 4: a fixture missing source_agent is rejected
it("object missing source_agent is rejected", () => {
  const { source_agent: _omitted, ...without } = validFixture
  expect(validate(without)).toBe(false)
})
```
|
|
77
|
+
|
|
78
|
+
### Python Example (Using Pydantic)

```python
from pydantic import ValidationError
from your_module import AgentOutput  # placeholder for the project's Pydantic model

# Test a valid fixture (any further required fields are elided here)
valid_data = {"version": "1.0.0", "source_agent": "planner"}
output = AgentOutput(**valid_data)  # must not raise

# Test an invalid fixture
try:
    AgentOutput(version="bad-format", source_agent="planner")
    assert False, "Should have raised"
except ValidationError:
    pass  # expected behavior
```
|
|
95
|
+
|
|
96
|
+
---
|
|
97
|
+
|
|
98
|
+
## 2. Hallucination Detection (Semantic Validation)

### What Is a Hallucination?

A hallucination is LLM output that sounds correct but has no factual basis. For example:
- a fabricated API documentation URL
- a database column name that does not exist
- a dependency version that never appeared in the context

### Detection Strategies

| Strategy | When to Use | Automation Level |
|----------|-------------|------------------|
| **Structured schema output** | Agent emits JSON; enums restrict the possible values | High (automatic) |
| **Grounding check** | RAG systems whose answers must cite the context | Medium (requires an NLI model) |
| **Confidence tagging** | Agent includes a `confidence` score in its output | Medium (requires prompt design) |
| **Red-team corpus** | Actively testing refusal of out-of-bounds requests | High (automatic) |

### Hallucination Rate Targets

| Agent Type | Schema Compliance Rate | Factual Hallucination Rate |
|------------|------------------------|----------------------------|
| Structured JSON agent | ≥ 99% | ≤ 5% |
| RAG agent | ≥ 95% | ≤ 5% |
| Conversational agent | ≥ 90% | ≤ 10% |
|
|
123
|
+
|
|
124
|
+
---
|
|
125
|
+
|
|
126
|
+
## 3. Prompt Regression Testing

### When to Run Prompt Regression Tests

- Any `agents/*/prompt.md` is modified
- The model version is upgraded (same prompt, different model)
- A required field is added to the schema

### Regression Test Flow

```bash
# 1. Before the change: record the golden output at temperature=0
vibeops run planner --input fixtures/planner-input.json --temp 0 > golden.json

# 2. After the change: rerun and compare
vibeops run planner --input fixtures/planner-input.json --temp 0 > after.json
diff golden.json after.json || true   # review any differences by hand

# 3. Use the contract test to verify that after.json still conforms to the schema
npx vitest run agents/__tests__/contract.test.ts
```
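
Comparing the two outputs by shape rather than by exact text can make the comparison step less noisy. The following is a sketch under assumptions, not part of the standard: `shapeOf` and `shapeDiff` are hypothetical helpers, and the `tasks` field in the sample data is invented for illustration (`version` and `source_agent` follow the fixtures shown earlier).

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Collects "path: type" entries so two outputs can be compared structurally.
function shapeOf(value: Json, path = "$", out: string[] = []): string[] {
  if (Array.isArray(value)) {
    out.push(`${path}: array`);
    for (const item of value) shapeOf(item, `${path}[]`, out);
  } else if (value !== null && typeof value === "object") {
    out.push(`${path}: object`);
    for (const key of Object.keys(value)) shapeOf(value[key], `${path}.${key}`, out);
  } else {
    out.push(`${path}: ${value === null ? "null" : typeof value}`);
  }
  return out;
}

function unique(items: string[]): string[] {
  return items.filter((item, index) => items.indexOf(item) === index);
}

// Entries present in the golden output but not after the change, and vice versa.
function shapeDiff(golden: Json, after: Json): { missing: string[]; added: string[] } {
  const before = unique(shapeOf(golden));
  const now = unique(shapeOf(after));
  return {
    missing: before.filter((entry) => now.indexOf(entry) === -1),
    added: now.filter((entry) => before.indexOf(entry) === -1),
  };
}

const golden: Json = { version: "1.0.0", source_agent: "planner", tasks: [{ id: "t1" }] };
const after: Json = { version: "1.0.0", tasks: [{ id: "t1", priority: 1 }] };
const drift = shapeDiff(golden, after);
console.log(drift);
```

A non-empty `missing` list is a likely regression; a non-empty `added` list usually just needs a schema review.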
|
|
146
|
+
|
|
147
|
+
---
|
|
148
|
+
|
|
149
|
+
## 4. Quality Gates

| Gate | Threshold | Enforcement |
|------|-----------|-------------|
| Schema compliance (CI) | 100% | Block merge |
| Empty object rejected (CI) | 100% | Block merge |
| Regression after prompt change (CI) | Schema compliance maintained | Block merge |
| Hallucination rate (pre-release) | ≤ 5% | Advisory |
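
The gate table can be read as a small decision function. This is a minimal sketch assuming evaluation results are summarized as counts; the `EvalRun` and `GateResult` shapes are hypothetical, and the prompt-regression gate is folded into the schema-compliance check because both require 100% schema conformance.

```typescript
interface EvalRun {
  total: number;                // outputs evaluated
  schemaValid: number;          // outputs that passed schema validation
  emptyObjectRejected: boolean; // did the contract test reject {}?
  hallucinated: number;         // outputs flagged as containing an ungrounded claim
}

interface GateResult {
  gate: string;
  passed: boolean;
  blocking: boolean; // true = "Block merge", false = "Advisory"
}

function evaluateGates(run: EvalRun): GateResult[] {
  const schemaRate = run.total === 0 ? 0 : run.schemaValid / run.total;
  const hallucinationRate = run.total === 0 ? 0 : run.hallucinated / run.total;
  return [
    { gate: "schema-compliance", passed: schemaRate === 1, blocking: true },
    { gate: "empty-object-rejection", passed: run.emptyObjectRejected, blocking: true },
    { gate: "hallucination-rate", passed: hallucinationRate <= 0.05, blocking: false },
  ];
}

// A merge is allowed when every failing gate is advisory.
function canMerge(results: GateResult[]): boolean {
  return results.every((r) => r.passed || !r.blocking);
}

const gates = evaluateGates({
  total: 200,
  schemaValid: 200,
  emptyObjectRejected: true,
  hallucinated: 14, // 7%: above the advisory threshold
});
console.log(gates, canMerge(gates));
```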
|
|
157
|
+
|
|
158
|
+
---
|
|
159
|
+
|
|
160
|
+
## 5. Recommended Tools

| Tool | Language | Purpose |
|------|----------|---------|
| [Ajv](https://ajv.js.org/) | TypeScript/JS | JSON Schema contract test |
| [Zod](https://zod.dev/) | TypeScript | Runtime type validation |
| [Pydantic](https://docs.pydantic.dev/) | Python | Schema + type validation |
| [DeepEval](https://deepeval.com/) | Python | LLM hallucination rate, faithfulness scoring |
| [Ragas](https://docs.ragas.io/) | Python | RAG grounded answer rate |
|
|
169
|
+
|
|
170
|
+
---
|
|
171
|
+
|
|
172
|
+
## Reference Standards

- NIST AI RMF (AI 100-1, 2023): AI risk management framework
- OWASP Top 10 for LLM Applications v1.1: LLM01 Prompt Injection
- ISO/IEC 42001:2023: AI management systems
- [UDS `security-testing.ai.yaml`](./security-testing.md): SAST + DAST integration
- [UDS `adversarial-test.ai.yaml`](./adversarial-test.md): prompt injection red-team standard
|
|
@@ -0,0 +1,100 @@
|
|
|
1
|
+
# Mock Boundary Standards
|
|
2
|
+
|
|
3
|
+
**Version**: 1.0.0
|
|
4
|
+
**Last Updated**: 2026-05-04
|
|
5
|
+
**Applicability**: All software projects with unit and integration tests
|
|
6
|
+
**Scope**: universal
|
|
7
|
+
**Industry Standards**: ISTQB Foundation (Test Doubles), xUnit Patterns (Gerard Meszaros)
|
|
8
|
+
**References**: "Working Effectively with Legacy Code" (Feathers), "Growing Object-Oriented Software" (Freeman & Pryce)
|
|
9
|
+
|
|
10
|
+
[English](.) | [繁體中文](../locales/zh-TW/core/mock-boundary.md)
|
|
11
|
+
|
|
12
|
+
---
|
|
13
|
+
|
|
14
|
+
## Purpose
|
|
15
|
+
|
|
16
|
+
This document defines rules for what can and cannot be mocked in tests. Its goal is to prevent **hollow tests** — tests that always pass but fail to detect real bugs because they replace the system's logic with stubs.
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## The Hollow Test Problem
|
|
21
|
+
|
|
22
|
+
A hollow test mocks so much of the system that the test becomes a specification of mock wiring rather than system behavior. The classic symptom: you can delete the implementation file and the test still passes.
|
|
23
|
+
|
|
24
|
+
**Real example (VibeOps SPEC-002.test.ts)**:
|
|
25
|
+
|
|
26
|
+
```typescript
vi.mock('../../src/runner/agent-runner.js')      // Core logic replaced
vi.mock('../../src/runner/guardian-hooks.js')    // Core logic replaced
vi.mock('../../src/runner/prototyper.js')        // Core logic replaced
vi.mock('../../src/runner/iteration-report.js')  // Core logic replaced
vi.mock('../../src/memory/memory-store.js')      // Core logic replaced
vi.mock('node:fs/promises', ...)                 // I/O replaced

// All assertions verify mock call counts, not actual outputs.
// runPipeline() touches zero real code.
```
|
|
37
|
+
|
|
38
|
+
---
|
|
39
|
+
|
|
40
|
+
## What You CAN Mock
|
|
41
|
+
|
|
42
|
+
| Category | Examples | Reason |
|----------|----------|--------|
| External HTTP services | LLM APIs, payment gateways, email services | Prevents flaky tests; controls response scenarios |
| Time functions | `Date.now()`, `new Date()`, `setTimeout` | Makes tests deterministic |
| Environment variables | `process.env.NODE_ENV`, `process.env.LICENSE_KEY` | Enables config variation |
| File system (unit tests only) | `fs.readFile`, `fs.writeFile` | Avoids I/O in fast unit tests |
| Cross-module boundaries (with IT counterpart) | Other modules' public APIs | Isolates unit under test |
|
|
49
|
+
|
|
50
|
+
---
|
|
51
|
+
|
|
52
|
+
## What You CANNOT Mock
|
|
53
|
+
|
|
54
|
+
| Category | Example Violation | Why Forbidden |
|----------|-------------------|---------------|
| Own module's core logic | `vi.mock('./pipeline-runner.js')` in pipeline-runner tests | Makes the test a no-op |
| Database in IT/flow/E2E tests | `vi.mock('./db/client.js')` in integration tests | Hides query bugs, schema issues |
| HTTP framework internals | `vi.mock('express')` | Real routing may be broken |
| Security controls | Always-pass auth middleware stub | Security regressions invisible |
|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
63
|
+
## Hollow Test Detection
|
|
64
|
+
|
|
65
|
+
Before submitting a test file, check:
|
|
66
|
+
|
|
67
|
+
1. **Mock count ≥ import count** → Review: at least one assertion must verify actual output
2. **All assertions are `.toHaveBeenCalled()` variants** → Add output-value assertions
3. **Mock path matches test subject directory** → Self-referential mock; remove it
4. **More mock setup lines than assertion lines** → Likely hollow
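
Checks 1 and 2 lend themselves to a rough automated heuristic. The sketch below scans a test file's source text with regular expressions; it is an illustration under assumptions, not a real linter. `inspectTestSource` and `HollowReport` are hypothetical names, the `pipeline.js` import path in the sample is invented, and checks 3 and 4 are left out.

```typescript
interface HollowReport {
  mockCount: number;
  importCount: number;
  assertionCount: number;
  callOnlyAssertions: number;
  warnings: string[];
}

function countMatches(source: string, pattern: RegExp): number {
  const matches = source.match(pattern);
  return matches ? matches.length : 0;
}

function inspectTestSource(source: string): HollowReport {
  const mockCount = countMatches(source, /\bvi\.mock\(/g);
  const importCount = countMatches(source, /^\s*import\s/gm);
  const assertionCount = countMatches(source, /\bexpect\(/g);
  const callOnlyAssertions = countMatches(source, /\.toHaveBeenCalled\w*\(/g);

  const warnings: string[] = [];
  if (importCount > 0 && mockCount >= importCount) {
    warnings.push("mock count >= import count");
  }
  if (assertionCount > 0 && callOnlyAssertions === assertionCount) {
    warnings.push("all assertions only check mock calls");
  }
  return { mockCount, importCount, assertionCount, callOnlyAssertions, warnings };
}

const sample = [
  'import { runPipeline } from "../../src/runner/pipeline.js"',
  'import { vi, it, expect } from "vitest"',
  'vi.mock("../../src/runner/agent-runner.js")',
  'vi.mock("../../src/memory/memory-store.js")',
  'it("runs", async () => {',
  "  await runPipeline()",
  "  expect(agentRunner.run).toHaveBeenCalledTimes(1)",
  "})",
].join("\n");

const report = inspectTestSource(sample);
console.log(report);
```

Text matching of this kind produces false positives (for example, `vi.mock(` inside a comment), so a warning is a prompt for review rather than a verdict.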
|
|
71
|
+
|
|
72
|
+
---
|
|
73
|
+
|
|
74
|
+
## Anti-Patterns
|
|
75
|
+
|
|
76
|
+
- **Total Mock Isolation**: Every import mocked; only mock interactions asserted
|
|
77
|
+
- **Mock the World**: External + internal + DB + FS all mocked in one test
|
|
78
|
+
- **Orphan Mock**: Cross-module mock with no integration test counterpart
|
|
79
|
+
- **Security Bypass Mock**: Auth/permission logic replaced with pass-through stub
|
|
80
|
+
- **Database Mock Cascade**: DB returns hardcoded data, hiding real query errors
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## Rules Summary
|
|
85
|
+
|
|
86
|
+
| Rule | Trigger | Action |
|------|---------|--------|
| No self-mock | Test file mocks its own module | Remove mock; let real code run |
| Real DB in IT/flow | Writing IT or flow test | Use in-memory SQLite or test schema |
| IT counterpart | Mocking cross-module boundary | Ensure corresponding IT exists |
| No security mock | Test involves auth/permissions | Use real test user + real token |
| Hollow review | Mock count ≥ import count | Add output-value assertion |
|
|
93
|
+
|
|
94
|
+
---
|
|
95
|
+
|
|
96
|
+
## Relationship to Other Standards
|
|
97
|
+
|
|
98
|
+
- **testing**: Mock boundary rules apply to all test levels in the testing pyramid
|
|
99
|
+
- **test-completeness-dimensions**: Dimension 8 (AI Test Quality) references these rules
|
|
100
|
+
- **flow-based-testing**: Flow tests must follow mock boundary rules
|
|
@@ -0,0 +1,97 @@
|
|
|
1
|
+
# Mutation Testing Standards
|
|
2
|
+
|
|
3
|
+
**Version**: 1.0.0
|
|
4
|
+
**Last Updated**: 2026-05-04
|
|
5
|
+
**Applicability**: All software projects with unit/integration tests
|
|
6
|
+
**Scope**: universal
|
|
7
|
+
**Industry Standards**: ISTQB Foundation Syllabus (test effectiveness metrics)
|
|
8
|
+
**References**: "Introduction to Software Testing" (Ammann & Offutt), Stryker Mutator docs
|
|
9
|
+
|
|
10
|
+
[English](.) | [繁體中文](../locales/zh-TW/core/mutation-testing.md)
|
|
11
|
+
|
|
12
|
+
---
|
|
13
|
+
|
|
14
|
+
## Purpose
|
|
15
|
+
|
|
16
|
+
Mutation testing evaluates test suite effectiveness by injecting artificial bugs and checking whether tests detect them. It answers the question that line coverage cannot: **"Do my tests actually verify correct behavior?"**
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## Key Concept: Mutation Score
|
|
21
|
+
|
|
22
|
+
```
|
|
23
|
+
Mutation Score = Killed Mutants / (Killed + Survived) × 100%
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
- **Killed**: Test suite detected the artificial bug (test failed) ✅
|
|
27
|
+
- **Survived**: Test suite missed the bug (tests still pass) ❌
|
|
28
|
+
|
|
29
|
+
A test with `expect(x).toBeDefined()` can achieve 100% line coverage but survive many mutations (because `x` being `null`, `0`, or `"wrong"` all satisfy `.toBeDefined()`).
|
|
30
|
+
|
|
31
|
+
---
|
|
32
|
+
|
|
33
|
+
## Tools
|
|
34
|
+
|
|
35
|
+
| Language | Tool | Command |
|----------|------|---------|
| TypeScript/JS | Stryker Mutator | `npx stryker run` |
| Python | mutmut | `mutmut run` |
| Java | PIT (Pitest) | `mvn pitest:mutationCoverage` |
|
|
40
|
+
|
|
41
|
+
---
|
|
42
|
+
|
|
43
|
+
## Thresholds
|
|
44
|
+
|
|
45
|
+
| Module Type | Minimum Score | Enforcement |
|-------------|---------------|-------------|
| Auth/License/Payment/Security | 80% | Block release |
| Standard business logic | 70% | Warning; resolve before next release |
| AI-generated tests | 50% | Required; reject if below |
| Overall project | 60% | Track trend; alert on regression |
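
The mutation score formula and the thresholds above combine into a simple check. This is a sketch with hypothetical names (`ModuleType`, `MINIMUM_SCORE`, `meetsThreshold`); in practice the killed and survived counts would come from the mutation tool's report.

```typescript
type ModuleType = "critical" | "business" | "ai-generated-tests" | "overall";

// Minimum scores from the thresholds table, in percent.
const MINIMUM_SCORE: Record<ModuleType, number> = {
  critical: 80, // Auth/License/Payment/Security
  business: 70,
  "ai-generated-tests": 50,
  overall: 60,
};

// Mutation Score = Killed / (Killed + Survived) × 100
function mutationScore(killed: number, survived: number): number {
  const total = killed + survived;
  return total === 0 ? 0 : (killed / total) * 100;
}

function meetsThreshold(type: ModuleType, killed: number, survived: number): boolean {
  return mutationScore(killed, survived) >= MINIMUM_SCORE[type];
}

const score = mutationScore(45, 15);
console.log(score);
console.log(meetsThreshold("business", 45, 15));
console.log(meetsThreshold("critical", 45, 15));
```

With 45 killed and 15 survived mutants the same module passes as standard business logic but fails as a critical module, which is why the module classification matters as much as the raw score.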
|
|
51
|
+
|
|
52
|
+
---
|
|
53
|
+
|
|
54
|
+
## When to Run
|
|
55
|
+
|
|
56
|
+
| Trigger | Command | Enforcement |
|---------|---------|-------------|
| Pre-release gate | `npm run test:mutation` | ≥ 60% overall |
| Critical module change | `npx stryker run --mutate 'src/auth/**'` | ≥ 80% |
| AI-generated test review | `npx stryker run` | ≥ 50% |
|
|
61
|
+
|
|
62
|
+
**Never** add mutation testing to commit hooks — it's too slow (10-60 minutes).
|
|
63
|
+
|
|
64
|
+
---
|
|
65
|
+
|
|
66
|
+
## Stryker Quick Start (TypeScript + Vitest)
|
|
67
|
+
|
|
68
|
+
```bash
|
|
69
|
+
npm install --save-dev @stryker-mutator/core @stryker-mutator/vitest-runner
|
|
70
|
+
```
|
|
71
|
+
|
|
72
|
+
`stryker.config.json`:

```json
{
  "testRunner": "vitest",
  "coverageAnalysis": "perTest",
  "mutate": ["src/license/**/*.ts", "!src/**/*.test.ts"],
  "thresholds": { "high": 80, "low": 60, "break": 50 }
}
```
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## Anti-Patterns
|
|
85
|
+
|
|
86
|
+
- Treating line coverage as a proxy for test effectiveness
|
|
87
|
+
- Adding mutation testing to CI for every PR (too slow)
|
|
88
|
+
- Accepting AI-generated tests without mutation score validation
|
|
89
|
+
- Killing mutations by adding `toBeDefined()` assertions
|
|
90
|
+
|
|
91
|
+
---
|
|
92
|
+
|
|
93
|
+
## Relationship to Other Standards
|
|
94
|
+
|
|
95
|
+
- `test-completeness-dimensions`: Dimension 8 (AI Test Quality) references mutation score
|
|
96
|
+
- `mock-boundary`: Hollow tests survive many mutations; mock boundary rules prevent hollow tests
|
|
97
|
+
- `testing`: Mutation testing is the quality gate on top of the test pyramid
|