universal-dev-standards 5.4.0 → 5.5.0

Files changed (114)
  1. package/bundled/ai/standards/adversarial-test.ai.yaml +277 -0
  2. package/bundled/ai/standards/audit-trail.ai.yaml +113 -0
  3. package/bundled/ai/standards/chaos-injection-tests.ai.yaml +91 -0
  4. package/bundled/ai/standards/container-image-standards.ai.yaml +88 -0
  5. package/bundled/ai/standards/container-security.ai.yaml +331 -0
  6. package/bundled/ai/standards/cost-budget-test.ai.yaml +96 -0
  7. package/bundled/ai/standards/data-contract.ai.yaml +110 -0
  8. package/bundled/ai/standards/data-migration-testing.ai.yaml +96 -0
  9. package/bundled/ai/standards/data-pipeline.ai.yaml +113 -0
  10. package/bundled/ai/standards/disaster-recovery-drill.ai.yaml +89 -0
  11. package/bundled/ai/standards/flaky-test-management.ai.yaml +89 -0
  12. package/bundled/ai/standards/flow-based-testing.ai.yaml +240 -0
  13. package/bundled/ai/standards/iac-design-principles.ai.yaml +83 -0
  14. package/bundled/ai/standards/incident-response.ai.yaml +107 -0
  15. package/bundled/ai/standards/license-compliance.ai.yaml +106 -0
  16. package/bundled/ai/standards/llm-output-validation.ai.yaml +269 -0
  17. package/bundled/ai/standards/mock-boundary.ai.yaml +250 -0
  18. package/bundled/ai/standards/mutation-testing.ai.yaml +192 -0
  19. package/bundled/ai/standards/pii-classification.ai.yaml +109 -0
  20. package/bundled/ai/standards/policy-as-code-testing.ai.yaml +227 -0
  21. package/bundled/ai/standards/prd-standards.ai.yaml +88 -0
  22. package/bundled/ai/standards/product-metrics-standards.ai.yaml +111 -0
  23. package/bundled/ai/standards/prompt-regression.ai.yaml +94 -0
  24. package/bundled/ai/standards/property-based-testing.ai.yaml +105 -0
  25. package/bundled/ai/standards/release-quality-manifest.ai.yaml +135 -0
  26. package/bundled/ai/standards/replay-test.ai.yaml +111 -0
  27. package/bundled/ai/standards/runbook.ai.yaml +104 -0
  28. package/bundled/ai/standards/sast-advanced.ai.yaml +135 -0
  29. package/bundled/ai/standards/schema-evolution.ai.yaml +111 -0
  30. package/bundled/ai/standards/secret-management-standards.ai.yaml +105 -0
  31. package/bundled/ai/standards/secure-op.ai.yaml +365 -0
  32. package/bundled/ai/standards/security-testing.ai.yaml +171 -0
  33. package/bundled/ai/standards/server-ops-security.ai.yaml +274 -0
  34. package/bundled/ai/standards/slo-sli.ai.yaml +97 -0
  35. package/bundled/ai/standards/smoke-test.ai.yaml +87 -0
  36. package/bundled/ai/standards/supply-chain-attestation.ai.yaml +109 -0
  37. package/bundled/ai/standards/test-completeness-dimensions.ai.yaml +52 -5
  38. package/bundled/ai/standards/user-story-mapping.ai.yaml +108 -0
  39. package/bundled/core/adversarial-test.md +212 -0
  40. package/bundled/core/chaos-injection-tests.md +116 -0
  41. package/bundled/core/container-security.md +521 -0
  42. package/bundled/core/cost-budget-test.md +69 -0
  43. package/bundled/core/data-migration-testing.md +110 -0
  44. package/bundled/core/disaster-recovery-drill.md +73 -0
  45. package/bundled/core/flaky-test-management.md +73 -0
  46. package/bundled/core/flow-based-testing.md +142 -0
  47. package/bundled/core/llm-output-validation.md +178 -0
  48. package/bundled/core/mock-boundary.md +100 -0
  49. package/bundled/core/mutation-testing.md +97 -0
  50. package/bundled/core/policy-as-code-testing.md +188 -0
  51. package/bundled/core/prompt-regression.md +72 -0
  52. package/bundled/core/property-based-testing.md +73 -0
  53. package/bundled/core/release-quality-manifest.md +147 -0
  54. package/bundled/core/replay-test.md +86 -0
  55. package/bundled/core/sast-advanced.md +300 -0
  56. package/bundled/core/secure-op.md +314 -0
  57. package/bundled/core/security-testing.md +87 -0
  58. package/bundled/core/server-ops-security.md +493 -0
  59. package/bundled/core/smoke-test.md +65 -0
  60. package/bundled/core/supply-chain-attestation.md +117 -0
  61. package/bundled/locales/zh-CN/CHANGELOG.md +3 -3
  62. package/bundled/locales/zh-CN/README.md +1 -1
  63. package/bundled/locales/zh-CN/skills/ai-instruction-standards/SKILL.md +5 -5
  64. package/bundled/locales/zh-TW/CHANGELOG.md +3 -3
  65. package/bundled/locales/zh-TW/README.md +1 -1
  66. package/bundled/locales/zh-TW/skills/ai-instruction-standards/SKILL.md +183 -79
  67. package/bundled/skills/README.md +4 -3
  68. package/bundled/skills/SKILL_NAMING.md +94 -0
  69. package/bundled/skills/ai-instruction-standards/SKILL.md +181 -88
  70. package/bundled/skills/atdd-assistant/SKILL.md +8 -0
  71. package/bundled/skills/bdd-assistant/SKILL.md +7 -0
  72. package/bundled/skills/checkin-assistant/SKILL.md +8 -0
  73. package/bundled/skills/code-review-assistant/SKILL.md +7 -0
  74. package/bundled/skills/journey-test-assistant/SKILL.md +203 -0
  75. package/bundled/skills/orchestrate/SKILL.md +167 -0
  76. package/bundled/skills/plan/SKILL.md +234 -0
  77. package/bundled/skills/pr-automation-assistant/SKILL.md +8 -0
  78. package/bundled/skills/push/SKILL.md +49 -2
  79. package/bundled/skills/{process-automation → skill-builder}/SKILL.md +1 -1
  80. package/bundled/skills/{forward-derivation → spec-derivation}/SKILL.md +1 -1
  81. package/bundled/skills/spec-driven-dev/SKILL.md +7 -0
  82. package/bundled/skills/sweep/SKILL.md +145 -0
  83. package/bundled/skills/tdd-assistant/SKILL.md +7 -0
  84. package/package.json +1 -1
  85. package/src/commands/flow.js +8 -0
  86. package/src/commands/start.js +14 -0
  87. package/src/commands/sweep.js +8 -0
  88. package/src/commands/workflow.js +8 -0
  89. package/standards-registry.json +426 -4
  90. package/bundled/locales/zh-CN/skills/ac-coverage-assistant/SKILL.md +0 -190
  91. package/bundled/locales/zh-CN/skills/forward-derivation/SKILL.md +0 -71
  92. package/bundled/locales/zh-CN/skills/forward-derivation/guide.md +0 -130
  93. package/bundled/locales/zh-CN/skills/methodology-system/SKILL.md +0 -88
  94. package/bundled/locales/zh-CN/skills/methodology-system/create-methodology.md +0 -350
  95. package/bundled/locales/zh-CN/skills/methodology-system/guide.md +0 -131
  96. package/bundled/locales/zh-CN/skills/methodology-system/runtime.md +0 -279
  97. package/bundled/locales/zh-CN/skills/process-automation/SKILL.md +0 -143
  98. package/bundled/locales/zh-TW/skills/ac-coverage-assistant/SKILL.md +0 -195
  99. package/bundled/locales/zh-TW/skills/deploy-assistant/SKILL.md +0 -178
  100. package/bundled/locales/zh-TW/skills/forward-derivation/SKILL.md +0 -69
  101. package/bundled/locales/zh-TW/skills/forward-derivation/guide.md +0 -415
  102. package/bundled/locales/zh-TW/skills/methodology-system/SKILL.md +0 -86
  103. package/bundled/locales/zh-TW/skills/methodology-system/create-methodology.md +0 -350
  104. package/bundled/locales/zh-TW/skills/methodology-system/guide.md +0 -131
  105. package/bundled/locales/zh-TW/skills/methodology-system/runtime.md +0 -279
  106. package/bundled/locales/zh-TW/skills/process-automation/SKILL.md +0 -144
  107. /package/bundled/skills/{ac-coverage-assistant → ac-coverage}/SKILL.md +0 -0
  108. /package/bundled/skills/{methodology-system → dev-methodology}/SKILL.md +0 -0
  109. /package/bundled/skills/{methodology-system → dev-methodology}/create-methodology.md +0 -0
  110. /package/bundled/skills/{methodology-system → dev-methodology}/guide.md +0 -0
  111. /package/bundled/skills/{methodology-system → dev-methodology}/integrated-flow.md +0 -0
  112. /package/bundled/skills/{methodology-system → dev-methodology}/prerequisite-check.md +0 -0
  113. /package/bundled/skills/{methodology-system → dev-methodology}/runtime.md +0 -0
  114. /package/bundled/skills/{forward-derivation → spec-derivation}/guide.md +0 -0
@@ -0,0 +1,73 @@
1
+ # Disaster Recovery Drill Standards
2
+
3
+ ## Overview
4
+
5
+ An untested DR plan is a false sense of security. Teams that have never executed their recovery runbook under pressure will discover gaps at the worst possible time. DR drills expose these gaps safely.
6
+
7
+ ## RTO/RPO Targets
8
+
9
+ Define these before writing the runbook:
10
+
11
+ | Metric | Definition | VibeOps Commercial Target |
12
+ |--------|-----------|--------------------------|
13
+ | RTO (Recovery Time Objective) | Max acceptable downtime | < 1 hour |
14
+ | RPO (Recovery Point Objective) | Max acceptable data loss | < 24 hours (daily backup) |
15
+
16
+ ## Backup Restore Script
17
+
18
+ ```bash
19
+ #!/usr/bin/env bash
20
+ # scripts/backup-restore.sh — DR drill backup restore verification
21
+ set -euo pipefail
22
+
23
+ BACKUP_DIR="${BACKUP_DIR:-/var/backups/vibeops}"
24
+ RESTORE_DIR="${RESTORE_DIR:-/tmp/dr-restore}"
25
+ DB_FILE="${DB_FILE:-vibeops.db}"
26
+
27
+ echo "=== DR Drill: Backup Restore Verification ==="
28
+ echo "Source: ${BACKUP_DIR}/${DB_FILE}.backup"
29
+ echo "Target: ${RESTORE_DIR}/${DB_FILE}"
30
+
31
+ mkdir -p "$RESTORE_DIR"
32
+
33
+ # Find latest backup ("|| true" keeps pipefail from aborting before the check below)
34
+ LATEST=$(ls -t "${BACKUP_DIR}"/*.backup 2>/dev/null | head -1 || true)
35
+ if [[ -z "$LATEST" ]]; then
36
+   echo "FAIL: No backup found in ${BACKUP_DIR}"
37
+   exit 1
38
+ fi
39
+
40
+ # Restore
41
+ cp "$LATEST" "${RESTORE_DIR}/${DB_FILE}"
42
+
43
+ # Verify integrity (SQLite); -x requires the whole output line to be "ok"
44
+ if command -v sqlite3 >/dev/null 2>&1; then
45
+   sqlite3 "${RESTORE_DIR}/${DB_FILE}" "PRAGMA integrity_check;" | grep -qx "ok" && \
46
+     echo "OK: Database integrity check passed" || \
47
+     { echo "FAIL: Integrity check failed"; exit 1; }
48
+ fi
49
+
50
+ BACKUP_AGE=$(( ($(date +%s) - $(stat -c %Y "$LATEST" 2>/dev/null || stat -f %m "$LATEST")) / 3600 ))
51
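+ (( BACKUP_AGE < 24 )) && echo "OK: Backup age: ${BACKUP_AGE} hours (RPO target: 24h)" || { echo "FAIL: Backup age ${BACKUP_AGE} hours exceeds RPO target (24h)"; exit 1; }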
52
+
53
+ echo "=== PASS: Restore complete ==="
54
+ ```
55
+
56
+ ## Game Day Protocol
57
+
58
+ 1. **Announce**: Notify team 1 week in advance, define scope
59
+ 2. **Baseline**: Document current system state
60
+ 3. **Inject**: Simulate failure (rename/delete DB, kill process, etc.)
61
+ 4. **Execute**: Team follows runbook from scratch — no shortcuts
62
+ 5. **Measure**: Record RTO, RPO, issues encountered
63
+ 6. **Retrospective**: What was unclear? What was missing?
64
+
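The Measure step can be made mechanical. The following is a minimal sketch, assuming a hypothetical `DrillRecord` shape (the field names are illustrative and not part of this standard) and the targets from the RTO/RPO table. It derives the achieved RTO and RPO from three timestamps and compares them with the targets.

```typescript
// Hypothetical record of one drill; field names are illustrative only.
interface DrillRecord {
  failureInjectedAt: Date; // when the failure was simulated (Inject step)
  serviceRestoredAt: Date; // when the team declared recovery (Execute step)
  latestBackupAt: Date;    // timestamp of the backup that was restored
}

interface DrillTargets {
  rtoHours: number; // max acceptable downtime
  rpoHours: number; // max acceptable data loss
}

interface DrillOutcome {
  rtoHours: number;
  rpoHours: number;
  rtoMet: boolean;
  rpoMet: boolean;
}

const MS_PER_HOUR = 60 * 60 * 1000;

function evaluateDrill(record: DrillRecord, targets: DrillTargets): DrillOutcome {
  // Achieved RTO: time from failure injection to restored service.
  const rtoHours =
    (record.serviceRestoredAt.getTime() - record.failureInjectedAt.getTime()) / MS_PER_HOUR;
  // Achieved RPO: everything written between the last backup and the failure is lost.
  const rpoHours =
    (record.failureInjectedAt.getTime() - record.latestBackupAt.getTime()) / MS_PER_HOUR;
  return {
    rtoHours,
    rpoHours,
    rtoMet: rtoHours < targets.rtoHours,
    rpoMet: rpoHours < targets.rpoHours,
  };
}

const drillOutcome = evaluateDrill(
  {
    failureInjectedAt: new Date("2026-05-05T10:00:00Z"),
    serviceRestoredAt: new Date("2026-05-05T10:45:00Z"),
    latestBackupAt: new Date("2026-05-05T02:00:00Z"),
  },
  { rtoHours: 1, rpoHours: 24 },
);
// drillOutcome: rtoHours 0.75, rpoHours 8, both targets met
```

Recording the raw timestamps rather than a pass/fail flag keeps the drill record usable as evidence when the targets change later.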
65
+ ## Runbook Template
66
+
67
+ See `docs/DR-RUNBOOK.md` for the full runbook template.
68
+
69
+ ## Related Standards
70
+
71
+ - [Deployment Standards](deployment-standards.md) — deployment pipeline
72
+ - [Chaos Engineering Standards](chaos-engineering-standards.md) — failure injection
73
+ - [Verification Evidence Standards](verification-evidence.md) — drill records
@@ -0,0 +1,73 @@
1
+ # Flaky Test Management Standards
2
+
3
+ ## Overview
4
+
5
+ A single flaky test in a 3000-test suite can erode CI confidence enough that developers start ignoring failures. Once developers learn to "just re-run CI", real bugs slip through. The cost of eliminating flaky tests is always lower than the cost of the false sense of security they create.
6
+
7
+ ## Definition
8
+
9
+ A test is **flaky** if it produces different results (pass/fail) across repeated runs of the same code. As a working threshold, a test that fails in ≥ 2% of runs on `main` without code changes is treated as flaky.
10
+
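The 2% rule can be applied to a recorded run history. This is a sketch only: the `RunOutcome` and `Verdict` types and the "broken" category for tests that fail on every run are assumptions added for illustration, not part of the standard.

```typescript
// Threshold from the definition above: a failure rate of 2% or more on main.
const FLAKY_FAILURE_RATE = 0.02;

type RunOutcome = "pass" | "fail";
type Verdict = "stable" | "flaky" | "broken";

// Fraction of runs that failed; 0 for an empty history.
function failureRate(history: RunOutcome[]): number {
  if (history.length === 0) return 0;
  const failures = history.filter((outcome) => outcome === "fail").length;
  return failures / history.length;
}

function classifyTest(history: RunOutcome[]): Verdict {
  const rate = failureRate(history);
  // Assumption: a test that fails on every run is deterministic, so it is broken, not flaky.
  if (rate === 1) return "broken";
  return rate >= FLAKY_FAILURE_RATE ? "flaky" : "stable";
}

// Builds a synthetic history for the examples below.
function makeHistory(total: number, failures: number): RunOutcome[] {
  const history: RunOutcome[] = [];
  for (let i = 0; i < total; i++) history.push(i < failures ? "fail" : "pass");
  return history;
}

const threeInHundred = classifyTest(makeHistory(100, 3)); // "flaky": 3% >= 2%
const oneInHundred = classifyTest(makeHistory(100, 1));   // "stable": 1% < 2%
```

The history should only contain runs on `main` with unchanged code; failures caused by real regressions would otherwise inflate the rate.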
11
+ ## Detection
12
+
13
+ Most CI systems can detect flakiness automatically:
14
+
15
+ - **GitHub Actions**: Look for `Flaky tests detected` annotations
16
+ - **Manual**: Run `npx vitest run --reporter=verbose` 5 times, look for non-deterministic results
17
+ - **Vitest**: `vitest run --repeat=5` (runs each test 5 times)
18
+
19
+ ## Quarantine Workflow
20
+
21
+ ```
22
+ Detected → Quarantine (< 48h) → Track → Fix or Delete (< 30 days)
23
+ ```
24
+
25
+ ### Quarantine Annotation
26
+
27
+ ```typescript
28
+ // TODO: quarantined 2026-05-05 — flaky race condition, see issue #42
29
+ it.skip("reconnects after WebSocket disconnect", async () => {
30
+ // ... test body preserved for reference
31
+ })
32
+ ```
33
+
34
+ ### Tracking Issue Template
35
+
36
+ ```markdown
37
+ **Flaky Test**: `describe > test name`
38
+ **File**: `src/path/to/test.ts`
39
+ **Quarantined**: 2026-05-05
40
+ **Failure rate**: ~5% on main
41
+ **Known failure mode**: `Cannot read property 'socket' of undefined`
42
+ **Root cause hypothesis**: Race condition in WebSocket teardown
43
+ **Deadline**: 2026-06-05
44
+ ```
45
+
46
+ ## Common Root Causes
47
+
48
+ | Root Cause | Fix |
49
+ |-----------|-----|
50
+ | Race condition | Use `waitFor()`, `vi.waitFor()`, proper async coordination |
51
+ | Shared state | Reset state in `beforeEach`/`afterEach` |
52
+ | External service | Mock the dependency |
53
+ | File system ordering | Use deterministic sort |
54
+ | Random without seed | Set fixed seed in test |
55
+ | Timing-dependent | Fake timers (`vi.useFakeTimers()`) |
56
+
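For the "Random without seed" row, one option is to inject the random source so tests can pass a seeded one. The sketch below uses a small mulberry32-style generator; `seededRandom` and `shuffle` are illustrative names, and the generator is meant for test determinism only, not for anything security related.

```typescript
// mulberry32-style generator: the same seed always yields the same sequence in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by an injected random source instead of Math.random.
function shuffle<T>(items: readonly T[], random: () => number): T[] {
  const result = items.slice();
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = result[i] as T;
    result[i] = result[j] as T;
    result[j] = tmp;
  }
  return result;
}

const firstRun = shuffle([1, 2, 3, 4, 5], seededRandom(42));
const secondRun = shuffle([1, 2, 3, 4, 5], seededRandom(42));
// firstRun and secondRun hold the same order on every execution
```

Production code can keep calling `shuffle(items, Math.random)`; only the test swaps in the seeded source.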
57
+ ## Vitest Configuration
58
+
59
+ ```typescript
60
+ import { defineConfig } from "vitest/config" // vitest.config.ts
61
+ export default defineConfig({
62
+   test: {
63
+     retry: 2, // retry failed tests up to 2 times
64
+     testTimeout: 10000, // 10s timeout prevents infinite hangs
65
+     hookTimeout: 5000, // 5s hook timeout
66
+   }
67
+ })
68
+ ```
69
+
70
+ ## Related Standards
71
+
72
+ - [Testing Standards](testing.md) — overall test pyramid
73
+ - [Test Governance Standards](test-governance.md) — CI policies
@@ -0,0 +1,142 @@
1
+ # Flow-Based Testing
2
+
3
+ **Version**: 1.0.0
4
+ **Last Updated**: 2026-05-04
5
+ **Applicability**: All software projects with multi-step workflows
6
+ **Scope**: universal
7
+ **Industry Standards**: ISO/IEC/IEEE 29119-4 (Test Techniques), ISTQB Foundation Syllabus
8
+ **References**: Decision Table Testing (ISTQB), Pairwise Testing, State Transition Testing
9
+
10
+ [English](.) | [繁體中文](../locales/zh-TW/core/flow-based-testing.md)
11
+
12
+ ---
13
+
14
+ ## Purpose
15
+
16
+ This document defines a systematic methodology for testing multi-step processes. It addresses the gap between AC-centric tests (which verify individual behaviors in isolation) and flow-level tests (which verify sequential behavior with accumulated state and branch coverage).
17
+
18
+ ---
19
+
20
+ ## The Core Problem: AC-Centric vs. Flow-Centric Testing
21
+
22
+ AC-centric tests verify that each acceptance criterion works in isolation. However, they miss two critical categories of bugs:
23
+
24
+ 1. **Step interaction bugs**: A bug that only manifests when Step 1's output becomes Step 2's input
25
+ 2. **Branch coverage gaps**: Decision points that are never exercised with all possible values
26
+
27
+ **Example**: A pipeline has 8 steps. Each AC passes independently. But when the quota check in Step 3 depends on state accumulated in Steps 1 and 2, the interaction is never tested.
28
+
29
+ ---
30
+
31
+ ## Three-Step Flow Decomposition
32
+
33
+ ### Step 1: Flow Identification
34
+
35
+ Before writing any test code, document:
36
+
37
+ - **Preconditions**: The system's initial state
38
+ - **Step sequence**: The ordered list of actions (Step 1 → Step N)
39
+ - **Decision points**: Every if/else/condition in the flow
40
+ - **Terminal states**: All possible end states (success + each distinct failure)
41
+
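The four items above can be captured as data before any test is written. The following is a sketch under assumptions: `FlowSpec`, `findFlowGaps`, and the example preconditions and terminal-state labels are hypothetical and only mirror the bullets in this section.

```typescript
// Hypothetical shape for a flow description; fields mirror the four bullets above.
interface DecisionPoint {
  label: string;
  values: string[];
}

interface TerminalState {
  label: string;
  kind: "success" | "failure";
}

interface FlowSpec {
  flowName: string;
  preconditions: string[];
  steps: string[];
  decisionPoints: DecisionPoint[];
  terminalStates: TerminalState[];
}

// Returns a list of gaps; an empty list means the flow is ready for Step 2.
function findFlowGaps(spec: FlowSpec): string[] {
  const gaps: string[] = [];
  if (spec.preconditions.length === 0) gaps.push("no preconditions documented");
  if (spec.steps.length < 2) gaps.push("a flow needs at least two ordered steps");
  for (const point of spec.decisionPoints) {
    if (point.values.length < 2) {
      gaps.push(`decision point "${point.label}" has fewer than two values`);
    }
  }
  if (!spec.terminalStates.some((s) => s.kind === "success")) gaps.push("no success terminal state");
  if (!spec.terminalStates.some((s) => s.kind === "failure")) gaps.push("no failure terminal state");
  return gaps;
}

const createOrderFlow: FlowSpec = {
  flowName: "Create Order",
  preconditions: ["registered user", "empty order list"], // illustrative values
  steps: ["Login", "Create order", "Verify order state"],
  decisionPoints: [
    { label: "Authorization", values: ["valid", "expired", "missing"] },
    { label: "Quota", values: ["sufficient", "exceeded"] },
  ],
  terminalStates: [
    { label: "order pending", kind: "success" },
    { label: "429 QUOTA_EXCEEDED", kind: "failure" },
  ],
};

const createOrderGaps = findFlowGaps(createOrderFlow); // no gaps for this spec
```

A spec that lists only the happy path is reported as missing a failure terminal state, which is the first anti-pattern this document warns about.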
42
+ ### Step 2: Decision Table Expansion
43
+
44
+ For each decision point, list all possible values. Then apply a coverage strategy:
45
+
46
+ | Strategy | When to Use | Scenario Count |
47
+ |----------|-------------|---------------|
48
+ | **Each-Choice** (minimum) | Low-risk flows, fast feedback | Sum of unique values |
49
+ | **Pairwise** | Medium-risk flows | ~N × max_values |
50
+ | **All-Combinations** | Auth, payment, security | Product of value counts |
51
+
52
+ **Decision Table Example**:
53
+
54
+ | Decision Point | Values |
55
+ |----------------|--------|
56
+ | Authorization | valid / expired / missing |
57
+ | Quota | sufficient / exceeded |
58
+ | External Service | available / timeout / error |
59
+
60
+ Each-Choice minimum: 3 + 2 + 3 = 8 scenarios (vs. the typical 1-2 that teams actually write).
61
+
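The scenario counts for the example table can be computed rather than estimated. This sketch assumes the decision table is held as a plain object (`DecisionTable` and both function names are illustrative); it counts Each-Choice the way this document does and enumerates All-Combinations as a Cartesian product.

```typescript
type DecisionTable = { [decisionPoint: string]: string[] };
type Scenario = { [decisionPoint: string]: string };

const orderDecisionTable: DecisionTable = {
  Authorization: ["valid", "expired", "missing"],
  Quota: ["sufficient", "exceeded"],
  "External Service": ["available", "timeout", "error"],
};

// Each-Choice as counted in this document: one scenario per listed value.
function eachChoiceCount(table: DecisionTable): number {
  return Object.keys(table).reduce((sum, key) => sum + table[key].length, 0);
}

// All-Combinations: Cartesian product of every decision point's values.
function allCombinations(table: DecisionTable): Scenario[] {
  let scenarios: Scenario[] = [{}];
  for (const key of Object.keys(table)) {
    const next: Scenario[] = [];
    for (const partial of scenarios) {
      for (const value of table[key]) {
        next.push({ ...partial, [key]: value });
      }
    }
    scenarios = next;
  }
  return scenarios;
}

const eachChoiceTotal = eachChoiceCount(orderDecisionTable); // 3 + 2 + 3 = 8
const exhaustive = allCombinations(orderDecisionTable);      // 3 * 2 * 3 = 18 scenarios
```

The combinatorial-testing literature usually defines each-choice coverage as every value appearing in at least one scenario, which this table could satisfy with as few as three scenarios (the largest value count). The count of 8 corresponds to giving each value a scenario of its own.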
62
+ ### Step 3: Journey Test Structure
63
+
64
+ Write tests with shared state threading — a `ctx` object accumulates state across steps:
65
+
66
+ ```typescript
67
+ describe("Flow: Create Order", () => {
68
+   const ctx: { token?: string; orderId?: string } = {}
69
+
70
+   it("Step 1: Login", async () => {
71
+     ctx.token = await login(credentials)
72
+     expect(ctx.token).toBeTruthy()
73
+   })
74
+
75
+   it("Step 2: Create order (uses Step 1 token)", async () => {
76
+     ctx.orderId = await createOrder(ctx.token!, orderData)
77
+     expect(ctx.orderId).toMatch(/^ord-/)
78
+   })
79
+
80
+   it("Step 3: Verify order state (uses Step 2 orderId)", async () => {
81
+     const order = await getOrder(ctx.token!, ctx.orderId!)
82
+     expect(order.status).toBe("pending")
83
+   })
84
+ })
85
+
86
+ describe("Flow Branch: Quota exceeded path", () => {
87
+   it("should return 429 and NOT create order when quota is exhausted", async () => {
88
+     await exhaustQuota(testUser)
89
+     const response = await attemptCreateOrder(testToken, orderData)
90
+     expect(response.status).toBe(429)
91
+     expect(response.body.code).toBe("QUOTA_EXCEEDED")
92
+     // Verify side effects: no order was created
93
+     const orders = await getOrders(testUser)
94
+     expect(orders.length).toBe(0)
95
+   })
96
+ })
97
+ ```
98
+
99
+ ---
100
+
101
+ ## Anti-Patterns
102
+
103
+ - Testing only the happy path flow (missing failure terminal states)
104
+ - Resetting shared state between steps (breaks state threading)
105
+ - Testing each step in isolation without verifying accumulated state
106
+ - Using a single test for a flow with multiple decision points
107
+ - Applying All-Combinations to every flow (reserve for critical paths only)
108
+ - Not verifying side effects (or absence thereof) in branch tests
109
+
110
+ ---
111
+
112
+ ## Relationship to Other Standards
113
+
114
+ - **test-completeness-dimensions**: Dimensions 9 (Flow Completeness) and 10 (Branch Coverage) are defined here
115
+ - **behavior-driven-development**: BDD Scenario Outline tables map to decision table expansion
116
+ - **mock-boundary**: Flow tests must respect mock boundary rules (no mocking own module logic)
117
+ - **e2e-testing**: Journey tests run at the system-test (ST) or E2E level; flow tests can run at the integration-test (IT) level with a real DB
118
+
119
+ ---
120
+
121
+ ## Quick Reference Checklist
122
+
123
+ ```
124
+ Flow: ___________________
125
+
126
+ □ Step 1 — Flow Identification
127
+   □ Preconditions documented
128
+   □ Ordered step sequence listed
129
+   □ All decision points extracted
130
+   □ All terminal states defined
131
+
132
+ □ Step 2 — Decision Table
133
+   □ Decision table created
134
+   □ Coverage strategy chosen (Each-Choice / Pairwise / All-Combinations)
135
+   □ Critical flows (auth/payment/security) → All-Combinations
136
+
137
+ □ Step 3 — Journey Test Structure
138
+   □ Happy path journey test (shared ctx, sequential steps)
139
+   □ Each branch outcome has its own describe block
140
+   □ Branch tests verify both response AND absence of side effects
141
+   □ No beforeEach resetting ctx between steps
142
+ ```
@@ -0,0 +1,178 @@
1
+ # LLM Output Validation Standards
2
+
3
+ > Standard ID: `llm-output-validation`
4
+ > Version: v1.0.0
5
+ > Last Updated: 2026-05-05
6
+
7
+ ---
8
+
9
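+ ## Why Validate LLM Output?
10
+
11
+ LLM output is **non-deterministic**: the same prompt can yield inconsistently formatted output at different times or under different model versions. Left unvalidated, such output can cause silent failures in the downstream pipeline: rather than raising an error, the pipeline carries on with a wrong default value or `undefined`.
12
+
13
+ LLM output validation has three layers:
14
+
15
+ | Layer | Question | Tools |
16
+ |-------|----------|-------|
17
+ | Structural validation | Is the output format correct? | JSON Schema, Zod, Pydantic |
18
+ | Semantic validation | Are the claimed facts grounded? | NLI probe, grounding check |
19
+ | Behavioral validation | Does the agent correctly refuse out-of-bounds requests? | Red-team corpus, refusal evaluation |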
20
+
21
+ ---
22
+
23
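+ ## 1. Schema Contract Test (Structural Validation)
24
+
25
+ ### Core Concept
26
+
27
+ Each AI agent should declare an `output-schema.json` (in JSON Schema format) and provide a corresponding contract test.
28
+
29
+ **Purpose of the contract test**:
30
+ - Confirm that the schema itself is a valid JSON Schema
31
+ - Confirm that valid fixtures pass validation
32
+ - Confirm that invalid fixtures (missing required fields, wrong types, enum violations) are rejected
33
+
34
+ ### Recommended Directory Structure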
35
+
36
+ ```
37
+ agents/<agent-name>/
38
+   output-schema.json            ← JSON Schema definition
39
+   __tests__/
40
+     contract.test.ts            ← Contract test suite
41
+   __fixtures__/
42
+     valid.json                  ← golden fixture from real LLM output
43
+     invalid-missing-id.json     ← fixture missing a required field
44
+ ```
45
+
46
+ ### TypeScript Example (using Ajv)
47
+
48
+ ```typescript
49
+ import Ajv from "ajv"
50
+ import schema from "../output-schema.json"
51
+ import validFixture from "../__fixtures__/valid.json"
52
+
53
+ const ajv = new Ajv({ strict: false })
54
+ const validate = ajv.compile(schema)
55
+
56
+ // Test 1: the schema itself is a valid JSON Schema
57
+ it("schema is valid JSON Schema", () => {
58
+   expect(ajv.validateSchema(schema)).toBe(true)
59
+ })
60
+
61
+ // Test 2: the valid fixture passes validation
62
+ it("valid fixture passes schema", () => {
63
+   expect(validate(validFixture)).toBe(true)
64
+ })
65
+
66
+ // Test 3: an empty object is rejected
67
+ it("empty object is rejected", () => {
68
+   expect(validate({})).toBe(false)
69
+ })
70
+
71
+ // Test 4: an object missing source_agent is rejected
72
+ it("object missing source_agent is rejected", () => {
73
+   const { source_agent, ...without } = validFixture
74
+   expect(validate(without)).toBe(false)
75
+ })
76
+ ```
77
+
78
+ ### Python Example (using Pydantic)
79
+
80
+ ```python
81
+ from pydantic import ValidationError
82
+ from your_module import AgentOutput
83
+
84
+ # Valid fixture: construction must not raise
85
+ valid_data = {"version": "1.0.0", "source_agent": "planner"}  # ... remaining fields elided
86
+ output = AgentOutput(**valid_data)  # does not raise
87
+
88
+ # Invalid fixture: construction must raise ValidationError
89
+ try:
90
+     AgentOutput(version="bad-format", source_agent="planner")
91
+     assert False, "Should have raised"
92
+ except ValidationError:
93
+     pass  # expected behavior
94
+ ```
95
+
96
+ ---
97
+
98
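+ ## 2. Hallucination Detection (Semantic Validation)
99
+
100
+ ### What Is a Hallucination?
101
+
102
+ A hallucination is LLM-generated content that sounds correct but has no factual basis. For example:
103
+ - Fabricated API documentation URLs
104
+ - Database column names that do not exist
105
+ - Dependency versions that never appeared in the context
106
+
107
+ ### Detection Strategies
108
+
109
+ | Strategy | When to Use | Automation Level |
110
+ |----------|-------------|------------------|
111
+ | **Schema-structured output** | Agent emits JSON; enums restrict the allowed values | High (automatic) |
112
+ | **Grounding check** | RAG systems where answers must cite the context | Medium (requires an NLI model) |
113
+ | **Confidence tagging** | Agent includes a `confidence` score in its output | Medium (requires prompt design) |
114
+ | **Red-team corpus** | Actively tests refusal of out-of-bounds requests | High (automatic) |
115
+
116
+ ### Hallucination Rate Targets
117
+
118
+ | Agent Type | Schema Compliance Rate | Factual Hallucination Rate |
119
+ |------------|------------------------|----------------------------|
120
+ | Structured JSON agent | ≥ 99% | ≤ 5% |
121
+ | RAG agent | ≥ 95% | ≤ 5% |
122
+ | Conversational agent | ≥ 90% | ≤ 10% |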
123
+
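The targets in the table above can be checked against the counts from an evaluation run. This is a sketch, not a prescribed implementation: `AgentKind`, `EvalSummary`, and `meetsRateTargets` are hypothetical names, and how each output gets labeled as hallucinated is left to whatever grounding check the team uses.

```typescript
// Targets copied from the table above; the key names are illustrative.
type AgentKind = "structured-json" | "rag" | "conversational";

interface RateTargets {
  minSchemaCompliance: number;  // lower bound, as a fraction
  maxHallucinationRate: number; // upper bound, as a fraction
}

const RATE_TARGETS: { [K in AgentKind]: RateTargets } = {
  "structured-json": { minSchemaCompliance: 0.99, maxHallucinationRate: 0.05 },
  rag: { minSchemaCompliance: 0.95, maxHallucinationRate: 0.05 },
  conversational: { minSchemaCompliance: 0.9, maxHallucinationRate: 0.1 },
};

// Hypothetical summary of one offline evaluation run.
interface EvalSummary {
  totalOutputs: number;
  schemaValidOutputs: number;
  hallucinatedOutputs: number;
}

interface RateCheck {
  schemaCompliance: number;
  hallucinationRate: number;
  schemaOk: boolean;
  hallucinationOk: boolean;
}

function meetsRateTargets(kind: AgentKind, summary: EvalSummary): RateCheck {
  if (summary.totalOutputs <= 0) throw new Error("evaluation run contains no outputs");
  const targets = RATE_TARGETS[kind];
  const schemaCompliance = summary.schemaValidOutputs / summary.totalOutputs;
  const hallucinationRate = summary.hallucinatedOutputs / summary.totalOutputs;
  return {
    schemaCompliance,
    hallucinationRate,
    schemaOk: schemaCompliance >= targets.minSchemaCompliance,
    hallucinationOk: hallucinationRate <= targets.maxHallucinationRate,
  };
}

const ragCheck = meetsRateTargets("rag", {
  totalOutputs: 200,
  schemaValidOutputs: 192,
  hallucinatedOutputs: 8,
});
// 192/200 = 0.96 (>= 0.95) and 8/200 = 0.04 (<= 0.05): both targets met
```

The same counts would fail the schema target for a structured JSON agent, whose bar is 99%.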
124
+ ---
125
+
126
+ ## 3. Prompt Regression Testing
127
+
128
+ ### When to Run Prompt Regression Tests
129
+
130
+ - Any `agents/*/prompt.md` is modified
131
+ - The model version is upgraded (same prompt, different model)
132
+ - A required field is added to the schema
133
+
134
+ ### Regression Test Workflow
135
+
136
+ ```bash
137
+ # 1. Before the change: record the golden output at temperature=0
138
+ vibeops run planner --input fixtures/planner-input.json --temp 0 > golden.json
139
+
140
+ # 2. After the change: rerun and compare
141
+ vibeops run planner --input fixtures/planner-input.json --temp 0 > after.json
142
+
143
+ # 3. Use the contract test to verify that after.json still conforms to the schema
144
+ npx vitest run agents/__tests__/contract.test.ts
145
+ ```
146
+
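The comparison in step 2 is not spelled out by the workflow. One possible approach, sketched here under the assumption that both files have been parsed into plain JSON values, is to compare structure rather than exact text, since wording can legitimately vary between runs. `shapeDiff` is a hypothetical helper, not part of any CLI named in this document.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Coarse shape of a JSON value: "null", "array", "object", or a primitive type name.
function shapeOf(value: Json): string {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value;
}

// Lists keys that were added, removed, or changed type between two outputs.
// Value changes of the same type and array contents are deliberately ignored.
function shapeDiff(golden: Json, after: Json, path: string = "$"): string[] {
  const goldenShape = shapeOf(golden);
  const afterShape = shapeOf(after);
  if (goldenShape !== afterShape) return [`${path}: ${goldenShape} -> ${afterShape}`];
  if (goldenShape !== "object") return [];
  const goldenObj = golden as { [key: string]: Json };
  const afterObj = after as { [key: string]: Json };
  const keys = Object.keys(goldenObj)
    .concat(Object.keys(afterObj))
    .filter((key, index, all) => all.indexOf(key) === index)
    .sort();
  const changes: string[] = [];
  for (const key of keys) {
    if (!(key in afterObj)) changes.push(`${path}.${key}: removed`);
    else if (!(key in goldenObj)) changes.push(`${path}.${key}: added`);
    else changes.push(...shapeDiff(goldenObj[key], afterObj[key], `${path}.${key}`));
  }
  return changes;
}

const goldenOutput: Json = { version: "1.0.0", source_agent: "planner", tasks: [] };
const afterOutput: Json = { version: 1, tasks: [] };
const regressions = shapeDiff(goldenOutput, afterOutput);
// ["$.source_agent: removed", "$.version: string -> number"]
```

A non-empty result points at the fields to inspect before the contract test in step 3 confirms whether the schema is still satisfied.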
147
+ ---
148
+
149
+ ## 4. Quality Gates
150
+
151
+ | Gate | Threshold | Enforcement |
152
+ |------|-----------|-------------|
153
+ | Schema compliance (CI) | 100% | Block merge |
154
+ | Empty-object rejection (CI) | 100% | Block merge |
155
+ | Regression after prompt change (CI) | Schema compliance maintained | Block merge |
156
+ | Hallucination rate (pre-release) | ≤ 5% | Advisory |
157
+
158
+ ---
159
+
160
+ ## 5. Recommended Tools
161
+
162
+ | Tool | Language | Purpose |
163
+ |------|----------|---------|
164
+ | [Ajv](https://ajv.js.org/) | TypeScript/JS | JSON Schema contract test |
165
+ | [Zod](https://zod.dev/) | TypeScript | Runtime type validation |
166
+ | [Pydantic](https://docs.pydantic.dev/) | Python | Schema + type validation |
167
+ | [DeepEval](https://deepeval.com/) | Python | LLM hallucination rate and faithfulness scoring |
168
+ | [Ragas](https://docs.ragas.io/) | Python | RAG grounded answer rate |
169
+
170
+ ---
171
+
172
+ ## Reference Standards
173
+
174
+ - NIST AI RMF (AI 100-1, 2023): AI Risk Management Framework
175
+ - OWASP Top 10 for LLM Applications v1.1, LLM01: Prompt Injection
176
+ - ISO/IEC 42001:2023 (AI management systems)
177
+ - [UDS `security-testing.ai.yaml`](./security-testing.md): SAST + DAST integration
178
+ - [UDS `adversarial-test.ai.yaml`](./adversarial-test.md): prompt-injection red-team standard
@@ -0,0 +1,100 @@
1
+ # Mock Boundary Standards
2
+
3
+ **Version**: 1.0.0
4
+ **Last Updated**: 2026-05-04
5
+ **Applicability**: All software projects with unit and integration tests
6
+ **Scope**: universal
7
+ **Industry Standards**: ISTQB Foundation (Test Doubles), xUnit Patterns (Gerard Meszaros)
8
+ **References**: "Working Effectively with Legacy Code" (Feathers), "Growing Object-Oriented Software" (Freeman & Pryce)
9
+
10
+ [English](.) | [繁體中文](../locales/zh-TW/core/mock-boundary.md)
11
+
12
+ ---
13
+
14
+ ## Purpose
15
+
16
+ This document defines rules for what can and cannot be mocked in tests. Its goal is to prevent **hollow tests** — tests that always pass but fail to detect real bugs because they replace the system's logic with stubs.
17
+
18
+ ---
19
+
20
+ ## The Hollow Test Problem
21
+
22
+ A hollow test mocks so much of the system that the test becomes a specification of mock wiring rather than system behavior. The classic symptom: you can delete the implementation file and the test still passes.
23
+
24
+ **Real example (VibeOps SPEC-002.test.ts)**:
25
+
26
+ ```typescript
27
+ vi.mock('../../src/runner/agent-runner.js') // Core logic replaced
28
+ vi.mock('../../src/runner/guardian-hooks.js') // Core logic replaced
29
+ vi.mock('../../src/runner/prototyper.js') // Core logic replaced
30
+ vi.mock('../../src/runner/iteration-report.js') // Core logic replaced
31
+ vi.mock('../../src/memory/memory-store.js') // Core logic replaced
32
+ vi.mock('node:fs/promises', ...) // I/O replaced
33
+
34
+ // All assertions verify mock call counts — not actual outputs.
35
+ // runPipeline() touches zero real code.
36
+ ```
37
+
38
+ ---
39
+
40
+ ## What You CAN Mock
41
+
42
+ | Category | Examples | Reason |
43
+ |----------|----------|--------|
44
+ | External HTTP services | LLM APIs, payment gateways, email services | Prevents flaky tests; controls response scenarios |
45
+ | Time functions | `Date.now()`, `new Date()`, `setTimeout` | Makes tests deterministic |
46
+ | Environment variables | `process.env.NODE_ENV`, `process.env.LICENSE_KEY` | Enables config variation |
47
+ | File system (unit tests only) | `fs.readFile`, `fs.writeFile` | Avoids I/O in fast unit tests |
48
+ | Cross-module boundaries (with IT counterpart) | Other modules' public APIs | Isolates unit under test |
49
+
50
+ ---
51
+
52
+ ## What You CANNOT Mock
53
+
54
+ | Category | Example Violation | Why Forbidden |
55
+ |----------|-------------------|---------------|
56
+ | Own module's core logic | `vi.mock('./pipeline-runner.js')` in pipeline-runner tests | Makes the test a no-op |
57
+ | Database in IT/flow/E2E tests | `vi.mock('./db/client.js')` in integration tests | Hides query bugs, schema issues |
58
+ | HTTP framework internals | `vi.mock('express')` | Real routing may be broken |
59
+ | Security controls | Always-pass auth middleware stub | Security regressions invisible |
60
+
61
+ ---
62
+
63
+ ## Hollow Test Detection
64
+
65
+ Before submitting a test file, check:
66
+
67
+ 1. **Mock count ≥ import count** → Review: at least one assertion must verify actual output
68
+ 2. **All assertions are `.toHaveBeenCalled()` variants** → Add output-value assertions
69
+ 3. **Mock path matches test subject directory** → Self-referential mock; remove it
70
+ 4. **More mock setup lines than assertion lines** → Likely hollow
71
+
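Three of the four checks above can be approximated with simple text matching. The sketch below is a rough heuristic, assuming Vitest-style `vi.mock(...)` and `expect(...)` calls; `inspectTestSource` is a hypothetical helper, regex counting can be fooled by comments or strings, and check 3 (mock path versus test subject directory) is not covered because it needs file paths.

```typescript
interface HollowReport {
  mockCount: number;
  importCount: number;
  assertionCount: number;
  interactionAssertionCount: number;
  warnings: string[];
}

// Counts matches of a global regular expression in the source text.
function countMatches(source: string, pattern: RegExp): number {
  const found = source.match(pattern);
  return found === null ? 0 : found.length;
}

function inspectTestSource(source: string): HollowReport {
  const mockCount = countMatches(source, /\bvi\.mock\(/g);
  const importCount = countMatches(source, /^\s*import\s/gm);
  const assertionCount = countMatches(source, /\bexpect\(/g);
  const interactionAssertionCount = countMatches(source, /\.toHaveBeenCalled\w*\(/g);

  const warnings: string[] = [];
  // Check 1: mock count >= import count.
  if (mockCount > 0 && mockCount >= importCount) warnings.push("mock count >= import count");
  // Check 2: every assertion is a toHaveBeenCalled variant.
  if (assertionCount > 0 && interactionAssertionCount === assertionCount) {
    warnings.push("all assertions are toHaveBeenCalled variants");
  }
  // Check 4, approximated by call counts instead of line counts.
  if (mockCount > assertionCount) warnings.push("more mocks than assertions");
  return { mockCount, importCount, assertionCount, interactionAssertionCount, warnings };
}

// Illustrative test sources; the module paths are made up for the example.
const hollowSource = [
  'import { runPipeline } from "../src/pipeline.js"',
  'vi.mock("../src/runner/agent-runner.js")',
  'vi.mock("../src/memory/memory-store.js")',
  'it("runs", async () => {',
  "  await runPipeline()",
  "  expect(runAgent).toHaveBeenCalledTimes(1)",
  "})",
].join("\n");

const healthySource = [
  'import { expect, it, vi } from "vitest"',
  'import { parseLicense } from "../src/license.js"',
  'vi.mock("../src/http-client.js")',
  'it("rejects an unsigned key", () => {',
  '  expect(parseLicense("abc").valid).toBe(false)',
  "})",
].join("\n");

const hollowReport = inspectTestSource(hollowSource);   // three warnings
const healthyReport = inspectTestSource(healthySource); // no warnings
```

A warning is a prompt for review, not proof: a test with several legitimate external mocks can still trip check 1.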
72
+ ---
73
+
74
+ ## Anti-Patterns
75
+
76
+ - **Total Mock Isolation**: Every import mocked; only mock interactions asserted
77
+ - **Mock the World**: External + internal + DB + FS all mocked in one test
78
+ - **Orphan Mock**: Cross-module mock with no integration test counterpart
79
+ - **Security Bypass Mock**: Auth/permission logic replaced with pass-through stub
80
+ - **Database Mock Cascade**: DB returns hardcoded data, hiding real query errors
81
+
82
+ ---
83
+
84
+ ## Rules Summary
85
+
86
+ | Rule | Trigger | Action |
87
+ |------|---------|--------|
88
+ | No self-mock | Test file mocks its own module | Remove mock; let real code run |
89
+ | Real DB in IT/flow | Writing IT or flow test | Use in-memory SQLite or test schema |
90
+ | IT counterpart | Mocking cross-module boundary | Ensure corresponding IT exists |
91
+ | No security mock | Test involves auth/permissions | Use real test user + real token |
92
+ | Hollow review | Mock count ≥ import count | Add output-value assertion |
93
+
94
+ ---
95
+
96
+ ## Relationship to Other Standards
97
+
98
+ - **testing**: Mock boundary rules apply to all test levels in the testing pyramid
99
+ - **test-completeness-dimensions**: Dimension 8 (AI Test Quality) references these rules
100
+ - **flow-based-testing**: Flow tests must follow mock boundary rules
@@ -0,0 +1,97 @@
1
+ # Mutation Testing Standards
2
+
3
+ **Version**: 1.0.0
4
+ **Last Updated**: 2026-05-04
5
+ **Applicability**: All software projects with unit/integration tests
6
+ **Scope**: universal
7
+ **Industry Standards**: ISTQB Foundation Syllabus (test effectiveness metrics)
8
+ **References**: "Introduction to Software Testing" (Ammann & Offutt), Stryker Mutator docs
9
+
10
+ [English](.) | [繁體中文](../locales/zh-TW/core/mutation-testing.md)
11
+
12
+ ---
13
+
14
+ ## Purpose
15
+
16
+ Mutation testing evaluates test suite effectiveness by injecting artificial bugs and checking whether tests detect them. It answers the question that line coverage cannot: **"Do my tests actually verify correct behavior?"**
17
+
18
+ ---
19
+
20
+ ## Key Concept: Mutation Score
21
+
22
+ ```
23
+ Mutation Score = Killed Mutants / (Killed + Survived) × 100%
24
+ ```
25
+
26
+ - **Killed**: Test suite detected the artificial bug (test failed) ✅
27
+ - **Survived**: Test suite missed the bug (tests still pass) ❌
28
+
29
+ A test with `expect(x).toBeDefined()` can achieve 100% line coverage but survive many mutations (because `x` being `null`, `0`, or `"wrong"` all satisfy `.toBeDefined()`).
30
+
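The difference between a weak and a strong assertion can be shown without a mutation tool. This is a hand-rolled sketch: the mutant is written by hand, and a "test" is modeled as a function that returns `true` when it passes. Real tools such as Stryker generate and run mutants automatically.

```typescript
// Code under test and a hand-written mutant (">=" mutated to ">").
type AgeCheck = (age: number) => boolean;
const isAdult: AgeCheck = (age) => age >= 18;
const isAdultMutant: AgeCheck = (age) => age > 18;

// A test is modeled as a function that returns true when it passes.
type TestCase = (impl: AgeCheck) => boolean;

// Weak: mirrors expect(x).toBeDefined(); any boolean result passes.
const weakTest: TestCase = (impl) => typeof impl(18) !== "undefined";
// Strong: asserts the expected value at the boundary.
const strongTest: TestCase = (impl) => impl(18) === true;

// A mutant is killed when at least one test fails against it.
function isKilled(mutant: AgeCheck, tests: TestCase[]): boolean {
  return tests.some((test) => !test(mutant));
}

// Mutation Score = Killed / (Killed + Survived) x 100
function mutationScore(killed: number, survived: number): number {
  const total = killed + survived;
  return total === 0 ? 0 : (killed / total) * 100;
}

const baselineGreen = weakTest(isAdult) && strongTest(isAdult); // true: both pass on real code
const killedByWeak = isKilled(isAdultMutant, [weakTest]);       // false: the mutant survives
const killedByStrong = isKilled(isAdultMutant, [strongTest]);   // true: the mutant is killed
const exampleScore = mutationScore(1, 1);                       // 50
```

Both tests execute the same line, so line coverage is identical; only the boundary assertion notices the changed operator.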
31
+ ---
32
+
33
+ ## Tools
34
+
35
+ | Language | Tool | Command |
36
+ |----------|------|---------|
37
+ | TypeScript/JS | Stryker Mutator | `npx stryker run` |
38
+ | Python | mutmut | `mutmut run` |
39
+ | Java | PIT (Pitest) | `mvn pitest:mutationCoverage` |
40
+
41
+ ---
42
+
43
+ ## Thresholds
44
+
45
+ | Module Type | Minimum Score | Enforcement |
46
+ |-------------|--------------|-------------|
47
+ | Auth/License/Payment/Security | 80% | Block release |
48
+ | Standard business logic | 70% | Warning; resolve before next release |
49
+ | AI-generated tests | 50% | Required; reject if below |
50
+ | Overall project | 60% | Track trend; alert on regression |
51
+
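The threshold table can be encoded as a lookup so a release script applies it consistently. This is a sketch under assumptions: the `ModuleKind` keys and `checkMutationGate` are illustrative names, and the scores are expected to come from the mutation tool's report.

```typescript
// Keys are illustrative labels for the four rows of the table above.
type ModuleKind = "critical" | "standard" | "ai-generated-tests" | "overall";

interface ThresholdRule {
  minScore: number;    // minimum mutation score in percent
  enforcement: string; // enforcement text from the table
}

const MUTATION_THRESHOLDS: { [K in ModuleKind]: ThresholdRule } = {
  critical: { minScore: 80, enforcement: "Block release" },
  standard: { minScore: 70, enforcement: "Warning; resolve before next release" },
  "ai-generated-tests": { minScore: 50, enforcement: "Required; reject if below" },
  overall: { minScore: 60, enforcement: "Track trend; alert on regression" },
};

interface GateResult {
  passed: boolean;
  action: string; // "none" when the gate passes, otherwise the enforcement text
}

function checkMutationGate(kind: ModuleKind, score: number): GateResult {
  const rule = MUTATION_THRESHOLDS[kind];
  const passed = score >= rule.minScore;
  return { passed, action: passed ? "none" : rule.enforcement };
}

const authGate = checkMutationGate("critical", 76); // fails: 76 < 80, action "Block release"
const coreGate = checkMutationGate("standard", 76); // passes: 76 >= 70
```

The same score can pass one gate and fail another, which is why the module type has to be part of the check.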
52
+ ---
53
+
54
+ ## When to Run
55
+
56
+ | Trigger | Command | Enforcement |
57
+ |---------|---------|-------------|
58
+ | Pre-release gate | `npm run test:mutation` | ≥ 60% overall |
59
+ | Critical module change | `npx stryker run --mutate 'src/auth/**'` | ≥ 80% |
60
+ | AI-generated test review | `npx stryker run` | ≥ 50% |
61
+
62
+ **Never** add mutation testing to commit hooks — it's too slow (10-60 minutes).
63
+
64
+ ---
65
+
66
+ ## Stryker Quick Start (TypeScript + Vitest)
67
+
68
+ ```bash
69
+ npm install --save-dev @stryker-mutator/core @stryker-mutator/vitest-runner
70
+ ```
71
+
72
+ ```json
73
+ // stryker.config.json
74
+ {
75
+   "testRunner": "vitest",
76
+   "coverageAnalysis": "perTest",
77
+   "mutate": ["src/license/**/*.ts", "!src/**/*.test.ts"],
78
+   "thresholds": { "high": 80, "low": 60, "break": 50 }
79
+ }
80
+ ```
81
+
82
+ ---
83
+
84
+ ## Anti-Patterns
85
+
86
+ - Treating line coverage as a proxy for test effectiveness
87
+ - Adding mutation testing to CI for every PR (too slow)
88
+ - Accepting AI-generated tests without mutation score validation
89
+ - Killing mutations by adding `toBeDefined()` assertions
90
+
91
+ ---
92
+
93
+ ## Relationship to Other Standards
94
+
95
+ - `test-completeness-dimensions`: Dimension 8 (AI Test Quality) references mutation score
96
+ - `mock-boundary`: Hollow tests survive many mutations; mock boundary rules prevent hollow tests
97
+ - `testing`: Mutation testing is the quality gate on top of the test pyramid