@hongmaple0820/scale-engine 0.19.0 → 0.21.0

Files changed (52)
  1. package/README.en.md +17 -3
  2. package/README.md +143 -9
  3. package/dist/api/cli.js +1187 -30
  4. package/dist/api/cli.js.map +1 -1
  5. package/dist/codegraph/CodeIntelligence.d.ts +135 -0
  6. package/dist/codegraph/CodeIntelligence.js +460 -0
  7. package/dist/codegraph/CodeIntelligence.js.map +1 -0
  8. package/dist/context/ContextBudget.d.ts +90 -0
  9. package/dist/context/ContextBudget.js +322 -0
  10. package/dist/context/ContextBudget.js.map +1 -0
  11. package/dist/eval/WorkflowEval.d.ts +161 -0
  12. package/dist/eval/WorkflowEval.js +379 -0
  13. package/dist/eval/WorkflowEval.js.map +1 -0
  14. package/dist/governance/GovernanceRoi.d.ts +25 -0
  15. package/dist/governance/GovernanceRoi.js +70 -0
  16. package/dist/governance/GovernanceRoi.js.map +1 -0
  17. package/dist/governance/ProgressiveGovernance.d.ts +22 -0
  18. package/dist/governance/ProgressiveGovernance.js +159 -0
  19. package/dist/governance/ProgressiveGovernance.js.map +1 -0
  20. package/dist/memory/MemoryBrain.d.ts +135 -0
  21. package/dist/memory/MemoryBrain.js +635 -0
  22. package/dist/memory/MemoryBrain.js.map +1 -0
  23. package/dist/memory/index.d.ts +1 -0
  24. package/dist/memory/index.js +1 -0
  25. package/dist/memory/index.js.map +1 -1
  26. package/dist/output/GovernanceDashboard.d.ts +57 -0
  27. package/dist/output/GovernanceDashboard.js +250 -0
  28. package/dist/output/GovernanceDashboard.js.map +1 -0
  29. package/dist/output/index.d.ts +2 -0
  30. package/dist/output/index.js +1 -0
  31. package/dist/output/index.js.map +1 -1
  32. package/dist/skills/SkillRadar.d.ts +83 -0
  33. package/dist/skills/SkillRadar.js +384 -0
  34. package/dist/skills/SkillRadar.js.map +1 -0
  35. package/dist/workflow/GovernanceTemplates.js +220 -194
  36. package/dist/workflow/GovernanceTemplates.js.map +1 -1
  37. package/dist/workflow/UpgradeManager.d.ts +140 -0
  38. package/dist/workflow/UpgradeManager.js +434 -0
  39. package/dist/workflow/UpgradeManager.js.map +1 -0
  40. package/docs/CODE_INTELLIGENCE.md +138 -0
  41. package/docs/CONTEXT_BUDGET.md +87 -0
  42. package/docs/GOVERNANCE_DASHBOARD.md +69 -0
  43. package/docs/MEMORY_BRAIN.md +104 -0
  44. package/docs/README.md +17 -8
  45. package/docs/SKILL_RADAR.md +115 -0
  46. package/docs/WORKFLOW_EVAL.md +151 -0
  47. package/docs/start/README.md +5 -1
  48. package/examples/demo-projects/agent-governance-demo/CONTEXT.md +14 -0
  49. package/examples/demo-projects/agent-governance-demo/README.md +32 -21
  50. package/examples/demo-projects/agent-governance-demo/docs/CONTEXT-MAP.md +14 -0
  51. package/examples/demo-projects/agent-governance-demo/package.json +6 -1
  52. package/package.json +7 -1
@@ -0,0 +1,87 @@
1
+ # Context Budget And Progressive Governance
2
+
3
+ Status: implemented baseline
4
+ Since: v0.20 development branch
5
+
6
+ This feature keeps SCALE from becoming its own context pollution source. It separates always-loaded rules from on-demand documents, runtime evidence, historical archives, and generated artifacts.
7
+
8
+ ## Commands
9
+
10
+ Report token cost by context category:
11
+
12
+ ```bash
13
+ scale context budget --json
14
+ ```
15
+
16
+ Write the report to `.scale/context-budget.json`:
17
+
18
+ ```bash
19
+ scale context budget --write
20
+ ```
21
+
22
+ Check thresholds:
23
+
24
+ ```bash
25
+ scale context doctor --max-always 2500 --max-task 8000
26
+ ```
27
+
28
+ Build a lazy-loaded task context pack:
29
+
30
+ ```bash
31
+ scale context pack \
32
+ --task "Review frontend route with browser evidence" \
33
+ --level L \
34
+ --files src/routes/upload.tsx \
35
+ --budget 4000 \
36
+ --json
37
+ ```
38
+
39
+ Evaluate progressive governance mode:
40
+
41
+ ```bash
42
+ scale governance mode \
43
+ --task "Change auth permissions and database migration" \
44
+ --files src/auth/user.ts,migrations/001.sql \
45
+ --requested-mode minimal \
46
+ --json
47
+ ```
48
+
49
+ Report governance benefit and overhead:
50
+
51
+ ```bash
52
+ scale governance roi \
53
+ --task-id TASK-123 \
54
+ --task "Review frontend route with browser evidence" \
55
+ --files src/routes/upload.tsx \
56
+ --json
57
+ ```
58
+
59
+ ## Categories
60
+
61
+ | Category | Meaning | Loading Policy |
62
+ | --- | --- | --- |
63
+ | `always` | Tiny entrypoint rules and source-of-truth governance config | Keep under strict token budget |
64
+ | `on-demand` | Domain docs and governance guides | Load only when task trigger matches |
65
+ | `evidence` | Runtime evidence and task artifacts | Summarize and reference by path |
66
+ | `archive` | Historical plans and old roadmap context | Do not load unless explicitly requested |
67
+ | `generated` | HTML reports, screenshots, graph outputs, generated artifacts | Keep manifest-only by default |
68
+
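The category table above can be read as a small accounting model. The following is a minimal sketch, not the package's implementation: the `ContextItem` shape, the four-characters-per-token estimate, and the function names are all assumptions made for illustration. Only the category names and the idea of an `always` threshold (as in `--max-always`) come from the document.

```typescript
// Hypothetical types: the real report shape of `scale context budget` is not shown here.
type ContextCategory = 'always' | 'on-demand' | 'evidence' | 'archive' | 'generated';

interface ContextItem {
  path: string;
  category: ContextCategory;
  text: string;
}

// Assumption: a rough heuristic of about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Sum the estimated token cost of every item, grouped by category.
function budgetByCategory(items: ContextItem[]): Record<ContextCategory, number> {
  const totals: Record<ContextCategory, number> = {
    always: 0,
    'on-demand': 0,
    evidence: 0,
    archive: 0,
    generated: 0,
  };
  for (const item of items) {
    totals[item.category] += estimateTokens(item.text);
  }
  return totals;
}

// Mirrors the idea behind `--max-always`: only the always-loaded category is gated here.
function exceedsAlwaysBudget(items: ContextItem[], maxAlways: number): boolean {
  return budgetByCategory(items).always > maxAlways;
}
```

Keeping the gate on `always` alone reflects the loading policy column: the other categories are loaded on a trigger, summarized, or kept manifest-only, so their size matters less on every turn.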
69
+ ## Progressive Governance
70
+
71
+ SCALE now has a baseline risk classifier. It keeps low-risk documentation work in `minimal` mode and escalates risky tasks to `standard`, `expanded`, or `critical`.
72
+
73
+ Examples:
74
+
75
+ | Signal | Mode |
76
+ | --- | --- |
77
+ | README typo | `minimal` |
78
+ | normal implementation task | `standard` |
79
+ | UI, browser, E2E, public interface, or cross-module work | `expanded` |
80
+ | auth, permission, secret, database, migration, production config, release, or destructive operation | `critical` |
81
+
82
+ This is not a replacement for verification. It only decides which governance behavior should activate.
83
+
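As a rough illustration of the signal table above, a keyword-based classifier could look like the sketch below. The regular expressions, the function names, and the rule that a requested mode can raise but never lower the detected mode are assumptions; the document only states the four modes and the example signals.

```typescript
type GovernanceMode = 'minimal' | 'standard' | 'expanded' | 'critical';

// Ordered from least to most governance.
const MODE_ORDER: GovernanceMode[] = ['minimal', 'standard', 'expanded', 'critical'];

// Keyword lists paraphrase the signal table; the real classifier rules are not shown in the doc.
const CRITICAL = /\b(auth|permissions?|secrets?|database|migrations?|production|release|destructive)\b/i;
const EXPANDED = /\b(ui|browser|e2e|public interface|cross-module)\b/i;
const MINIMAL = /\b(readme|typo|documentation)\b/i;

// Classify a task from its description and touched files; highest risk wins.
function classify(task: string, files: string[]): GovernanceMode {
  const haystack = [task, ...files].join(' ');
  if (CRITICAL.test(haystack)) return 'critical';
  if (EXPANDED.test(haystack)) return 'expanded';
  if (MINIMAL.test(haystack)) return 'minimal';
  return 'standard';
}

// Assumption: a requested mode may escalate governance but cannot downgrade a detected risk.
function resolveMode(task: string, files: string[], requested?: GovernanceMode): GovernanceMode {
  const detected = classify(task, files);
  if (!requested) return detected;
  return MODE_ORDER.indexOf(requested) > MODE_ORDER.indexOf(detected) ? requested : detected;
}
```

Under these assumptions, the `scale governance mode` example above (an auth and migration task with `--requested-mode minimal`) would resolve to `critical`, because the detected risk outranks the request.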
84
+ ## Governance ROI
85
+
86
+ `scale governance roi` reports both benefit and overhead. Early ROI is estimated from context budget and risk signals. Later versions should replace estimates with measured eval data such as file reads saved, tool calls saved, fix iterations reduced, and human corrections avoided.
87
+
@@ -0,0 +1,69 @@
1
+ # Governance Dashboard
2
+
3
+ Status: implemented baseline
4
+ Since: v0.25 development branch
5
+
6
+ Governance Dashboard turns existing SCALE evidence into a single reviewable HTML page. It does not replace Markdown, JSON, runtime evidence, eval records, or memory. It is a human-facing view over those sources.
7
+
8
+ ## Command
9
+
10
+ ```bash
11
+ scale artifact dashboard
12
+ scale artifact dashboard --task-id <task-id>
13
+ scale artifact dashboard --dir /path/to/project
14
+ scale artifact dashboard --output docs/worklog/tasks/<task-id>/artifacts/governance-dashboard.html
15
+ scale artifact dashboard --json
16
+ ```
17
+
18
+ Default output:
19
+
20
+ ```text
21
+ .scale/reports/governance-dashboard.html
22
+ .scale/reports/governance-dashboard-manifest.json
23
+ ```
24
+
25
+ The default lifecycle is `generated-report` and the default Git policy is `ignore`. Promote or commit only dashboards that are intentionally used as reviewed task evidence or release evidence.
26
+
27
+ When `--dir` is used and `SCALE_DIR` is not set, the default `.scale` directory is resolved inside the target project directory, not inside the shell's current working directory. This matters for scaffold and multi-repo validation runs.
28
+
29
+ ## Inputs
30
+
31
+ The dashboard reads existing local evidence:
32
+
33
+ | Area | Source |
34
+ | --- | --- |
35
+ | Runtime evidence | `.scale/evidence/runtime/` |
36
+ | Workflow eval | `.scale/evals/runs/` and `.scale/evals/failures/` |
37
+ | Memory Brain | `.scale/memory/brain.sqlite` |
38
+ | Resource Governance | workspace files plus `.scale/resource-policy.json` and `.scale/assets.json` |
39
+ | HTML artifacts | task artifact manifests and rendered HTML files |
40
+
41
+ ## Status Model
42
+
43
+ - Runtime evidence failures are blocking.
44
+ - Memory contradictions are blocking.
45
+ - Resource Governance failures are blocking.
46
+ - Open eval failure replays are warnings, because they may be intentional baseline failures or pending improvement work.
47
+ - Missing task HTML artifacts are informational.
48
+
49
+ This keeps the dashboard useful as a review surface without turning every observation into a hard gate.
50
+
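The status model above maps each evidence area to a severity. A minimal sketch of that mapping follows; the `DashboardSignals` field names, the `Finding` shape, and the `overallStatus` labels are hypothetical, and only the blocking/warning/informational split is taken from the list.

```typescript
type Severity = 'blocking' | 'warning' | 'info';

// Hypothetical counts gathered from the dashboard inputs.
interface DashboardSignals {
  runtimeEvidenceFailures: number;
  memoryContradictions: number;
  resourceGovernanceFailures: number;
  openEvalFailureReplays: number;
  missingTaskHtmlArtifacts: number;
}

interface Finding {
  area: string;
  severity: Severity;
  count: number;
}

// Apply the status model: three blocking areas, one warning, one informational.
function collectFindings(s: DashboardSignals): Finding[] {
  const rules: Array<[string, Severity, number]> = [
    ['runtime-evidence', 'blocking', s.runtimeEvidenceFailures],
    ['memory-contradictions', 'blocking', s.memoryContradictions],
    ['resource-governance', 'blocking', s.resourceGovernanceFailures],
    ['eval-failure-replays', 'warning', s.openEvalFailureReplays],
    ['task-html-artifacts', 'info', s.missingTaskHtmlArtifacts],
  ];
  return rules
    .filter(([, , count]) => count > 0)
    .map(([area, severity, count]) => ({ area, severity, count }));
}

// The most severe finding decides the overall status.
function overallStatus(findings: Finding[]): 'blocked' | 'warning' | 'ok' {
  if (findings.some((f) => f.severity === 'blocking')) return 'blocked';
  if (findings.some((f) => f.severity === 'warning')) return 'warning';
  return 'ok';
}
```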
51
+ ## Recommended Use
52
+
53
+ For M/L/CRITICAL work:
54
+
55
+ ```bash
56
+ scale verify <task-id>
57
+ scale eval run --suite workflow-baseline
58
+ scale memory dream --json
59
+ scale artifact dashboard --task-id <task-id>
60
+ ```
61
+
62
+ For release review:
63
+
64
+ ```bash
65
+ scale artifact dashboard
66
+ scale artifact open --artifact-dir .scale/reports --type governance-dashboard --print-only
67
+ ```
68
+
69
+ The dashboard should be attached to a release or PR only when it is deliberately selected as a review artifact. Routine generated dashboards should stay local.
@@ -0,0 +1,104 @@
1
+ # Memory Brain
2
+
3
+ Memory Brain is SCALE's project-scoped long-term memory layer. It is separate from Memory Fabric:
4
+
5
+ - Memory Fabric builds a compact context pack for the current task.
6
+ - Memory Brain stores reviewed project knowledge with evidence, confidence, scope, and contradiction checks.
7
+
8
+ The first version is local-first and uses SQLite:
9
+
10
+ ```text
11
+ .scale/memory/brain.sqlite
12
+ .scale/memory/brain-manifest.json
13
+ ```
14
+
15
+ ## Commands
16
+
17
+ ```bash
18
+ scale memory ingest --from evidence --task-id <task-id>
19
+ scale memory ingest --from candidate --candidate-id <candidate-id>
20
+ scale memory ingest --from failure --failure-id <failure-replay-id>
21
+ scale memory query "OAuth callback state design"
22
+ scale memory contradictions
23
+ scale memory dream
24
+ scale memory promote <memory-node-id-or-candidate-id>
25
+ scale memory export --output .scale/memory/export.jsonl
26
+ scale memory import .scale/memory/export.jsonl
27
+ ```
28
+
29
+ ## Node Contract
30
+
31
+ ```ts
32
+ interface MemoryNode {
33
+ id: string
34
+ type: 'fact' | 'decision' | 'incident' | 'relation' | 'contradiction'
35
+ title: string
36
+ summary: string
37
+ entities: string[]
38
+ source: 'runtime-evidence' | 'task-artifact' | 'docs' | 'git' | 'manual'
39
+ evidencePaths: string[]
40
+ confidence: number
41
+ scope: 'project' | 'workspace' | 'global-candidate'
42
+ status: 'candidate' | 'active' | 'stale' | 'rejected'
43
+ createdAt: string
44
+ updatedAt: string
45
+ lastVerifiedAt?: string
46
+ }
47
+ ```
48
+
49
+ ## Evidence Rule
50
+
51
+ Active memory must have at least one evidence path. SCALE blocks promotion when this is not true.
52
+
53
+ Runtime evidence and learning candidates are ingested as `candidate` records first. `scale memory promote` is the explicit boundary where reviewed memory becomes active.
54
+
55
+ Failure replay records can also be ingested as `incident` candidates:
56
+
57
+ ```bash
58
+ scale eval run --suite workflow-baseline
59
+ scale eval failures --since 30d
60
+ scale memory ingest --from failure --failure-id <failure-replay-id>
61
+ scale memory promote <memory-node-id>
62
+ ```
63
+
64
+ This connects Eval Harness failures to long-term memory without automatically rewriting project standards. A failure becomes active memory only after promotion and only if the replay artifact is present as evidence.
65
+
66
+ ## Scope Rule
67
+
68
+ Project memory stays project-scoped by default. `global-candidate` is allowed for export and review, but it cannot be activated inside a project brain. This prevents one project's temporary truth from becoming a global rule.
69
+
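The Evidence Rule and Scope Rule above amount to two guards on promotion. The sketch below shows one way to express them; `MemoryNodeLike` is a trimmed copy of the `MemoryNode` contract, and `promotionBlockers` and `promote` are hypothetical helper names, not functions exported by the package.

```typescript
// Subset of the MemoryNode contract; only the fields these checks need.
interface MemoryNodeLike {
  id: string;
  evidencePaths: string[];
  scope: 'project' | 'workspace' | 'global-candidate';
  status: 'candidate' | 'active' | 'stale' | 'rejected';
}

// Return every reason the node may not become active; an empty list means it can.
function promotionBlockers(node: MemoryNodeLike): string[] {
  const blockers: string[] = [];
  if (node.evidencePaths.length === 0) {
    blockers.push('active memory needs at least one evidence path');
  }
  if (node.scope === 'global-candidate') {
    blockers.push('global-candidate cannot be activated inside a project brain');
  }
  return blockers;
}

// Promote a reviewed node, or throw with all blockers listed.
function promote(node: MemoryNodeLike): MemoryNodeLike {
  const blockers = promotionBlockers(node);
  if (blockers.length > 0) {
    throw new Error(`cannot promote ${node.id}: ${blockers.join('; ')}`);
  }
  return { ...node, status: 'active' };
}
```

Collecting all blockers before failing is a design choice for this sketch: a reviewer sees every missing requirement in one pass instead of fixing them one at a time.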
70
+ ## Contradiction Rule
71
+
72
+ `scale memory contradictions` reports conflicts instead of resolving them automatically. Examples:
73
+
74
+ - one memory says a provider is enabled, another says it is disabled
75
+ - one memory says a route exists, another says it is missing
76
+ - one memory says an operation is allowed, another says it is blocked
77
+
78
+ The command exits non-zero when active contradictions exist.
79
+
80
+ ## Dream Maintenance
81
+
82
+ `scale memory dream` is a maintenance pass. It reports:
83
+
84
+ - promotion candidates
85
+ - stale active memories
86
+ - duplicate groups
87
+ - contradictions
88
+ - suggested docs to update
89
+ - active memories missing evidence
90
+
91
+ It does not auto-promote standards, rewrite docs, or delete memories.
92
+
93
+ ## Resource Lifecycle
94
+
95
+ Memory Brain files under `.scale/memory/` are local runtime state by default. Commit only curated exports, documented decisions, or task artifacts that were intentionally reviewed.
96
+
97
+ Recommended flow:
98
+
99
+ ```text
100
+ runtime evidence -> memory settle -> memory ingest -> memory promote -> docs/standards update when stable
101
+ eval failure replay -> memory ingest --from failure -> memory promote -> workflow rule update when stable
102
+ ```
103
+
104
+ This keeps memory useful without turning every session observation into permanent project truth.
package/docs/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # SCALE Engine Documentation Map
2
2
 
3
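- This directory contains user guides, reference documents, historical plans, and promotional material. To keep new users from getting lost, read it in the layers below.
3
+ This directory contains user guides, governance capability notes, architecture references, historical plans, and promotional material. New users should start with the getting-started entry points and the current governance capability docs; historical plans are background material only.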
4
4
 
5
5
  ## New User Entry Points
6
6
 
@@ -20,10 +20,17 @@
20
20
  | [TOOL_ORCHESTRATION.md](TOOL_ORCHESTRATION.md) | Orchestration strategy for skills, MCP, CLI, browser, and desktop automation |
21
21
  | [RUNTIME_EVIDENCE.md](RUNTIME_EVIDENCE.md) | Session ledger, runtime evidence, and final delivery checks |
22
22
  | [MEMORY_FABRIC.md](MEMORY_FABRIC.md) | Budgeted context pack built from runtime evidence, session events, knowledge recall, and graph status |
23
+ | [MEMORY_BRAIN.md](MEMORY_BRAIN.md) | Evidence-driven long-term memory, contradiction detection, dream maintenance, and failure replay capture |
24
+ | [CONTEXT_BUDGET.md](CONTEXT_BUDGET.md) | Context Budget, Progressive Governance, Lazy Loading, and Governance ROI |
25
+ | [CODE_INTELLIGENCE.md](CODE_INTELLIGENCE.md) | Code intelligence and exploration ROI with CodeGraph, Graphify, and explicit fallback |
26
+ | [WORKFLOW_EVAL.md](WORKFLOW_EVAL.md) | Workflow Eval, pass@k metrics, Failure Replay, and improvement candidates |
27
+ | [SKILL_RADAR.md](SKILL_RADAR.md) | Skill Radar, capability confidence, evidence requirements, and supply-chain safety checks |
28
+ | [UPGRADE_MANAGEMENT.md](UPGRADE_MANAGEMENT.md) | Safe upgrade flow for the SCALE CLI, governance packs, skills, MCP, and CLI tools |
29
+ | [GOVERNANCE_DASHBOARD.md](GOVERNANCE_DASHBOARD.md) | Unified governance dashboard over runtime, eval, memory, resource, and HTML artifact evidence |
23
30
  | [RELEASE_READINESS.md](RELEASE_READINESS.md) | Pre-release quality gates, the official demo, and real-project adoption acceptance |
24
31
  | [SKILL-REPOSITORY.md](SKILL-REPOSITORY.md) | Governed skill repository and installation safety policy |
25
32
  | [VIBE-TEMPLATES.md](VIBE-TEMPLATES.md) | Copyable Vibe Coding prompt templates |
26
- | [LEADERSHIP-PRESETS.md](LEADERSHIP-PRESETS.md) | Built-in leadership role presets such as CEO/CTO/PM/Architect |
33
+ | [LEADERSHIP-PRESETS.md](LEADERSHIP-PRESETS.md) | Built-in leadership role presets such as CEOCTOPMArchitect |
27
34
 
28
35
  ## Architecture and Reference
29
36
 
@@ -47,6 +54,7 @@
47
54
  | [WEEK1-2-REPORT.md](WEEK1-2-REPORT.md) | Phase report |
48
55
  | [TASK_GUARD_SUMMARY.md](TASK_GUARD_SUMMARY.md) | Task Guard summary |
49
56
  | [TASK_GUARD_WORKFLOW_DEMO.md](TASK_GUARD_WORKFLOW_DEMO.md) | Early workflow demo |
57
+ | [plans/2026-05-19-agent-engineering-os-upgrade-plan.md](plans/2026-05-19-agent-engineering-os-upgrade-plan.md) | Agent Engineering OS upgrade review draft: Context Budget, CodeGraph, Memory Brain, Skill Radar, HTML Artifact, and Eval Harness |
50
58
  | [plans/](plans/) | Archive of planning and technical proposals |
51
59
  | [superpowers/](superpowers/) | Archive of external methodology comparisons and plans |
52
60
 
@@ -54,15 +62,16 @@
54
62
 
55
63
  | Document | Description |
56
64
  | --- | --- |
57
- | [promote-article-v3.md](promote-article-v3.md) | Promotional article draft |
58
- | [promote-article-v3.html](promote-article-v3.html) | Promotional article, HTML version |
65
+ | [promote-article-v2.md](promote-article-v2.md) | Promotional article draft v2 |
66
+ | [promote-article-v2.html](promote-article-v2.html) | Promotional article HTML v2 |
67
+ | [promote-article-v3.md](promote-article-v3.md) | Promotional article draft v3 |
68
+ | [promote-article-v3.html](promote-article-v3.html) | Promotional article HTML v3 |
59
69
  | [imgs/](imgs/) | Community QR codes and promotional images |
60
70
 
61
71
  ## Maintenance Rules
62
72
 
63
73
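  - Docs aimed at new users go in `docs/start/` first.
64
74
  - Currently executable capabilities belong in the root README and the current governance capability docs.
65
- - Keep historical plans out of beginner tutorials so users do not treat old plans as current fact.
66
- - If CLI behavior changes, update README, `docs/start/quickstart.md`, and the related reference docs together.
67
- - If a governance pack is added, update README, `docs/start/README.md`, and the corresponding tests together.
68
-
75
+ - Keep historical plans out of beginner tutorials so users do not mistake old plans for current facts.
76
+ - If CLI behavior changes, update `README.md`, `docs/start/quickstart.md`, and the related reference docs together.
77
+ - If a governance pack is added, update `README.md`, `docs/start/README.md`, and the corresponding tests together.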
@@ -0,0 +1,115 @@
1
+ # Skill Radar
2
+
3
+ Skill Radar is the active capability selection layer for SCALE. It does not auto-install or blindly run skills. It scores relevant skills, MCP servers, browser tools, desktop automation, and external CLIs against the current task, then returns:
4
+
5
+ - why the capability matches
6
+ - confidence score
7
+ - safety level
8
+ - required evidence
9
+ - fallback path
10
+ - supply-chain checks before installation or promotion
11
+
12
+ The goal is to make agents actively use useful tools without turning the project into an unsafe prompt or tool bundle.
13
+
14
+ ## Commands
15
+
16
+ ```bash
17
+ scale skill radar --task "Design upload UI and run browser E2E checks" --files src/pages/upload.tsx
18
+ scale skill radar --task "Automate WPS desktop workflow with CUA" --json
19
+ scale skill radar --task "Review release PR" --phase review --level L --output docs/worklog/tasks/release/skill-radar.md
20
+ scale skill doctor --supply-chain
21
+ scale skill doctor --supply-chain --json
22
+ ```
23
+
24
+ ## Safety Levels
25
+
26
+ | Level | Meaning | Default action |
27
+ | --- | --- | --- |
28
+ | `trusted` | Official or low-risk capability with policy enabled | May be recommended when confidence is high |
29
+ | `review-required` | Third-party or ecosystem capability | Require source, license, scripts, and revision review |
30
+ | `restricted` | Browser, desktop, or external execution boundary | Require explicit evidence and side-effect boundaries |
31
+ | `blocked` | Disabled by policy or failed safety review | Do not run; use fallback |
32
+
33
+ ## Confidence
34
+
35
+ Skill Radar combines:
36
+
37
+ - task keywords and workflow phase
38
+ - changed file patterns
39
+ - local skill installation
40
+ - tool availability
41
+ - trust level
42
+ - policy status
43
+ - frontend/package evidence
44
+ - safety penalties
45
+
46
+ The score is not a promise that the tool will work. It is a routing signal. Any recommendation still needs real evidence before the agent can claim success.
47
+
48
+ ## Default Domains
49
+
50
+ | Domain | Typical triggers | Recommended capability types |
51
+ | --- | --- | --- |
52
+ | `ui` | UI, UX, frontend, component, visual, layout | design skills, visual review, screenshot evidence |
53
+ | `browserAutomation` | browser, E2E, Playwright, Chrome, DevTools | web access, browser automation, DevTools evidence |
54
+ | `desktopAutomation` | desktop, GUI, WPS, WeChat, CUA | disabled by default; manual operator fallback |
55
+ | `externalCli` | Codex, Gemini, OpenCode, external agent CLI | disabled by default; dry-run and output evidence |
56
+ | `review` | PR, merge, release, code review | reviewer skills, severity findings |
57
+ | `docs` | docs, README, ADR, governance asset | doc impact and source-of-truth evidence |
58
+ | `discovery` | skill, MCP, tool, capability discovery | find-skills plus safety review |
59
+
60
+ ## Evidence Contract
61
+
62
+ Each recommendation carries required evidence. Examples:
63
+
64
+ - UI work: `ui-spec`, `design-rationale`, `screenshot`, `visual-review`
65
+ - Browser work: `browser-evidence`, `console-summary`, `network-summary`, `scenario-result`
66
+ - Desktop work: `operator-boundary`, `desktop-screenshot`, `affected-app`
67
+ - External CLI work: `cli-version-check`, `command`, `exit-code`, `output-summary`
68
+ - Review work: `review-report`, `finding-list`, `severity`
69
+
70
+ If evidence is missing, the final delivery should list the capability as unverified rather than claiming it was used successfully.
71
+
72
+ ## Supply-Chain Doctor
73
+
74
+ `scale skill doctor --supply-chain` reviews known skill sources and install commands for:
75
+
76
+ - HTTPS source requirement
77
+ - `curl | bash`, `wget | sh`, `Invoke-Expression`, and `iex` blocking
78
+ - destructive install patterns
79
+ - npm/npx lifecycle script review
80
+ - required source, license, and revision checks
81
+
82
+ This is intentionally conservative. Third-party skills should start in review-required mode and be promoted only after inspection.
83
+
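To make the checks in the list above concrete, here is a minimal sketch of a conservative install-command review. The `SkillSource` shape, the regular expressions, and the finding messages are illustrative assumptions; the actual rules used by `scale skill doctor --supply-chain` are not shown in this document.

```typescript
// Hypothetical description of a skill source under review.
interface SkillSource {
  name: string;
  sourceUrl: string;
  installCommand: string;
}

// Patterns named in the list above; the regexes themselves are illustrative.
const PIPE_TO_SHELL = /\b(curl|wget)\b[^|]*\|\s*(sudo\s+)?(ba|z)?sh\b/i;
const POWERSHELL_EVAL = /\b(Invoke-Expression|iex)\b/i;

// Return human-readable findings; an empty list means no pattern matched.
function supplyChainFindings(skill: SkillSource): string[] {
  const findings: string[] = [];
  if (!skill.sourceUrl.startsWith('https://')) {
    findings.push('source is not HTTPS');
  }
  if (PIPE_TO_SHELL.test(skill.installCommand)) {
    findings.push('install command pipes a download into a shell');
  }
  if (POWERSHELL_EVAL.test(skill.installCommand)) {
    findings.push('install command uses Invoke-Expression or iex');
  }
  return findings;
}
```

An empty result here would only mean that none of these patterns matched. Source, license, and revision review still need a human, which is why third-party skills start in `review-required`.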
84
+ ## Policy Integration
85
+
86
+ Skill Radar reads `.scale/tools.json` through the Tool Policy layer. Defaults:
87
+
88
+ - UI and browser capabilities are enabled but evidence-required.
89
+ - Desktop CUA is disabled by default.
90
+ - External agent CLIs are disabled by default.
91
+ - Browser tools require captured evidence and should stay in approved domains.
92
+
93
+ Use Tool Policy to enable a restricted capability deliberately rather than relying on an agent's assumption.
94
+
95
+ ## Fallback Rule
96
+
97
+ Every recommendation must include a fallback. This prevents tool theater:
98
+
99
+ ```text
100
+ If the capability is missing, unsafe, low-confidence, or policy-blocked,
101
+ the agent must use the fallback and record why the capability was not used.
102
+ ```
103
+
104
+ ## Artifact Lifecycle
105
+
106
+ Skill Radar reports can be written into task artifacts:
107
+
108
+ ```bash
109
+ scale skill radar \
110
+ --task "Refactor upload page and verify browser flow" \
111
+ --files src/pages/upload.tsx \
112
+ --output docs/worklog/tasks/2026-05-19-upload-refactor/skill-radar.md
113
+ ```
114
+
115
+ Keep the report when it is evidence for an M/L/CRITICAL task. Do not commit transient local detection output unless it is part of the reviewed task artifact set.
@@ -0,0 +1,151 @@
1
+ # Workflow Eval Harness
2
+
3
+ Status: implemented baseline
4
+ Since: v0.22 development branch
5
+
6
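+ Workflow Eval Harness shows whether a workflow actually improves the quality of an agent's engineering delivery, rather than relying on subjective impressions. It runs lightweight eval suites, records pass@k, fix iterations, tool calls, token estimates, and human correction counts, and keeps a Failure Replay when a run fails.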
7
+
8
+ ## Commands
9
+
10
+ Initialize the default baseline suite:
11
+
12
+ ```bash
13
+ scale eval init
14
+ scale eval init --suite workflow-baseline --json
15
+ ```
16
+
17
+ Run a suite:
18
+
19
+ ```bash
20
+ scale eval run --suite workflow-baseline
21
+ scale eval run --suite workflow-baseline --json
22
+ ```
23
+
24
+ Compare two runs:
25
+
26
+ ```bash
27
+ scale eval compare --baseline <run-id> --candidate <run-id>
28
+ scale eval compare --baseline <run-id> --candidate <run-id> --json
29
+ ```
30
+
31
+ Generate a Markdown report:
32
+
33
+ ```bash
34
+ scale eval report --run <run-id>
35
+ scale eval report --run <run-id> --output docs/worklog/eval-report.md
36
+ ```
37
+
38
+ Inspect and promote failure replays:
39
+
40
+ ```bash
41
+ scale eval failures --since 30d
42
+ scale eval replay <failure-id>
43
+ scale eval replay --task-id <task-id>
44
+ scale eval promote-failure <failure-id>
45
+ ```
46
+
47
+ ## Failure Replay To Memory
48
+
49
+ Failure Replay is local eval evidence first. When a failure pattern is useful for future work, ingest it into Memory Brain as an `incident` candidate:
50
+
51
+ ```bash
52
+ scale memory ingest --from failure --failure-id <failure-id>
53
+ scale memory query "missing verification evidence"
54
+ scale memory promote <memory-node-id>
55
+ ```
56
+
57
+ This does not auto-change standards or hooks. It only makes the failure queryable and evidence-backed so repeated mistakes can be promoted deliberately after review.
58
+
59
+ ## Storage
60
+
61
+ ```text
62
+ .scale/evals/
63
+ ├── suites/
64
+ ├── runs/
65
+ ├── failures/
66
+ └── improvements/
67
+ ```
68
+
69
+ These files are local runtime evidence by default. Commit only curated summaries or intentional benchmark fixtures.
70
+
71
+ ## Suite Shape
72
+
73
+ ```json
74
+ {
75
+ "version": "1.0",
76
+ "id": "workflow-baseline",
77
+ "name": "SCALE workflow baseline",
78
+ "cases": [
79
+ {
80
+ "id": "governance-command-smoke",
81
+ "type": "bugfix",
82
+ "title": "Command evidence smoke",
83
+ "task": "Verify that a local command can produce concrete eval evidence.",
84
+ "phase": "verify",
85
+ "successCriteria": ["command exits 0"],
86
+ "attempts": [
87
+ {
88
+ "id": "attempt-1",
89
+ "command": "node -e \"console.log('scale-eval-ok')\"",
90
+ "expectedExitCode": 0,
91
+ "outputContains": "scale-eval-ok"
92
+ }
93
+ ]
94
+ }
95
+ ]
96
+ }
97
+ ```
98
+
99
+ ## Metrics
100
+
101
+ | Metric | Meaning |
102
+ | --- | --- |
103
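+ | `passAt1Rate` | Share of cases that pass on the first full attempt |
104
+ | `passAt3Rate` | Share of cases that pass within three attempts |
105
+ | `averageFixIterations` | Average number of fix cycles after the first failure |
106
+ | `totalToolCalls` | Number of eval attempts; a rough proxy for tool-call cost |
107
+ | `estimatedTokens` | Estimated token cost of the task text and output summaries |
108
+ | `humanCorrections` | Number of human corrections |
109
+ | `failureReplayCount` | Number of failure replay records |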
110
+
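The rate metrics in the table above can be derived from per-case attempt outcomes. The sketch below is one possible reading, not the harness's code: the `CaseResult` shape is hypothetical, and counting a fix iteration as each attempt made after the first one, up to the first pass, is an assumption about how `averageFixIterations` is defined.

```typescript
// Hypothetical record of one eval case.
interface CaseResult {
  caseId: string;
  // One entry per attempt, in order; true means the attempt passed.
  attempts: boolean[];
}

// Share of cases with at least one pass within the first k attempts.
function passAtK(results: CaseResult[], k: number): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.attempts.slice(0, k).some(Boolean)).length;
  return passed / results.length;
}

// Assumption: fix iterations are the attempts needed after the first one.
// A case that never passes counts every attempt after the first.
function fixIterations(attempts: boolean[]): number {
  const firstPass = attempts.indexOf(true);
  if (firstPass === -1) return Math.max(attempts.length - 1, 0);
  return firstPass;
}

function averageFixIterations(results: CaseResult[]): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, r) => sum + fixIterations(r.attempts), 0);
  return total / results.length;
}

// The table describes totalToolCalls as the number of eval attempts.
function totalToolCalls(results: CaseResult[]): number {
  return results.reduce((sum, r) => sum + r.attempts.length, 0);
}
```

With `k = 1` and `k = 3`, `passAtK` corresponds to `passAt1Rate` and `passAt3Rate` respectively.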
111
+ ## Failure Replay
112
+
113
+ A failure record captures more than the final failed status. It also stores:
114
+
115
+ - task and success criteria
116
+ - phase
117
+ - wrong turn
118
+ - evidence
119
+ - correction
120
+ - prevention
121
+ - replay command
122
+ - redaction status
123
+
124
+ Failure categories currently include:
125
+
126
+ - `wrong-exploration-path`
127
+ - `hallucinated-project-fact`
128
+ - `missing-codegraph-or-graph-fallback`
129
+ - `over-broad-context-load`
130
+ - `bad-skill-recommendation`
131
+ - `missing-verification-evidence`
132
+ - `failed-security-or-resource-gate`
133
+ - `human-correction-after-agent-confidence`
134
+ - `command-failure`
135
+ - `unknown`
136
+
137
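+ `scale eval promote-failure` promotes a failure replay to an improvement candidate but does not modify project standards automatically. Whether it becomes a long-term standard still requires confirmation by a human or a later review.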
138
+
139
+ ## Governance Use
140
+
141
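+ - The default suite in v0.22 is a lightweight smoke baseline that verifies the eval pipeline runs.
142
+ - Real projects should gradually add bugfix, feature, security, frontend, release, and resource cases.
143
+ - Failure Replay should work together with Resource Governance: keep records local by default, and commit only summaries, benchmarks, or cases that are explicitly meant to be maintained long term.
144
+ - Workflow Eval data can feed later Governance ROI reports to judge whether a governance module actually reduces rework, tool calls, tokens, or human corrections.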
145
+
146
+ ## Policy
147
+
148
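+ - Eval pass rates must not replace verification on the real project.
149
+ - Command output in failure records gets basic redaction, but avoid writing sensitive raw logs into a suite anyway.
150
+ - Low-cost smoke suites can run frequently; heavy project suites should run on demand.
151
+ - Without eval evidence, do not claim that a workflow capability has improved.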
@@ -13,7 +13,10 @@
13
13
  3. Return to the root [README](../../README.md)
14
14
  Understand SCALE Engine's core capabilities and governance pack choices.
15
15
 
16
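- 4. See the [documentation map](../README.md)
16
+ 4. [Upgrade management](../UPGRADE_MANAGEMENT.md)
17
+ Understand how to check first, generate a plan, and avoid overwriting local changes when the workflow or third-party skills/MCP/CLI tools are updated.
18
+
19
+ 5. See the [documentation map](../README.md)
17
20
  Tell apart which docs are user guides, which are reference material, and which are historical plans and process records.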
18
21
 
19
22
  ## What You Should See First
@@ -39,4 +42,5 @@
39
42
  | Go multi-service backend | `scale init --governance-pack go-service-matrix` |
40
43
  | Multi-repo/MOE workspace | `scale init --governance-pack moe-workspace` |
41
44
  | Disorganized docs, reports, screenshots, and scripts | `scale init --governance-pack resource-governance` |
45
+ | Workflow or third-party capabilities need upgrading | `scale upgrade check && scale upgrade plan --html` |
42
46
 
@@ -0,0 +1,14 @@
1
+ # CONTEXT.md
2
+
3
+ Project: Agent Governance Demo
4
+
5
+ | Term | Definition | Examples | Aliases | Source |
6
+ |------|------------|----------|---------|--------|
7
+ | OAuth state | One-time callback correlation value that binds authorization return traffic to a user session | `state-123` | callback state | `src/oauth-state.ts` |
8
+ | Consumed state | A state record that has already been used and must not be accepted again | `consumedAt: 900` | replayed state | `tests/oauth-state.test.ts` |
9
+ | Evidence | A command result or artifact that proves what was verified | `npm test`, eval report, dashboard | verification proof | SCALE workflow |
10
+
11
+ ## Rejected Meanings
12
+
13
+ - Do not treat an expired state as recoverable without a new authorization flow.
14
+ - Do not treat a dashboard or eval report as a substitute for the business test.