@hongmaple0820/scale-engine 0.19.0 → 0.21.0
- package/README.en.md +17 -3
- package/README.md +143 -9
- package/dist/api/cli.js +1187 -30
- package/dist/api/cli.js.map +1 -1
- package/dist/codegraph/CodeIntelligence.d.ts +135 -0
- package/dist/codegraph/CodeIntelligence.js +460 -0
- package/dist/codegraph/CodeIntelligence.js.map +1 -0
- package/dist/context/ContextBudget.d.ts +90 -0
- package/dist/context/ContextBudget.js +322 -0
- package/dist/context/ContextBudget.js.map +1 -0
- package/dist/eval/WorkflowEval.d.ts +161 -0
- package/dist/eval/WorkflowEval.js +379 -0
- package/dist/eval/WorkflowEval.js.map +1 -0
- package/dist/governance/GovernanceRoi.d.ts +25 -0
- package/dist/governance/GovernanceRoi.js +70 -0
- package/dist/governance/GovernanceRoi.js.map +1 -0
- package/dist/governance/ProgressiveGovernance.d.ts +22 -0
- package/dist/governance/ProgressiveGovernance.js +159 -0
- package/dist/governance/ProgressiveGovernance.js.map +1 -0
- package/dist/memory/MemoryBrain.d.ts +135 -0
- package/dist/memory/MemoryBrain.js +635 -0
- package/dist/memory/MemoryBrain.js.map +1 -0
- package/dist/memory/index.d.ts +1 -0
- package/dist/memory/index.js +1 -0
- package/dist/memory/index.js.map +1 -1
- package/dist/output/GovernanceDashboard.d.ts +57 -0
- package/dist/output/GovernanceDashboard.js +250 -0
- package/dist/output/GovernanceDashboard.js.map +1 -0
- package/dist/output/index.d.ts +2 -0
- package/dist/output/index.js +1 -0
- package/dist/output/index.js.map +1 -1
- package/dist/skills/SkillRadar.d.ts +83 -0
- package/dist/skills/SkillRadar.js +384 -0
- package/dist/skills/SkillRadar.js.map +1 -0
- package/dist/workflow/GovernanceTemplates.js +220 -194
- package/dist/workflow/GovernanceTemplates.js.map +1 -1
- package/dist/workflow/UpgradeManager.d.ts +140 -0
- package/dist/workflow/UpgradeManager.js +434 -0
- package/dist/workflow/UpgradeManager.js.map +1 -0
- package/docs/CODE_INTELLIGENCE.md +138 -0
- package/docs/CONTEXT_BUDGET.md +87 -0
- package/docs/GOVERNANCE_DASHBOARD.md +69 -0
- package/docs/MEMORY_BRAIN.md +104 -0
- package/docs/README.md +17 -8
- package/docs/SKILL_RADAR.md +115 -0
- package/docs/WORKFLOW_EVAL.md +151 -0
- package/docs/start/README.md +5 -1
- package/examples/demo-projects/agent-governance-demo/CONTEXT.md +14 -0
- package/examples/demo-projects/agent-governance-demo/README.md +32 -21
- package/examples/demo-projects/agent-governance-demo/docs/CONTEXT-MAP.md +14 -0
- package/examples/demo-projects/agent-governance-demo/package.json +6 -1
- package/package.json +7 -1
|
@@ -0,0 +1,87 @@
|
|
|
1
|
+
# Context Budget And Progressive Governance
|
|
2
|
+
|
|
3
|
+
Status: implemented baseline
|
|
4
|
+
Since: v0.20 development branch
|
|
5
|
+
|
|
6
|
+
This feature keeps SCALE from becoming a source of context pollution itself. It separates always-loaded rules from on-demand documents, runtime evidence, historical archives, and generated artifacts.
|
|
7
|
+
|
|
8
|
+
## Commands
|
|
9
|
+
|
|
10
|
+
Report token cost by context category:
|
|
11
|
+
|
|
12
|
+
```bash
|
|
13
|
+
scale context budget --json
|
|
14
|
+
```
|
|
15
|
+
|
|
16
|
+
Write the report to `.scale/context-budget.json`:
|
|
17
|
+
|
|
18
|
+
```bash
|
|
19
|
+
scale context budget --write
|
|
20
|
+
```
|
|
21
|
+
|
|
22
|
+
Check thresholds:
|
|
23
|
+
|
|
24
|
+
```bash
|
|
25
|
+
scale context doctor --max-always 2500 --max-task 8000
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
Build a lazy-loaded task context pack:
|
|
29
|
+
|
|
30
|
+
```bash
|
|
31
|
+
scale context pack \
|
|
32
|
+
--task "Review frontend route with browser evidence" \
|
|
33
|
+
--level L \
|
|
34
|
+
--files src/routes/upload.tsx \
|
|
35
|
+
--budget 4000 \
|
|
36
|
+
--json
|
|
37
|
+
```
|
|
38
|
+
|
|
39
|
+
Evaluate progressive governance mode:
|
|
40
|
+
|
|
41
|
+
```bash
|
|
42
|
+
scale governance mode \
|
|
43
|
+
--task "Change auth permissions and database migration" \
|
|
44
|
+
--files src/auth/user.ts,migrations/001.sql \
|
|
45
|
+
--requested-mode minimal \
|
|
46
|
+
--json
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
Report governance benefit and overhead:
|
|
50
|
+
|
|
51
|
+
```bash
|
|
52
|
+
scale governance roi \
|
|
53
|
+
--task-id TASK-123 \
|
|
54
|
+
--task "Review frontend route with browser evidence" \
|
|
55
|
+
--files src/routes/upload.tsx \
|
|
56
|
+
--json
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
## Categories
|
|
60
|
+
|
|
61
|
+
| Category | Meaning | Loading Policy |
|
|
62
|
+
| --- | --- | --- |
|
|
63
|
+
| `always` | Tiny entrypoint rules and source-of-truth governance config | Keep under strict token budget |
|
|
64
|
+
| `on-demand` | Domain docs and governance guides | Load only when task trigger matches |
|
|
65
|
+
| `evidence` | Runtime evidence and task artifacts | Summarize and reference by path |
|
|
66
|
+
| `archive` | Historical plans and old roadmap context | Do not load unless explicitly requested |
|
|
67
|
+
| `generated` | HTML reports, screenshots, graph outputs, generated artifacts | Keep manifest-only by default |
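
The category table can be read as a per-category token ledger. The sketch below is illustrative only: the `ContextEntry` shape, the function names, and the sample paths are assumptions and are not taken from the SCALE source. It sums estimated tokens per category and checks the always-loaded set against a limit like the one passed to `--max-always`.

```typescript
// Hypothetical sketch: these types and names are not part of the SCALE API.
type ContextCategory = "always" | "on-demand" | "evidence" | "archive" | "generated";

interface ContextEntry {
  path: string;
  category: ContextCategory;
  estimatedTokens: number;
}

// Sum estimated tokens for each category.
function totalsByCategory(entries: ContextEntry[]): Record<ContextCategory, number> {
  const totals: Record<ContextCategory, number> = {
    always: 0,
    "on-demand": 0,
    evidence: 0,
    archive: 0,
    generated: 0,
  };
  for (const entry of entries) {
    totals[entry.category] += entry.estimatedTokens;
  }
  return totals;
}

// True when the always-loaded set is larger than the given budget.
function exceedsAlwaysBudget(entries: ContextEntry[], maxAlways: number): boolean {
  return totalsByCategory(entries).always > maxAlways;
}

// Sample paths are made up for illustration.
const sample: ContextEntry[] = [
  { path: "AGENTS.md", category: "always", estimatedTokens: 1800 },
  { path: "docs/CODE_INTELLIGENCE.md", category: "on-demand", estimatedTokens: 3200 },
  { path: ".scale/reports/governance-dashboard.html", category: "generated", estimatedTokens: 9000 },
];

console.log(totalsByCategory(sample), exceedsAlwaysBudget(sample, 2500));
```

Keeping the check limited to the `always` category mirrors the table's loading policy: only that category is described as having a strict budget.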
|
|
68
|
+
|
|
69
|
+
## Progressive Governance
|
|
70
|
+
|
|
71
|
+
SCALE now has a baseline risk classifier. It keeps low-risk documentation work in `minimal` mode and escalates risky tasks to `standard`, `expanded`, or `critical`.
|
|
72
|
+
|
|
73
|
+
Examples:
|
|
74
|
+
|
|
75
|
+
| Signal | Mode |
|
|
76
|
+
| --- | --- |
|
|
77
|
+
| README typo | `minimal` |
|
|
78
|
+
| normal implementation task | `standard` |
|
|
79
|
+
| UI, browser, E2E, public interface, or cross-module work | `expanded` |
|
|
80
|
+
| auth, permission, secret, database, migration, production config, release, or destructive operation | `critical` |
|
|
81
|
+
|
|
82
|
+
This is not a replacement for verification. It only decides which governance behavior should activate.
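
To make the signal table concrete, here is a minimal keyword-based sketch of such a classifier. It is not the SCALE implementation: the regular expressions, function names, and the rule that a requested mode can be raised but never lowered are all assumptions made for illustration.

```typescript
// Hypothetical sketch of a keyword risk classifier; not the SCALE source.
type GovernanceMode = "minimal" | "standard" | "expanded" | "critical";

const MODE_ORDER: GovernanceMode[] = ["minimal", "standard", "expanded", "critical"];

// Keyword lists loosely follow the signal table above.
const CRITICAL_SIGNALS =
  /\b(auth|permissions?|secrets?|database|migrations?|production|release|destructive)\b/i;
const EXPANDED_SIGNALS = /\b(ui|browser|e2e|public interface|cross-module)\b/i;
const MINIMAL_SIGNALS = /\b(readme|typo)\b/i;

// Classify a task by scanning its description and file paths for signals.
function classifyRisk(task: string, files: string[]): GovernanceMode {
  const text = [task, ...files].join(" ");
  if (CRITICAL_SIGNALS.test(text)) return "critical";
  if (EXPANDED_SIGNALS.test(text)) return "expanded";
  if (MINIMAL_SIGNALS.test(text)) return "minimal";
  return "standard";
}

// Assumption: the classifier may escalate a requested mode but never lower it.
function effectiveMode(requested: GovernanceMode, classified: GovernanceMode): GovernanceMode {
  return MODE_ORDER.indexOf(classified) > MODE_ORDER.indexOf(requested) ? classified : requested;
}

const classified = classifyRisk("Change auth permissions and database migration", [
  "src/auth/user.ts",
  "migrations/001.sql",
]);
console.log(effectiveMode("minimal", classified));
```

The highest-risk pattern is checked first so that a task mentioning both a README and a migration is never downgraded by the weaker signal.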
|
|
83
|
+
|
|
84
|
+
## Governance ROI
|
|
85
|
+
|
|
86
|
+
`scale governance roi` reports both benefit and overhead. Early ROI is estimated from context budget and risk signals. Later versions should replace estimates with measured eval data such as file reads saved, tool calls saved, fix iterations reduced, and human corrections avoided.
|
|
87
|
+
|
|
@@ -0,0 +1,69 @@
|
|
|
1
|
+
# Governance Dashboard
|
|
2
|
+
|
|
3
|
+
Status: implemented baseline
|
|
4
|
+
Since: v0.25 development branch
|
|
5
|
+
|
|
6
|
+
Governance Dashboard turns existing SCALE evidence into a single reviewable HTML page. It does not replace Markdown, JSON, runtime evidence, eval records, or memory. It is a human-facing view over those sources.
|
|
7
|
+
|
|
8
|
+
## Command
|
|
9
|
+
|
|
10
|
+
```bash
|
|
11
|
+
scale artifact dashboard
|
|
12
|
+
scale artifact dashboard --task-id <task-id>
|
|
13
|
+
scale artifact dashboard --dir /path/to/project
|
|
14
|
+
scale artifact dashboard --output docs/worklog/tasks/<task-id>/artifacts/governance-dashboard.html
|
|
15
|
+
scale artifact dashboard --json
|
|
16
|
+
```
|
|
17
|
+
|
|
18
|
+
Default output:
|
|
19
|
+
|
|
20
|
+
```text
|
|
21
|
+
.scale/reports/governance-dashboard.html
|
|
22
|
+
.scale/reports/governance-dashboard-manifest.json
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
The default lifecycle is `generated-report` and the default Git policy is `ignore`. Promote or commit only dashboards that are intentionally used as reviewed task evidence or release evidence.
|
|
26
|
+
|
|
27
|
+
When `--dir` is used and `SCALE_DIR` is not set, the default `.scale` directory is resolved inside the target project directory, not inside the shell's current working directory. This matters for scaffold and multi-repo validation runs.
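
The resolution order described above can be sketched as follows. This is an assumption-laden illustration rather than the CLI's actual code: the function name is hypothetical, and it assumes that a non-empty `SCALE_DIR` is used as given.

```typescript
import * as path from "node:path";

// Hypothetical sketch: resolve the .scale directory for a dashboard run.
// Assumption: a non-empty SCALE_DIR always wins; otherwise .scale lives in
// the --dir target, falling back to the current working directory.
function resolveScaleDir(
  targetDir: string | undefined,
  env: Record<string, string | undefined>,
  cwd: string,
): string {
  const override = env["SCALE_DIR"];
  if (override !== undefined && override !== "") {
    return override;
  }
  return path.join(targetDir ?? cwd, ".scale");
}

// With --dir and no SCALE_DIR, the result is inside the target project.
console.log(resolveScaleDir("/repos/service-a", {}, "/home/me/workspace"));
```

Passing `env` and `cwd` as parameters keeps the function pure, which makes the multi-repo case easy to exercise without changing the shell's working directory.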
|
|
28
|
+
|
|
29
|
+
## Inputs
|
|
30
|
+
|
|
31
|
+
The dashboard reads existing local evidence:
|
|
32
|
+
|
|
33
|
+
| Area | Source |
|
|
34
|
+
| --- | --- |
|
|
35
|
+
| Runtime evidence | `.scale/evidence/runtime/` |
|
|
36
|
+
| Workflow eval | `.scale/evals/runs/` and `.scale/evals/failures/` |
|
|
37
|
+
| Memory Brain | `.scale/memory/brain.sqlite` |
|
|
38
|
+
| Resource Governance | workspace files plus `.scale/resource-policy.json` and `.scale/assets.json` |
|
|
39
|
+
| HTML artifacts | task artifact manifests and rendered HTML files |
|
|
40
|
+
|
|
41
|
+
## Status Model
|
|
42
|
+
|
|
43
|
+
- Runtime evidence failures are blocking.
|
|
44
|
+
- Memory contradictions are blocking.
|
|
45
|
+
- Resource Governance failures are blocking.
|
|
46
|
+
- Open eval failure replays are warnings, because they may be intentional baseline failures or pending improvement work.
|
|
47
|
+
- Missing task HTML artifacts are informational.
|
|
48
|
+
|
|
49
|
+
This keeps the dashboard useful as a review surface without turning every observation into a hard gate.
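
The status model above can be sketched as a small severity mapping. The `DashboardSignals` shape and both function names are hypothetical; only the severity assigned to each area comes from the list above.

```typescript
// Hypothetical sketch of the dashboard status model; field names are assumptions.
type Severity = "blocking" | "warning" | "info";

interface DashboardSignals {
  runtimeEvidenceFailures: number;
  memoryContradictions: number;
  resourceGovernanceFailures: number;
  openEvalFailureReplays: number;
  missingTaskHtmlArtifacts: number;
}

interface Finding {
  area: string;
  severity: Severity;
  count: number;
}

// Map each non-zero signal to the severity the status model assigns to it.
function collectFindings(s: DashboardSignals): Finding[] {
  const candidates: Finding[] = [
    { area: "runtime-evidence", severity: "blocking", count: s.runtimeEvidenceFailures },
    { area: "memory-contradictions", severity: "blocking", count: s.memoryContradictions },
    { area: "resource-governance", severity: "blocking", count: s.resourceGovernanceFailures },
    { area: "eval-failure-replays", severity: "warning", count: s.openEvalFailureReplays },
    { area: "task-html-artifacts", severity: "info", count: s.missingTaskHtmlArtifacts },
  ];
  return candidates.filter((f) => f.count > 0);
}

// The overall status is the most severe finding, or "ok" when there is none.
function overallStatus(findings: Finding[]): Severity | "ok" {
  if (findings.some((f) => f.severity === "blocking")) return "blocking";
  if (findings.some((f) => f.severity === "warning")) return "warning";
  if (findings.some((f) => f.severity === "info")) return "info";
  return "ok";
}

const clean: DashboardSignals = {
  runtimeEvidenceFailures: 0,
  memoryContradictions: 0,
  resourceGovernanceFailures: 0,
  openEvalFailureReplays: 0,
  missingTaskHtmlArtifacts: 0,
};
console.log(overallStatus(collectFindings({ ...clean, openEvalFailureReplays: 2 })));
```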
|
|
50
|
+
|
|
51
|
+
## Recommended Use
|
|
52
|
+
|
|
53
|
+
For M/L/CRITICAL work:
|
|
54
|
+
|
|
55
|
+
```bash
|
|
56
|
+
scale verify <task-id>
|
|
57
|
+
scale eval run --suite workflow-baseline
|
|
58
|
+
scale memory dream --json
|
|
59
|
+
scale artifact dashboard --task-id <task-id>
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
For release review:
|
|
63
|
+
|
|
64
|
+
```bash
|
|
65
|
+
scale artifact dashboard
|
|
66
|
+
scale artifact open --artifact-dir .scale/reports --type governance-dashboard --print-only
|
|
67
|
+
```
|
|
68
|
+
|
|
69
|
+
The dashboard should be attached to a release or PR only when it is deliberately selected as a review artifact. Routine generated dashboards should stay local.
|
|
@@ -0,0 +1,104 @@
|
|
|
1
|
+
# Memory Brain
|
|
2
|
+
|
|
3
|
+
Memory Brain is SCALE's project-scoped long-term memory layer. It is separate from Memory Fabric:
|
|
4
|
+
|
|
5
|
+
- Memory Fabric builds a compact context pack for the current task.
|
|
6
|
+
- Memory Brain stores reviewed project knowledge with evidence, confidence, scope, and contradiction checks.
|
|
7
|
+
|
|
8
|
+
The first version is local-first and uses SQLite:
|
|
9
|
+
|
|
10
|
+
```text
|
|
11
|
+
.scale/memory/brain.sqlite
|
|
12
|
+
.scale/memory/brain-manifest.json
|
|
13
|
+
```
|
|
14
|
+
|
|
15
|
+
## Commands
|
|
16
|
+
|
|
17
|
+
```bash
|
|
18
|
+
scale memory ingest --from evidence --task-id <task-id>
|
|
19
|
+
scale memory ingest --from candidate --candidate-id <candidate-id>
|
|
20
|
+
scale memory ingest --from failure --failure-id <failure-replay-id>
|
|
21
|
+
scale memory query "OAuth callback state design"
|
|
22
|
+
scale memory contradictions
|
|
23
|
+
scale memory dream
|
|
24
|
+
scale memory promote <memory-node-id-or-candidate-id>
|
|
25
|
+
scale memory export --output .scale/memory/export.jsonl
|
|
26
|
+
scale memory import .scale/memory/export.jsonl
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
## Node Contract
|
|
30
|
+
|
|
31
|
+
```ts
|
|
32
|
+
interface MemoryNode {
|
|
33
|
+
id: string
|
|
34
|
+
type: 'fact' | 'decision' | 'incident' | 'relation' | 'contradiction'
|
|
35
|
+
title: string
|
|
36
|
+
summary: string
|
|
37
|
+
entities: string[]
|
|
38
|
+
source: 'runtime-evidence' | 'task-artifact' | 'docs' | 'git' | 'manual'
|
|
39
|
+
evidencePaths: string[]
|
|
40
|
+
confidence: number
|
|
41
|
+
scope: 'project' | 'workspace' | 'global-candidate'
|
|
42
|
+
status: 'candidate' | 'active' | 'stale' | 'rejected'
|
|
43
|
+
createdAt: string
|
|
44
|
+
updatedAt: string
|
|
45
|
+
lastVerifiedAt?: string
|
|
46
|
+
}
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
## Evidence Rule
|
|
50
|
+
|
|
51
|
+
Active memory must have at least one evidence path. SCALE blocks promotion when this is not true.
|
|
52
|
+
|
|
53
|
+
Runtime evidence and learning candidates are ingested as `candidate` records first. `scale memory promote` is the explicit boundary where reviewed memory becomes active.
|
|
54
|
+
|
|
55
|
+
Failure replay records can also be ingested as `incident` candidates:
|
|
56
|
+
|
|
57
|
+
```bash
|
|
58
|
+
scale eval run --suite workflow-baseline
|
|
59
|
+
scale eval failures --since 30d
|
|
60
|
+
scale memory ingest --from failure --failure-id <failure-replay-id>
|
|
61
|
+
scale memory promote <memory-node-id>
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
This connects Eval Harness failures to long-term memory without automatically rewriting project standards. A failure becomes active memory only after promotion and only if the replay artifact is present as evidence.
|
|
65
|
+
|
|
66
|
+
## Scope Rule
|
|
67
|
+
|
|
68
|
+
Project memory stays project-scoped by default. `global-candidate` is allowed for export and review, but it cannot be activated inside a project brain. This prevents one project's temporary truth from becoming a global rule.
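
The evidence and scope rules can be expressed together as one promotion guard. The sketch below uses a reduced node shape based on the `MemoryNode` contract; the `promote` function and `PromotionResult` type are hypothetical and do not describe how `scale memory promote` is implemented.

```typescript
// Hypothetical sketch of a promotion guard; not the SCALE implementation.
type MemoryScope = "project" | "workspace" | "global-candidate";
type MemoryStatus = "candidate" | "active" | "stale" | "rejected";

// Reduced subset of the MemoryNode contract shown earlier.
interface PromotableNode {
  id: string;
  evidencePaths: string[];
  scope: MemoryScope;
  status: MemoryStatus;
}

type PromotionResult =
  | { ok: true; node: PromotableNode }
  | { ok: false; reason: string };

function promote(node: PromotableNode): PromotionResult {
  // Evidence rule: active memory needs at least one evidence path.
  if (node.evidencePaths.length === 0) {
    return { ok: false, reason: "no evidence path" };
  }
  // Scope rule: global candidates cannot be activated inside a project brain.
  if (node.scope === "global-candidate") {
    return { ok: false, reason: "global-candidate cannot be activated in a project brain" };
  }
  return { ok: true, node: { ...node, status: "active" } };
}

const result = promote({
  id: "mem-1",
  evidencePaths: [".scale/evidence/runtime/run-1.json"],
  scope: "project",
  status: "candidate",
});
console.log(result.ok);
```

Returning a reason instead of throwing keeps the blocked case reportable, which fits a command that is meant to explain why promotion was refused.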
|
|
69
|
+
|
|
70
|
+
## Contradiction Rule
|
|
71
|
+
|
|
72
|
+
`scale memory contradictions` reports conflicts instead of resolving them automatically. Examples:
|
|
73
|
+
|
|
74
|
+
- one memory says a provider is enabled, another says it is disabled
|
|
75
|
+
- one memory says a route exists, another says it is missing
|
|
76
|
+
- one memory says an operation is allowed, another says it is blocked
|
|
77
|
+
|
|
78
|
+
The command exits non-zero when active contradictions exist.
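
As an illustration of the report-only behavior, the sketch below pairs up active claims that assert opposite states about the same subject. The structured `Claim` shape and the `OPPOSITES` table are assumptions; the source does not say how Memory Brain extracts or compares claims.

```typescript
// Hypothetical sketch of contradiction reporting; the Claim shape is assumed.
interface Claim {
  nodeId: string;
  subject: string;
  state: string;
  status: "candidate" | "active" | "stale" | "rejected";
}

// Opposite state pairs taken from the examples above.
const OPPOSITES: Record<string, string> = {
  enabled: "disabled",
  disabled: "enabled",
  exists: "missing",
  missing: "exists",
  allowed: "blocked",
  blocked: "allowed",
};

// Report pairs of active claims about the same subject with opposite states.
function findContradictions(claims: Claim[]): Array<[Claim, Claim]> {
  const active = claims.filter((c) => c.status === "active");
  const pairs: Array<[Claim, Claim]> = [];
  active.forEach((a, i) => {
    active.slice(i + 1).forEach((b) => {
      if (a.subject === b.subject && OPPOSITES[a.state] === b.state) {
        pairs.push([a, b]);
      }
    });
  });
  return pairs;
}

// Non-zero exit code when any active contradiction exists.
function exitCodeFor(claims: Claim[]): number {
  return findContradictions(claims).length > 0 ? 1 : 0;
}

const claims: Claim[] = [
  { nodeId: "m1", subject: "oauth-provider", state: "enabled", status: "active" },
  { nodeId: "m2", subject: "oauth-provider", state: "disabled", status: "active" },
  { nodeId: "m3", subject: "upload-route", state: "missing", status: "candidate" },
];
console.log(findContradictions(claims).length, exitCodeFor(claims));
```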
|
|
79
|
+
|
|
80
|
+
## Dream Maintenance
|
|
81
|
+
|
|
82
|
+
`scale memory dream` is a maintenance pass. It reports:
|
|
83
|
+
|
|
84
|
+
- promotion candidates
|
|
85
|
+
- stale active memories
|
|
86
|
+
- duplicate groups
|
|
87
|
+
- contradictions
|
|
88
|
+
- suggested docs to update
|
|
89
|
+
- active memories missing evidence
|
|
90
|
+
|
|
91
|
+
It does not auto-promote standards, rewrite docs, or delete memories.
|
|
92
|
+
|
|
93
|
+
## Resource Lifecycle
|
|
94
|
+
|
|
95
|
+
Memory Brain files under `.scale/memory/` are local runtime state by default. Commit only curated exports, documented decisions, or task artifacts that were intentionally reviewed.
|
|
96
|
+
|
|
97
|
+
Recommended flow:
|
|
98
|
+
|
|
99
|
+
```text
|
|
100
|
+
runtime evidence -> memory settle -> memory ingest -> memory promote -> docs/standards update when stable
|
|
101
|
+
eval failure replay -> memory ingest --from failure -> memory promote -> workflow rule update when stable
|
|
102
|
+
```
|
|
103
|
+
|
|
104
|
+
This keeps memory useful without turning every session observation into permanent project truth.
|
package/docs/README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# SCALE Engine Documentation Map
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
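This directory contains user guides, governance capability docs, architecture references, historical plans, and promotional material. New users should start with the getting-started entry points and the current governance capability docs; historical plans are background material only.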
|
|
4
4
|
|
|
5
5
|
## Entry Points for New Users
|
|
6
6
|
|
|
@@ -20,10 +20,17 @@
|
|
|
20
20
|
| [TOOL_ORCHESTRATION.md](TOOL_ORCHESTRATION.md) | Orchestration strategy for skills, MCP, CLI, browser, and desktop automation |
|
|
21
21
|
| [RUNTIME_EVIDENCE.md](RUNTIME_EVIDENCE.md) | Session ledger, runtime evidence, and final delivery checks |
|
|
22
22
|
| [MEMORY_FABRIC.md](MEMORY_FABRIC.md) | Budgeted context pack built from runtime evidence, session events, knowledge recall, and graph status |
|
|
23
|
+
| [MEMORY_BRAIN.md](MEMORY_BRAIN.md) | Evidence-driven long-term memory, contradiction detection, dream maintenance, and failure replay ingestion |
|
|
24
|
+
| [CONTEXT_BUDGET.md](CONTEXT_BUDGET.md) | Context Budget, Progressive Governance, Lazy Loading, and Governance ROI |
|
|
25
|
+
| [CODE_INTELLIGENCE.md](CODE_INTELLIGENCE.md) | Code intelligence and exploration ROI with CodeGraph, Graphify, and explicit fallback |
|
|
26
|
+
| [WORKFLOW_EVAL.md](WORKFLOW_EVAL.md) | Workflow Eval, pass@k metrics, Failure Replay, and improvement candidates |
|
|
27
|
+
| [SKILL_RADAR.md](SKILL_RADAR.md) | Skill Radar, capability confidence, evidence requirements, and supply-chain safety checks |
|
|
28
|
+
| [UPGRADE_MANAGEMENT.md](UPGRADE_MANAGEMENT.md) | Safe upgrade flow for the SCALE CLI, governance packs, skills, MCP, and CLI tools |
|
|
29
|
+
| [GOVERNANCE_DASHBOARD.md](GOVERNANCE_DASHBOARD.md) | Unified governance dashboard across runtime, eval, memory, resource, and HTML artifacts |
|
|
23
30
|
| [RELEASE_READINESS.md](RELEASE_READINESS.md) | Pre-release quality gates, the official demo, and acceptance checks for real-project adoption |
|
|
24
31
|
| [SKILL-REPOSITORY.md](SKILL-REPOSITORY.md) | Governed skill repository and installation safety policy |
|
|
25
32
|
| [VIBE-TEMPLATES.md](VIBE-TEMPLATES.md) | Copy-ready Vibe Coding prompt templates |
|
|
26
|
-
| [LEADERSHIP-PRESETS.md](LEADERSHIP-PRESETS.md) | CEO
|
|
33
|
+
| [LEADERSHIP-PRESETS.md](LEADERSHIP-PRESETS.md) | Built-in leadership role presets such as CEO, CTO, PM, and Architect |
|
|
27
34
|
|
|
28
35
|
## Architecture and Reference
|
|
29
36
|
|
|
@@ -47,6 +54,7 @@
|
|
|
47
54
|
| [WEEK1-2-REPORT.md](WEEK1-2-REPORT.md) | Phase report |
|
|
48
55
|
| [TASK_GUARD_SUMMARY.md](TASK_GUARD_SUMMARY.md) | Task Guard summary |
|
|
49
56
|
| [TASK_GUARD_WORKFLOW_DEMO.md](TASK_GUARD_WORKFLOW_DEMO.md) | Early workflow demo |
|
|
57
|
+
| [plans/2026-05-19-agent-engineering-os-upgrade-plan.md](plans/2026-05-19-agent-engineering-os-upgrade-plan.md) | Agent Engineering OS upgrade review draft: Context Budget, CodeGraph, Memory Brain, Skill Radar, HTML Artifact, and Eval Harness |
|
|
50
58
|
| [plans/](plans/) | Archive of planning and technical proposals |
|
|
51
59
|
| [superpowers/](superpowers/) | External methodology comparisons and plan archive |
|
|
52
60
|
|
|
@@ -54,15 +62,16 @@
|
|
|
54
62
|
|
|
55
63
|
| Document | Description |
|
|
56
64
|
| --- | --- |
|
|
57
|
-
| [promote-article-
|
|
58
|
-
| [promote-article-
|
|
65
|
+
| [promote-article-v2.md](promote-article-v2.md) | Promotional article draft v2 |
|
|
66
|
+
| [promote-article-v2.html](promote-article-v2.html) | Promotional article HTML v2 |
|
|
67
|
+
| [promote-article-v3.md](promote-article-v3.md) | Promotional article draft v3 |
|
|
68
|
+
| [promote-article-v3.html](promote-article-v3.html) | Promotional article HTML v3 |
|
|
59
69
|
| [imgs/](imgs/) | Community QR codes and promotional images |
|
|
60
70
|
|
|
61
71
|
## Maintenance Rules
|
|
62
72
|
|
|
63
73
|
- Documentation aimed at new users should go in `docs/start/` first.
|
|
64
74
|
- Currently executable capabilities belong in the root README and the current governance capability docs.
|
|
65
|
-
-
|
|
66
|
-
- If CLI behavior changes, you must also update README
|
|
67
|
-
- If a governance pack is added, you must also update README
|
|
68
|
-
|
|
75
|
+
- Do not mix historical plans into beginner tutorials, so that users do not mistake old plans for current facts.
|
|
76
|
+
- If CLI behavior changes, you must also update `README.md`, `docs/start/quickstart.md`, and the related reference docs.
|
|
77
|
+
- If a governance pack is added, you must also update `README.md`, `docs/start/README.md`, and the corresponding tests.
|
|
@@ -0,0 +1,115 @@
|
|
|
1
|
+
# Skill Radar
|
|
2
|
+
|
|
3
|
+
Skill Radar is the active capability selection layer for SCALE. It does not auto-install or blindly run skills. It scores relevant skills, MCP servers, browser tools, desktop automation, and external CLIs against the current task, then returns:
|
|
4
|
+
|
|
5
|
+
- why the capability matches
|
|
6
|
+
- confidence score
|
|
7
|
+
- safety level
|
|
8
|
+
- required evidence
|
|
9
|
+
- fallback path
|
|
10
|
+
- supply-chain checks before installation or promotion
|
|
11
|
+
|
|
12
|
+
The goal is to make agents actively use useful tools without turning the project into an unsafe prompt or tool bundle.
|
|
13
|
+
|
|
14
|
+
## Commands
|
|
15
|
+
|
|
16
|
+
```bash
|
|
17
|
+
scale skill radar --task "Design upload UI and run browser E2E checks" --files src/pages/upload.tsx
|
|
18
|
+
scale skill radar --task "Automate WPS desktop workflow with CUA" --json
|
|
19
|
+
scale skill radar --task "Review release PR" --phase review --level L --output docs/worklog/tasks/release/skill-radar.md
|
|
20
|
+
scale skill doctor --supply-chain
|
|
21
|
+
scale skill doctor --supply-chain --json
|
|
22
|
+
```
|
|
23
|
+
|
|
24
|
+
## Safety Levels
|
|
25
|
+
|
|
26
|
+
| Level | Meaning | Default action |
|
|
27
|
+
| --- | --- | --- |
|
|
28
|
+
| `trusted` | Official or low-risk capability with policy enabled | May be recommended when confidence is high |
|
|
29
|
+
| `review-required` | Third-party or ecosystem capability | Require source, license, scripts, and revision review |
|
|
30
|
+
| `restricted` | Browser, desktop, or external execution boundary | Require explicit evidence and side-effect boundaries |
|
|
31
|
+
| `blocked` | Disabled by policy or failed safety review | Do not run; use fallback |
|
|
32
|
+
|
|
33
|
+
## Confidence
|
|
34
|
+
|
|
35
|
+
Skill Radar combines:
|
|
36
|
+
|
|
37
|
+
- task keywords and workflow phase
|
|
38
|
+
- changed file patterns
|
|
39
|
+
- local skill installation
|
|
40
|
+
- tool availability
|
|
41
|
+
- trust level
|
|
42
|
+
- policy status
|
|
43
|
+
- frontend/package evidence
|
|
44
|
+
- safety penalties
|
|
45
|
+
|
|
46
|
+
The score is not a promise that the tool will work. It is a routing signal. Any recommendation still needs real evidence before the agent can claim success.
|
|
47
|
+
|
|
48
|
+
## Default Domains
|
|
49
|
+
|
|
50
|
+
| Domain | Typical triggers | Recommended capability types |
|
|
51
|
+
| --- | --- | --- |
|
|
52
|
+
| `ui` | UI, UX, frontend, component, visual, layout | design skills, visual review, screenshot evidence |
|
|
53
|
+
| `browserAutomation` | browser, E2E, Playwright, Chrome, DevTools | web access, browser automation, DevTools evidence |
|
|
54
|
+
| `desktopAutomation` | desktop, GUI, WPS, WeChat, CUA | disabled by default; manual operator fallback |
|
|
55
|
+
| `externalCli` | Codex, Gemini, OpenCode, external agent CLI | disabled by default; dry-run and output evidence |
|
|
56
|
+
| `review` | PR, merge, release, code review | reviewer skills, severity findings |
|
|
57
|
+
| `docs` | docs, README, ADR, governance asset | doc impact and source-of-truth evidence |
|
|
58
|
+
| `discovery` | skill, MCP, tool, capability discovery | find-skills plus safety review |
|
|
59
|
+
|
|
60
|
+
## Evidence Contract
|
|
61
|
+
|
|
62
|
+
Each recommendation carries required evidence. Examples:
|
|
63
|
+
|
|
64
|
+
- UI work: `ui-spec`, `design-rationale`, `screenshot`, `visual-review`
|
|
65
|
+
- Browser work: `browser-evidence`, `console-summary`, `network-summary`, `scenario-result`
|
|
66
|
+
- Desktop work: `operator-boundary`, `desktop-screenshot`, `affected-app`
|
|
67
|
+
- External CLI work: `cli-version-check`, `command`, `exit-code`, `output-summary`
|
|
68
|
+
- Review work: `review-report`, `finding-list`, `severity`
|
|
69
|
+
|
|
70
|
+
If evidence is missing, the final delivery should list the capability as unverified rather than claiming it was used successfully.
|
|
71
|
+
|
|
72
|
+
## Supply-Chain Doctor
|
|
73
|
+
|
|
74
|
+
`scale skill doctor --supply-chain` reviews known skill sources and install commands for:
|
|
75
|
+
|
|
76
|
+
- HTTPS source requirement
|
|
77
|
+
- `curl | bash`, `wget | sh`, `Invoke-Expression`, and `iex` blocking
|
|
78
|
+
- destructive install patterns
|
|
79
|
+
- npm/npx lifecycle script review
|
|
80
|
+
- required source, license, and revision checks
|
|
81
|
+
|
|
82
|
+
This is intentionally conservative. Third-party skills should start in review-required mode and be promoted only after inspection.
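
A few of the checks listed above can be approximated with pattern matching. The sketch below is an illustrative assumption, not the doctor's real rule set: the rule names, regular expressions, and `reviewInstall` function are hypothetical, and it does not cover lifecycle script, license, or revision review.

```typescript
// Hypothetical sketch of install-command screening; not the SCALE rule set.
interface SupplyChainFinding {
  rule: string;
  detail: string;
}

// Remote script piped straight into a shell, e.g. `curl ... | bash`.
const PIPE_TO_SHELL = /\b(curl|wget)\b[^|\n]*\|\s*(sudo\s+)?(ba|z)?sh\b/i;
// PowerShell evaluation of downloaded content.
const POWERSHELL_EVAL = /\b(invoke-expression|iex)\b/i;
// One example of a destructive install pattern.
const DESTRUCTIVE = /\brm\s+-rf\b/i;

function reviewInstall(sourceUrl: string, installCommand: string): SupplyChainFinding[] {
  const findings: SupplyChainFinding[] = [];
  if (!sourceUrl.startsWith("https://")) {
    findings.push({ rule: "https-source", detail: `source is not HTTPS: ${sourceUrl}` });
  }
  if (PIPE_TO_SHELL.test(installCommand)) {
    findings.push({ rule: "pipe-to-shell", detail: "remote script is piped into a shell" });
  }
  if (POWERSHELL_EVAL.test(installCommand)) {
    findings.push({ rule: "powershell-eval", detail: "command uses Invoke-Expression or iex" });
  }
  if (DESTRUCTIVE.test(installCommand)) {
    findings.push({ rule: "destructive-pattern", detail: "command contains rm -rf" });
  }
  return findings;
}

// example.com is a placeholder source, not a real skill registry.
const findings = reviewInstall(
  "http://example.com/skill",
  "curl -fsSL http://example.com/install.sh | bash",
);
console.log(findings.map((f) => f.rule));
```

Pattern checks like these only catch obvious cases, which is consistent with starting third-party skills in `review-required` and relying on manual inspection before promotion.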
|
|
83
|
+
|
|
84
|
+
## Policy Integration
|
|
85
|
+
|
|
86
|
+
Skill Radar reads `.scale/tools.json` through the Tool Policy layer. Defaults:
|
|
87
|
+
|
|
88
|
+
- UI and browser capabilities are enabled but evidence-required.
|
|
89
|
+
- Desktop CUA is disabled by default.
|
|
90
|
+
- External agent CLIs are disabled by default.
|
|
91
|
+
- Browser tools require captured evidence and should stay in approved domains.
|
|
92
|
+
|
|
93
|
+
Use Tool Policy to enable a restricted capability deliberately rather than relying on an agent's assumption.
|
|
94
|
+
|
|
95
|
+
## Fallback Rule
|
|
96
|
+
|
|
97
|
+
Every recommendation must include a fallback. This prevents tool theater:
|
|
98
|
+
|
|
99
|
+
```text
|
|
100
|
+
If the capability is missing, unsafe, low-confidence, or policy-blocked,
|
|
101
|
+
the agent must use the fallback and record why the capability was not used.
|
|
102
|
+
```
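
The fallback rule can be sketched as a single decision function. Everything named here is hypothetical: the `Recommendation` fields, the `decide` function, and the idea of a numeric confidence threshold are assumptions used to illustrate the rule, not the Skill Radar API.

```typescript
// Hypothetical sketch of the fallback rule; field names are assumptions.
type SafetyLevel = "trusted" | "review-required" | "restricted" | "blocked";

interface Recommendation {
  capability: string;
  available: boolean;
  safetyLevel: SafetyLevel;
  policyEnabled: boolean;
  confidence: number;
  fallback: string;
}

interface Decision {
  use: string;
  usedFallback: boolean;
  reason: string | null;
}

// Pick the capability or its fallback, always recording why a fallback was used.
function decide(rec: Recommendation, minConfidence: number): Decision {
  let reason: string | null = null;
  if (!rec.available) reason = "capability is missing";
  else if (rec.safetyLevel === "blocked") reason = "capability failed safety review";
  else if (!rec.policyEnabled) reason = "capability is blocked by policy";
  else if (rec.confidence < minConfidence) reason = "confidence is below threshold";

  if (reason !== null) {
    return { use: rec.fallback, usedFallback: true, reason };
  }
  return { use: rec.capability, usedFallback: false, reason: null };
}

const desktop: Recommendation = {
  capability: "desktop-cua",
  available: true,
  safetyLevel: "restricted",
  policyEnabled: false,
  confidence: 0.9,
  fallback: "manual operator steps",
};
console.log(decide(desktop, 0.6));
```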
|
|
103
|
+
|
|
104
|
+
## Artifact Lifecycle
|
|
105
|
+
|
|
106
|
+
Skill Radar reports can be written into task artifacts:
|
|
107
|
+
|
|
108
|
+
```bash
|
|
109
|
+
scale skill radar \
|
|
110
|
+
--task "Refactor upload page and verify browser flow" \
|
|
111
|
+
--files src/pages/upload.tsx \
|
|
112
|
+
--output docs/worklog/tasks/2026-05-19-upload-refactor/skill-radar.md
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
Keep the report when it is evidence for an M/L/CRITICAL task. Do not commit transient local detection output unless it is part of the reviewed task artifact set.
|
|
@@ -0,0 +1,151 @@
|
|
|
1
|
+
# Workflow Eval Harness
|
|
2
|
+
|
|
3
|
+
Status: implemented baseline
|
|
4
|
+
Since: v0.22 development branch
|
|
5
|
+
|
|
6
|
+
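Workflow Eval Harness provides evidence on whether the workflow actually improves the agent's engineering delivery quality, rather than relying on subjective impressions. It runs lightweight eval suites, records pass@k, fix iterations, tool calls, token estimates, and human correction counts, and keeps a Failure Replay when a case fails.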
|
|
7
|
+
|
|
8
|
+
## Commands
|
|
9
|
+
|
|
10
|
+
Initialize the default baseline suite:
|
|
11
|
+
|
|
12
|
+
```bash
|
|
13
|
+
scale eval init
|
|
14
|
+
scale eval init --suite workflow-baseline --json
|
|
15
|
+
```
|
|
16
|
+
|
|
17
|
+
Run a suite:
|
|
18
|
+
|
|
19
|
+
```bash
|
|
20
|
+
scale eval run --suite workflow-baseline
|
|
21
|
+
scale eval run --suite workflow-baseline --json
|
|
22
|
+
```
|
|
23
|
+
|
|
24
|
+
Compare two runs:
|
|
25
|
+
|
|
26
|
+
```bash
|
|
27
|
+
scale eval compare --baseline <run-id> --candidate <run-id>
|
|
28
|
+
scale eval compare --baseline <run-id> --candidate <run-id> --json
|
|
29
|
+
```
|
|
30
|
+
|
|
31
|
+
Generate a Markdown report:
|
|
32
|
+
|
|
33
|
+
```bash
|
|
34
|
+
scale eval report --run <run-id>
|
|
35
|
+
scale eval report --run <run-id> --output docs/worklog/eval-report.md
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
Inspect and promote failure replays:
|
|
39
|
+
|
|
40
|
+
```bash
|
|
41
|
+
scale eval failures --since 30d
|
|
42
|
+
scale eval replay <failure-id>
|
|
43
|
+
scale eval replay --task-id <task-id>
|
|
44
|
+
scale eval promote-failure <failure-id>
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
## Failure Replay To Memory
|
|
48
|
+
|
|
49
|
+
Failure Replay is local eval evidence first. When a failure pattern is useful for future work, ingest it into Memory Brain as an `incident` candidate:
|
|
50
|
+
|
|
51
|
+
```bash
|
|
52
|
+
scale memory ingest --from failure --failure-id <failure-id>
|
|
53
|
+
scale memory query "missing verification evidence"
|
|
54
|
+
scale memory promote <memory-node-id>
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
This does not auto-change standards or hooks. It only makes the failure queryable and evidence-backed so repeated mistakes can be promoted deliberately after review.
|
|
58
|
+
|
|
59
|
+
## Storage
|
|
60
|
+
|
|
61
|
+
```text
|
|
62
|
+
.scale/evals/
|
|
63
|
+
├── suites/
|
|
64
|
+
├── runs/
|
|
65
|
+
├── failures/
|
|
66
|
+
└── improvements/
|
|
67
|
+
```
|
|
68
|
+
|
|
69
|
+
These files are local runtime evidence by default. Commit only curated summaries or intentional benchmark fixtures.
|
|
70
|
+
|
|
71
|
+
## Suite Shape
|
|
72
|
+
|
|
73
|
+
```json
|
|
74
|
+
{
|
|
75
|
+
"version": "1.0",
|
|
76
|
+
"id": "workflow-baseline",
|
|
77
|
+
"name": "SCALE workflow baseline",
|
|
78
|
+
"cases": [
|
|
79
|
+
{
|
|
80
|
+
"id": "governance-command-smoke",
|
|
81
|
+
"type": "bugfix",
|
|
82
|
+
"title": "Command evidence smoke",
|
|
83
|
+
"task": "Verify that a local command can produce concrete eval evidence.",
|
|
84
|
+
"phase": "verify",
|
|
85
|
+
"successCriteria": ["command exits 0"],
|
|
86
|
+
"attempts": [
|
|
87
|
+
{
|
|
88
|
+
"id": "attempt-1",
|
|
89
|
+
"command": "node -e \"console.log('scale-eval-ok')\"",
|
|
90
|
+
"expectedExitCode": 0,
|
|
91
|
+
"outputContains": "scale-eval-ok"
|
|
92
|
+
}
|
|
93
|
+
]
|
|
94
|
+
}
|
|
95
|
+
]
|
|
96
|
+
}
|
|
97
|
+
```
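
For readers who want to see how an attempt in this shape could be judged, here is a minimal sketch. The `EvalAttempt` interface mirrors the JSON fields above; the `CommandResult` type and `attemptPassed` function are assumptions, and the sketch does not execute the command itself.

```typescript
// Mirrors the attempt fields in the suite JSON above.
interface EvalAttempt {
  id: string;
  command: string;
  expectedExitCode: number;
  outputContains?: string;
}

// Hypothetical shape for the captured result of running the command.
interface CommandResult {
  exitCode: number;
  output: string;
}

// An attempt passes when the exit code matches and, if required,
// the output contains the expected substring.
function attemptPassed(attempt: EvalAttempt, result: CommandResult): boolean {
  if (result.exitCode !== attempt.expectedExitCode) return false;
  if (attempt.outputContains !== undefined && !result.output.includes(attempt.outputContains)) {
    return false;
  }
  return true;
}

const attempt: EvalAttempt = {
  id: "attempt-1",
  command: "node -e \"console.log('scale-eval-ok')\"",
  expectedExitCode: 0,
  outputContains: "scale-eval-ok",
};
console.log(attemptPassed(attempt, { exitCode: 0, output: "scale-eval-ok\n" }));
```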
|
|
98
|
+
|
|
99
|
+
## Metrics
|
|
100
|
+
|
|
101
|
+
| Metric | Meaning |
|
|
102
|
+
| --- | --- |
|
|
103
|
+
| `passAt1Rate` | Share of cases that pass on the first complete attempt |
|
|
104
|
+
| `passAt3Rate` | Share of cases that pass within three attempts |
|
|
105
|
+
| `averageFixIterations` | Average number of fix iterations after the first failure |
|
|
106
|
+
| `totalToolCalls` | Number of eval attempts; a rough proxy for tool-call cost |
|
|
107
|
+
| `estimatedTokens` | Estimated token cost of the task and output summaries |
|
|
108
|
+
| `humanCorrections` | Number of human corrections |
|
|
109
|
+
| `failureReplayCount` | Number of failure replay records |
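
The first three metrics can be derived from ordered attempt results. The sketch below is one plausible reading, not the harness's actual formula: the `CaseOutcome` shape is hypothetical, and it assumes `averageFixIterations` is averaged only over cases whose first attempt failed.

```typescript
// Hypothetical sketch: one entry per eval case, attempts in execution order.
interface CaseOutcome {
  caseId: string;
  attemptResults: boolean[];
}

// Share of cases with at least one pass among the first k attempts.
function passAtKRate(outcomes: CaseOutcome[], k: number): number {
  if (outcomes.length === 0) return 0;
  const passed = outcomes.filter((o) => o.attemptResults.slice(0, k).some((r) => r)).length;
  return passed / outcomes.length;
}

// Assumption: a fix iteration is each attempt made after the first failure,
// counted up to the first pass, and averaged over cases that failed first.
function averageFixIterations(outcomes: CaseOutcome[]): number {
  const failedFirst = outcomes.filter(
    (o) => o.attemptResults.length > 0 && o.attemptResults[0] === false,
  );
  if (failedFirst.length === 0) return 0;
  const total = failedFirst.reduce((sum, o) => {
    const firstPass = o.attemptResults.findIndex((passed) => passed);
    return sum + (firstPass === -1 ? o.attemptResults.length - 1 : firstPass);
  }, 0);
  return total / failedFirst.length;
}

const outcomes: CaseOutcome[] = [
  { caseId: "a", attemptResults: [true] },
  { caseId: "b", attemptResults: [false, true] },
  { caseId: "c", attemptResults: [false, false, false, true] },
  { caseId: "d", attemptResults: [false, false] },
];
console.log(passAtKRate(outcomes, 1), passAtKRate(outcomes, 3), averageFixIterations(outcomes));
```

In this sample, case `c` passes only on its fourth attempt, so it counts toward neither pass@1 nor pass@3.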
|
|
110
|
+
|
|
111
|
+
## Failure Replay
|
|
112
|
+
|
|
113
|
+
A failure records more than the final failed status. It also stores:
|
|
114
|
+
|
|
115
|
+
- task and success criteria
|
|
116
|
+
- phase
|
|
117
|
+
- wrong turn
|
|
118
|
+
- evidence
|
|
119
|
+
- correction
|
|
120
|
+
- prevention
|
|
121
|
+
- replay command
|
|
122
|
+
- redaction status
|
|
123
|
+
|
|
124
|
+
Failure categories currently include:
|
|
125
|
+
|
|
126
|
+
- `wrong-exploration-path`
|
|
127
|
+
- `hallucinated-project-fact`
|
|
128
|
+
- `missing-codegraph-or-graph-fallback`
|
|
129
|
+
- `over-broad-context-load`
|
|
130
|
+
- `bad-skill-recommendation`
|
|
131
|
+
- `missing-verification-evidence`
|
|
132
|
+
- `failed-security-or-resource-gate`
|
|
133
|
+
- `human-correction-after-agent-confidence`
|
|
134
|
+
- `command-failure`
|
|
135
|
+
- `unknown`
|
|
136
|
+
|
|
137
|
+
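`scale eval promote-failure` promotes a failure replay to an improvement candidate but does not modify project standards automatically. Whether it enters the long-term standards still requires confirmation by a human or a later review.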
|
|
138
|
+
|
|
139
|
+
## Governance Use
|
|
140
|
+
|
|
141
|
+
- The default suite in v0.22 is a lightweight smoke baseline that verifies the eval pipeline runs.
|
|
142
|
+
- Real projects should gradually add cases of type bugfix, feature, security, frontend, release, and resource.
|
|
143
|
+
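- Failure Replay should work together with Resource Governance: keep records local by default, and commit only summaries, benchmarks, or cases explicitly meant for long-term maintenance.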
|
|
144
|
+
- Workflow Eval data can feed later Governance ROI reporting to judge whether a governance module actually reduces rework, tool calls, tokens, or human corrections.
|
|
145
|
+
|
|
146
|
+
## Policy
|
|
147
|
+
|
|
148
|
+
- Eval pass rates must not be used as a substitute for real project verification.
|
|
149
|
+
- Command output in failure records gets basic redaction, but you should still avoid writing sensitive raw logs into a suite.
|
|
150
|
+
- Low-cost smoke suites can run frequently; heavyweight project suites should run on demand.
|
|
151
|
+
- Without eval evidence, do not claim that workflow capability has improved.
|
package/docs/start/README.md
CHANGED
|
@@ -13,7 +13,10 @@
|
|
|
13
13
|
3. Return to the root [README](../../README.md)
|
|
14
14
|
Understand SCALE Engine's core capabilities and how to choose a governance pack.
|
|
15
15
|
|
|
16
|
-
4.
|
|
16
|
+
4. [Upgrade Management](../UPGRADE_MANAGEMENT.md)
|
|
17
|
+
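Understand how to check first, generate a plan, and avoid overwriting local changes when updating workflows or third-party skills, MCP, and CLI tools.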
|
|
18
|
+
|
|
19
|
+
5. See the [documentation map](../README.md)
|
|
17
20
|
Tell apart which documents are user guides, which are reference material, and which are historical plans and process records.
|
|
18
21
|
|
|
19
22
|
## What You Should See First
|
|
@@ -39,4 +42,5 @@
|
|
|
39
42
|
| Go multi-service backend | `scale init --governance-pack go-service-matrix` |
|
|
40
43
|
| Multi-repo/MOE workspace | `scale init --governance-pack moe-workspace` |
|
|
41
44
|
| Disorganized docs, reports, screenshots, and scripts | `scale init --governance-pack resource-governance` |
|
|
45
|
+
| Workflow or third-party capabilities need upgrading | `scale upgrade check && scale upgrade plan --html` |
|
|
42
46
|
|
|
@@ -0,0 +1,14 @@
|
|
|
1
|
+
# CONTEXT.md
|
|
2
|
+
|
|
3
|
+
Project: Agent Governance Demo
|
|
4
|
+
|
|
5
|
+
| Term | Definition | Examples | Aliases | Source |
|
|
6
|
+
|------|------------|----------|---------|--------|
|
|
7
|
+
| OAuth state | One-time callback correlation value that binds authorization return traffic to a user session | `state-123` | callback state | `src/oauth-state.ts` |
|
|
8
|
+
| Consumed state | A state record that has already been used and must not be accepted again | `consumedAt: 900` | replayed state | `tests/oauth-state.test.ts` |
|
|
9
|
+
| Evidence | A command result or artifact that proves what was verified | `npm test`, eval report, dashboard | verification proof | SCALE workflow |
|
|
10
|
+
|
|
11
|
+
## Rejected Meanings
|
|
12
|
+
|
|
13
|
+
- Do not treat an expired state as recoverable without a new authorization flow.
|
|
14
|
+
- Do not treat a dashboard or eval report as a substitute for the business test.
|