@hongmaple0820/scale-engine 0.26.0 → 0.27.1
- package/README.en.md +71 -3
- package/README.md +71 -3
- package/dist/api/cli.js +269 -12
- package/dist/api/cli.js.map +1 -1
- package/dist/cli/phaseCommands.js +8 -8
- package/dist/cli/phaseCommands.js.map +1 -1
- package/dist/context/ContextBudget.d.ts +14 -0
- package/dist/context/ContextBudget.js +50 -14
- package/dist/context/ContextBudget.js.map +1 -1
- package/dist/context/ContextCompiler.d.ts +34 -0
- package/dist/context/ContextCompiler.js +120 -0
- package/dist/context/ContextCompiler.js.map +1 -0
- package/dist/eval/WorkflowEval.js +4 -6
- package/dist/eval/WorkflowEval.js.map +1 -1
- package/dist/governance/GovernanceRoi.d.ts +6 -1
- package/dist/governance/GovernanceRoi.js +32 -0
- package/dist/governance/GovernanceRoi.js.map +1 -1
- package/dist/guardrails/DependencyAuditor.js +38 -0
- package/dist/guardrails/DependencyAuditor.js.map +1 -1
- package/dist/index.d.ts +1 -0
- package/dist/index.js +1 -0
- package/dist/index.js.map +1 -1
- package/dist/runtime/AiOsRuntime.d.ts +269 -0
- package/dist/runtime/AiOsRuntime.js +840 -0
- package/dist/runtime/AiOsRuntime.js.map +1 -0
- package/dist/runtime/index.d.ts +1 -0
- package/dist/runtime/index.js +1 -0
- package/dist/runtime/index.js.map +1 -1
- package/dist/skills/routing/SkillPlanner.js +91 -3
- package/dist/skills/routing/SkillPlanner.js.map +1 -1
- package/dist/skills/routing/SkillRoutingTypes.d.ts +17 -0
- package/dist/tools/SafeCommandRunner.d.ts +16 -0
- package/dist/tools/SafeCommandRunner.js +83 -0
- package/dist/tools/SafeCommandRunner.js.map +1 -0
- package/dist/workflow/UpgradeManager.d.ts +4 -1
- package/dist/workflow/UpgradeManager.js +26 -0
- package/dist/workflow/UpgradeManager.js.map +1 -1
- package/dist/workflow/gates/GateSystem.js +3 -9
- package/dist/workflow/gates/GateSystem.js.map +1 -1
- package/docs/AI_ENGINEERING_OS_POSITIONING.md +560 -0
- package/docs/CONTEXT_BUDGET.md +43 -1
- package/docs/DEPENDENCY_AUDIT.md +29 -0
- package/docs/MEMORY_FABRIC.md +2 -0
- package/docs/README.md +1 -0
- package/docs/SKILL_RADAR.md +13 -0
- package/package.json +9 -2
|
@@ -0,0 +1,560 @@
|
|
|
1
|
+
# SCALE Engine Strategic Positioning
|
|
2
|
+
|
|
3
|
+
> Date: 2026-05-20
|
|
4
|
+
> Status: strategic direction with a v0.27.0 runtime baseline
|
|
5
|
+
> Audience: maintainers, contributors, roadmap reviewers, and product-facing documentation owners
|
|
6
|
+
|
|
7
|
+
SCALE Engine should be positioned as an **Agent Governance Runtime** evolving toward an **AI Engineering OS**.
|
|
8
|
+
|
|
9
|
+
The project is no longer best described as a prompt toolbox. Its durable value is the runtime layer around AI coding agents:
|
|
10
|
+
|
|
11
|
+
- workflow state machines
|
|
12
|
+
- hard gates and verification evidence
|
|
13
|
+
- hook-based tool interception
|
|
14
|
+
- role and permission boundaries
|
|
15
|
+
- artifact persistence
|
|
16
|
+
- context budgets
|
|
17
|
+
- memory provider routing
|
|
18
|
+
- skill, MCP, CLI, and adapter orchestration
|
|
19
|
+
|
|
20
|
+
The core thesis is:
|
|
21
|
+
|
|
22
|
+
> Use system constraints, evidence, and runtime gates to replace agent self-discipline.
|
|
23
|
+
|
|
24
|
+
This positioning is intentionally stronger than "prompt engineering", but it must stay evidence-backed. SCALE can describe the direction as AI Engineering OS; it should only describe measurable gains after benchmark, eval, or runtime evidence exists.
|
|
25
|
+
|
|
26
|
+
## Reference Inputs
|
|
27
|
+
|
|
28
|
+
This document consolidates:
|
|
29
|
+
|
|
30
|
+
- the current SCALE Engine architecture and documentation surfaces
|
|
31
|
+
- maintainer review of SCALE as an Agent Governance Runtime
|
|
32
|
+
- the external Harness Engineering framing in [SCALE OS v10.0: AI 编码的认知操作系统](https://segmentfault.com/a/1190000047756584) (title in English: "SCALE OS v10.0: A Cognitive Operating System for AI Coding")
|
|
33
|
+
|
|
34
|
+
External performance claims from ecosystem articles are positioning inputs, not SCALE Engine release claims. Public SCALE claims should still be backed by local evals, runtime evidence, or release reports.
|
|
35
|
+
|
|
36
|
+
## 1. Market Category
|
|
37
|
+
|
|
38
|
+
SCALE sits between four emerging categories:
|
|
39
|
+
|
|
40
|
+
| Category | SCALE relationship |
| --- | --- |
| AI Engineering OS | Long-term positioning: one governed operating layer for agent-driven engineering |
| Agent Governance Runtime | Current strongest fit: gates, hooks, evidence, role boundaries, and policy enforcement |
| Workflow Orchestration Runtime | Current fit: FSM, phase commands, artifacts, verification, and ship flow |
| Harness Engineering Infrastructure | Methodology fit: constraints + feedback + workflow + continuous improvement |
|
|
46
|
+
|
|
47
|
+
SCALE should avoid being framed as only:
|
|
48
|
+
|
|
49
|
+
- prompt templates
|
|
50
|
+
- agent rules
|
|
51
|
+
- a Claude/Cursor/Codex config generator
|
|
52
|
+
- an AutoGPT-style chain executor
|
|
53
|
+
- a generic skills catalog
|
|
54
|
+
|
|
55
|
+
Those may exist around the project, but they are not the core defensible layer.
|
|
56
|
+
|
|
57
|
+
## 2. Problem Definition
|
|
58
|
+
|
|
59
|
+
AI coding failures are not only model-quality failures. They are engineering runtime failures:
|
|
60
|
+
|
|
61
|
+
| Failure mode | Typical prompt-only response | SCALE response |
| --- | --- | --- |
| Fake completion | "Please verify before finishing" | verification gates and final evidence checks |
| Skipped tests | reminder text | FSM and verification status before completion |
| Repeated blind retries | "try a different approach" | retry and behavior detectors |
| Context overload | longer instructions | context budgets, lazy loading, scoped packs |
| Agent drift | more rules | persisted workflow state and phase boundaries |
| Hallucinated delivery | review prompt | runtime evidence ledger and ship gates |
| Lost learning | chat history | memory artifacts, failure replay, lessons, rule candidates |
| Multi-agent confusion | role descriptions | role gateway and tool permission boundaries |
| Tool overreach | trust agent judgment | hook interception and policy gateway |
|
|
72
|
+
|
|
73
|
+
The strategic target is not to make the model "more obedient". The target is to make non-compliant behavior observable, blockable, and recoverable.
|
|
74
|
+
|
|
75
|
+
## 3. Current Strengths
|
|
76
|
+
|
|
77
|
+
### 3.1 Runtime Constraints
|
|
78
|
+
|
|
79
|
+
SCALE already has the right architectural instinct: move critical rules out of prompt text and into runtime checks.
|
|
80
|
+
|
|
81
|
+
Relevant surfaces:
|
|
82
|
+
|
|
83
|
+
- `docs/ENGINEERING_STANDARDS.md`
|
|
84
|
+
- `docs/RUNTIME_EVIDENCE.md`
|
|
85
|
+
- `docs/DEPENDENCY_AUDIT.md`
|
|
86
|
+
- `src/workflow/gates/GateSystem.ts`
|
|
87
|
+
- `src/guardrails/Gateway.ts`
|
|
88
|
+
- `src/artifact/fsm.ts`
|
|
89
|
+
|
|
90
|
+
This is the primary moat. Prompt rules can be ignored; runtime gates can block progress.
|
|
91
|
+
|
|
92
|
+
### 3.2 Workflow State Machine
|
|
93
|
+
|
|
94
|
+
The workflow is driven by artifact state, not by chat momentum.
|
|
95
|
+
|
|
96
|
+
Strategic value:
|
|
97
|
+
|
|
98
|
+
- prevents premature completion
|
|
99
|
+
- forces phase-specific evidence
|
|
100
|
+
- makes stalled or skipped phases visible
|
|
101
|
+
- supports resume and handoff across long sessions
|
|
102
|
+
- gives agent platforms a shared lifecycle model
|
|
103
|
+
|
|
104
|
+
The FSM should remain strict at phase boundaries and flexible inside each phase.
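
The "strict at phase boundaries, flexible inside each phase" rule can be illustrated with a minimal sketch. The phase names, the evidence labels, and the `advance` helper are assumptions made for illustration; they do not reproduce `src/artifact/fsm.ts`.

```typescript
// Hypothetical phase names; the real FSM in src/artifact/fsm.ts may differ.
type Phase = "plan" | "implement" | "verify" | "ship";

const PHASE_ORDER: Phase[] = ["plan", "implement", "verify", "ship"];

// Evidence that must exist before the workflow may leave a phase (assumed labels).
const EXIT_EVIDENCE: Record<Phase, string[]> = {
  plan: ["plan-artifact"],
  implement: ["changed-files"],
  verify: ["test-report"],
  ship: [],
};

interface PhaseTransition {
  allowed: boolean;
  phase: Phase;      // phase after the attempt (unchanged when blocked)
  missing: string[]; // evidence that blocked the transition
}

// Work inside a phase is unconstrained; only the boundary is checked.
function advance(phase: Phase, evidence: string[]): PhaseTransition {
  const missing = EXIT_EVIDENCE[phase].filter((e) => !evidence.includes(e));
  if (missing.length > 0) {
    return { allowed: false, phase, missing };
  }
  // The last phase has no successor, so it stays where it is.
  const next = PHASE_ORDER[PHASE_ORDER.indexOf(phase) + 1] ?? phase;
  return { allowed: true, phase: next, missing: [] };
}
```

Calling `advance("verify", [])` reports the missing `test-report` instead of moving on, which is the kind of visible, blockable state the section describes.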
|
|
105
|
+
|
|
106
|
+
### 3.3 Hook and Gateway Layer
|
|
107
|
+
|
|
108
|
+
Hooks, pre-tool checks, post-tool checks, stop checks, and role-aware gateway decisions form the AI runtime interceptor layer.
|
|
109
|
+
|
|
110
|
+
Strategic value:
|
|
111
|
+
|
|
112
|
+
- agents do not receive raw, unlimited tool authority
|
|
113
|
+
- unsafe operations can be blocked before execution
|
|
114
|
+
- tool output can be converted into evidence
|
|
115
|
+
- repeated failure patterns can be detected outside the model
|
|
116
|
+
|
|
117
|
+
This layer makes SCALE closer to an admission controller than a prompt pack.
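
As a rough sketch of the admission-controller idea, a pre-tool check can decide on a tool call before it runs. The `ToolCall` shape, the role table, and the deny patterns below are illustrative assumptions, not SCALE's hook payloads or policies.

```typescript
// Hypothetical shapes; SCALE's actual hook payloads are not shown in this document.
interface ToolCall {
  role: string;
  tool: string;
  command: string;
}

interface GateDecision {
  allow: boolean;
  reason: string;
}

// Assumed policy: which tools each role may invoke.
const ROLE_TOOLS: Record<string, string[]> = {
  reviewer: ["read", "grep"],
  implementer: ["read", "grep", "edit", "shell"],
};

// Assumed deny patterns for shell commands.
const DENY_PATTERNS: RegExp[] = [
  /\brm\s+-rf\s+\//,
  /\bgit\s+push\s+.*--force\b/,
];

// Runs before the tool executes, outside the model's control.
function preToolCheck(call: ToolCall): GateDecision {
  const allowedTools = ROLE_TOOLS[call.role] ?? [];
  if (!allowedTools.includes(call.tool)) {
    return { allow: false, reason: `role "${call.role}" may not use tool "${call.tool}"` };
  }
  const hit = DENY_PATTERNS.find((pattern) => pattern.test(call.command));
  if (hit !== undefined) {
    return { allow: false, reason: `command matches deny pattern ${hit.source}` };
  }
  return { allow: true, reason: "within role and policy" };
}
```

The decision carries a reason so that a blocked call can be recorded as evidence rather than silently dropped.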
|
|
118
|
+
|
|
119
|
+
### 3.4 Evidence-Backed Delivery
|
|
120
|
+
|
|
121
|
+
SCALE's strongest anti-hallucination capability is control over engineering hallucination, meaning delivery claims that lack evidence:
|
|
122
|
+
|
|
123
|
+
- no test evidence means no verified claim
|
|
124
|
+
- no runtime evidence means no product-smoke claim
|
|
125
|
+
- no reviewed file scope means no governed ship
|
|
126
|
+
- no dependency audit evidence means weaker security confidence
|
|
127
|
+
|
|
128
|
+
This reduces fake completion more reliably than instruction text.
|
|
129
|
+
|
|
130
|
+
It does not fully solve reasoning hallucination. Architecture decisions, root-cause analysis, and technical tradeoffs still need evaluator intelligence.
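
The four evidence rules listed in this section can be written as a small lookup. This is a sketch only: the type names, evidence labels, and the `"normal"`/`"weaker"` values are assumptions chosen to mirror the wording above.

```typescript
type EvidenceKind = "test" | "runtime" | "file-scope-review" | "dependency-audit";
type DeliveryClaim = "verified" | "product-smoke" | "governed-ship";

// Claim -> evidence it requires, taken from the list in this section.
const REQUIRED_EVIDENCE: Record<DeliveryClaim, EvidenceKind> = {
  verified: "test",
  "product-smoke": "runtime",
  "governed-ship": "file-scope-review",
};

interface ClaimReport {
  allowedClaims: DeliveryClaim[];
  rejectedClaims: DeliveryClaim[];
  securityConfidence: "normal" | "weaker";
}

function checkClaims(claims: DeliveryClaim[], evidence: EvidenceKind[]): ClaimReport {
  const has = (kind: EvidenceKind): boolean => evidence.includes(kind);
  return {
    allowedClaims: claims.filter((c) => has(REQUIRED_EVIDENCE[c])),
    rejectedClaims: claims.filter((c) => !has(REQUIRED_EVIDENCE[c])),
    // A missing dependency audit does not reject a claim; it lowers confidence.
    securityConfidence: has("dependency-audit") ? "normal" : "weaker",
  };
}
```

Note that the check says nothing about whether the reasoning behind a change is sound, which is the limit this section acknowledges.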
|
|
131
|
+
|
|
132
|
+
### 3.5 Adapter and Platform Surface
|
|
133
|
+
|
|
134
|
+
The agent-platform adapters let SCALE act as a shared governance layer for different coding agents.
|
|
135
|
+
|
|
136
|
+
Strategic value:
|
|
137
|
+
|
|
138
|
+
- one governance model across Claude Code, Codex, Cursor, Gemini, Windsurf, Kiro, Cline, and related tools
|
|
139
|
+
- fewer duplicated rule files
|
|
140
|
+
- lower switching cost between agents
|
|
141
|
+
- consistent evidence and workflow semantics
|
|
142
|
+
|
|
143
|
+
Adapter expansion should not become the main roadmap by itself. The strategic value comes from shared governance semantics, not from the count of supported agents.
|
|
144
|
+
|
|
145
|
+
## 4. Honest Capability Assessment
|
|
146
|
+
|
|
147
|
+
SCALE can already claim:
|
|
148
|
+
|
|
149
|
+
- stronger engineering governance than prompt-only rules
|
|
150
|
+
- structured workflow execution with phase and artifact state
|
|
151
|
+
- hard verification gates for delivery claims
|
|
152
|
+
- evidence-based runtime reporting
|
|
153
|
+
- first-class supply-chain audit direction
|
|
154
|
+
- growing adapter coverage
|
|
155
|
+
- memory and skill orchestration foundations
|
|
156
|
+
|
|
157
|
+
SCALE should not yet overclaim:
|
|
158
|
+
|
|
159
|
+
- fully autonomous self-evolution
|
|
160
|
+
- human-level long-term memory
|
|
161
|
+
- guaranteed token reduction percentages
|
|
162
|
+
- guaranteed hallucination reduction percentages
|
|
163
|
+
- adaptive cognitive planning
|
|
164
|
+
- universal skill routing intelligence
|
|
165
|
+
|
|
166
|
+
Use target ranges only in roadmap or evaluation documents, not as product claims, until eval evidence supports them.
|
|
167
|
+
|
|
168
|
+
## 5. Current Gaps
|
|
169
|
+
|
|
170
|
+
### 5.1 Memory Architecture
|
|
171
|
+
|
|
172
|
+
The current state is closer to engineering knowledge persistence than to true cognitive memory.
|
|
173
|
+
|
|
174
|
+
Existing strengths:
|
|
175
|
+
|
|
176
|
+
- artifacts persist decisions and work state
|
|
177
|
+
- memory brain stores evidence-backed learnings
|
|
178
|
+
- failure replay can preserve incidents
|
|
179
|
+
- provider routing gives the right extension point
|
|
180
|
+
|
|
181
|
+
Missing layers:
|
|
182
|
+
|
|
183
|
+
| Memory type | Target meaning |
| --- | --- |
| Working memory | short-lived task context with strict token budget |
| Episodic memory | past task episodes, failures, fixes, and outcomes |
| Semantic memory | stable project facts and domain concepts |
| Procedural memory | reusable ways of doing work |
| Strategy memory | learned routing, verification, and recovery strategies |
|
|
190
|
+
|
|
191
|
+
The next memory work should focus on provider-backed retrieval quality, not more local file accumulation.
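
One way to picture provider-backed recall with a quality floor is sketched below. The `MemoryProvider` interface, the synchronous `recall` signature, and the `minScore` threshold are assumptions; real providers would likely be asynchronous and expose richer metadata.

```typescript
interface MemoryItem {
  text: string;
  score: number;  // provider-reported relevance, assumed to be in [0, 1]
  source: string; // evidence source the memory traces back to
}

// Hypothetical provider contract.
interface MemoryProvider {
  name: string;
  recall(query: string): MemoryItem[];
}

interface RecallResult {
  provider: string | null; // null means "proceed without memory"
  skipped: string[];       // providers tried that returned nothing usable
  items: MemoryItem[];
}

// Try providers in order and keep only items above the quality floor.
function recallWithFallback(
  providers: MemoryProvider[],
  query: string,
  minScore: number,
): RecallResult {
  const skipped: string[] = [];
  for (const provider of providers) {
    const items = provider.recall(query).filter((m) => m.score >= minScore);
    if (items.length > 0) {
      return { provider: provider.name, skipped, items };
    }
    skipped.push(provider.name);
  }
  // No provider produced usable recall: the result says so explicitly.
  return { provider: null, skipped, items: [] };
}
```

Recording which providers were skipped keeps the fallback decision explainable instead of hiding it inside the router.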
|
|
192
|
+
|
|
193
|
+
### 5.2 Context Compiler
|
|
194
|
+
|
|
195
|
+
SCALE has context structure and budgets. It does not yet have a full context compiler.
|
|
196
|
+
|
|
197
|
+
Current capability:
|
|
198
|
+
|
|
199
|
+
- categorize context
|
|
200
|
+
- budget context
|
|
201
|
+
- lazy-load selected material
|
|
202
|
+
- assemble role/task-specific packs
|
|
203
|
+
|
|
204
|
+
Target capability:
|
|
205
|
+
|
|
206
|
+
- rank relevance
|
|
207
|
+
- slice semantically
|
|
208
|
+
- compress adaptively
|
|
209
|
+
- route retrieval by task intent
|
|
210
|
+
- explain why each context item was included
|
|
211
|
+
- measure token saved vs evidence lost
|
|
212
|
+
|
|
213
|
+
This is the highest-leverage path for token reduction.
|
|
214
|
+
|
|
215
|
+
### 5.3 Adaptive Workflow
|
|
216
|
+
|
|
217
|
+
The current workflow is mostly rule-driven.
|
|
218
|
+
|
|
219
|
+
The target workflow should adapt based on:
|
|
220
|
+
|
|
221
|
+
- task risk
|
|
222
|
+
- code ownership boundaries
|
|
223
|
+
- prior failure rate
|
|
224
|
+
- changed-file blast radius
|
|
225
|
+
- missing evidence
|
|
226
|
+
- tool reliability
|
|
227
|
+
- agent capability confidence
|
|
228
|
+
|
|
229
|
+
The system should not make every task heavy. It should apply stricter gates when risk rises and keep small documentation or config changes lightweight.
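
A minimal sketch of risk-adaptive gating follows. The signal names and thresholds are illustrative assumptions, not SCALE defaults; the `light`/`standard`/`strict` labels follow the profile names used later in this document.

```typescript
type GateProfile = "light" | "standard" | "strict";

interface RiskSignals {
  changedFiles: string[];
  priorFailureRate: number;      // 0..1, assumed to come from run history
  touchesSensitivePath: boolean; // e.g. auth or payment code (assumed signal)
}

// Thresholds below are illustrative assumptions.
function chooseProfile(signals: RiskSignals): GateProfile {
  const docsOnly =
    signals.changedFiles.length > 0 &&
    signals.changedFiles.every((file) => file.endsWith(".md"));
  if (signals.touchesSensitivePath || signals.priorFailureRate >= 0.3) {
    return "strict";
  }
  if (docsOnly && signals.changedFiles.length <= 3) {
    return "light";
  }
  // A large blast radius escalates even without other risk signals.
  return signals.changedFiles.length > 10 ? "strict" : "standard";
}
```

Small documentation changes stay lightweight, while sensitive paths or a poor failure history escalate the gates automatically.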
|
|
230
|
+
|
|
231
|
+
### 5.4 Skill Routing Intelligence
|
|
232
|
+
|
|
233
|
+
SCALE already models skills, MCP, CLI, browser, desktop automation, and evidence requirements.
|
|
234
|
+
|
|
235
|
+
The missing layer is strategy:
|
|
236
|
+
|
|
237
|
+
- when to call a skill
|
|
238
|
+
- why that skill is preferred
|
|
239
|
+
- what evidence it must produce
|
|
240
|
+
- what to do when it fails
|
|
241
|
+
- when to switch to MCP or CLI
|
|
242
|
+
- when to avoid tool use entirely
|
|
243
|
+
|
|
244
|
+
Skill routing should become a planned execution graph, not an ad hoc recommendation list.
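
To make "planned execution graph" concrete, here is a small sketch in which each step states why it was chosen, what evidence it must produce, and where to go when it fails. The `PlanStep` fields and the `executePlan` helper are hypothetical and are not taken from `SkillPlanner`.

```typescript
type StepKind = "skill" | "mcp" | "cli";

interface PlanStep {
  id: string;
  kind: StepKind;
  why: string;                // why this step is preferred
  requiredEvidence: string[]; // what it must produce
  fallbackId?: string;        // where to go when it fails
}

interface ExecutionTrace {
  visited: string[];
  succeeded: boolean;
}

// Walk the graph from a start step, following fallbacks on failure.
function executePlan(
  steps: PlanStep[],
  startId: string,
  runStep: (step: PlanStep) => boolean,
): ExecutionTrace {
  const findStep = (id: string | undefined): PlanStep | undefined =>
    steps.find((s) => s.id === id);
  const visited: string[] = [];
  let current = findStep(startId);
  // The visited check guards against fallback cycles.
  while (current !== undefined && !visited.includes(current.id)) {
    visited.push(current.id);
    if (runStep(current)) {
      return { visited, succeeded: true };
    }
    current = findStep(current.fallbackId);
  }
  return { visited, succeeded: false };
}
```

Because the fallback edges are part of the plan, "when to switch to MCP or CLI" becomes data that can be reviewed, rather than a decision improvised mid-run.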
|
|
245
|
+
|
|
246
|
+
### 5.5 Evaluator Intelligence
|
|
247
|
+
|
|
248
|
+
Current gates are strong for engineering completion, but weaker for reasoning quality.
|
|
249
|
+
|
|
250
|
+
Needed evaluator layers:
|
|
251
|
+
|
|
252
|
+
- critique loop for architecture and root cause
|
|
253
|
+
- uncertainty scoring
|
|
254
|
+
- adversarial review on high-risk changes
|
|
255
|
+
- tradeoff comparison
|
|
256
|
+
- failure hypothesis ranking
|
|
257
|
+
- "evidence is insufficient" verdicts
|
|
258
|
+
|
|
259
|
+
This is the path to reducing reasoning hallucination rather than only delivery hallucination.
|
|
260
|
+
|
|
261
|
+
### 5.6 Self-Optimization Loop
|
|
262
|
+
|
|
263
|
+
Evolution should mean more than summarizing lessons.
|
|
264
|
+
|
|
265
|
+
The target loop:
|
|
266
|
+
|
|
267
|
+
```text
failure evidence
  -> defect record
  -> root-cause classification
  -> lesson candidate
  -> rule candidate
  -> hook or gate proposal
  -> shadow validation
  -> regression check
  -> promoted governance behavior
```
|
|
278
|
+
|
|
279
|
+
The promotion step must remain evidence-backed. Automatically generating rules without validation risks turning mistakes into permanent friction.
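
The shadow-validation step can be sketched as replaying a rule candidate against recorded runs before it is allowed to block anything. Modelling runs as command strings, the `blocks` predicate, and the false-positive budget are all assumptions for illustration.

```typescript
interface RuleCandidate {
  id: string;
  sourceFailureIds: string[];       // failure evidence the rule traces back to
  blocks: (run: string) => boolean; // hypothetical predicate over a recorded run
}

interface ShadowReport {
  candidateId: string;
  caughtFailures: number;
  falsePositives: number;
  promote: boolean;
}

// Replay the candidate against known-bad and known-good runs without enforcing it.
function shadowValidate(
  candidate: RuleCandidate,
  knownBadRuns: string[],
  knownGoodRuns: string[],
  maxFalsePositives: number,
): ShadowReport {
  const caughtFailures = knownBadRuns.filter((r) => candidate.blocks(r)).length;
  const falsePositives = knownGoodRuns.filter((r) => candidate.blocks(r)).length;
  return {
    candidateId: candidate.id,
    caughtFailures,
    falsePositives,
    // Promote only rules that trace to evidence, catch something, and stay quiet on good runs.
    promote:
      candidate.sourceFailureIds.length > 0 &&
      caughtFailures > 0 &&
      falsePositives <= maxFalsePositives,
  };
}
```

An overly broad rule fails this check because it also blocks good runs, which is the "permanent friction" risk the paragraph above warns about.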
|
|
280
|
+
|
|
281
|
+
## 6. Roadmap Direction
|
|
282
|
+
|
|
283
|
+
### 6.1 Planning Principle
|
|
284
|
+
|
|
285
|
+
The roadmap has two horizons:
|
|
286
|
+
|
|
287
|
+
| Horizon | Purpose | Claim boundary |
| --- | --- | --- |
| One-week closure | deliver a runnable AI OS beta loop that can be tested in real repositories | "usable beta", not "stable final OS" |
| Long-range vision | keep SCALE moving toward an AI Engineering OS with memory, context, governance, and tool intelligence | directional until backed by eval data |
|
|
291
|
+
|
|
292
|
+
The near-term work should be aggressive, but public wording must stay precise. SCALE can ship beta capabilities quickly; it should only claim stable, industry-leading AI OS behavior after repeated project evidence, benchmarks, and upgrade validation.
|
|
293
|
+
|
|
294
|
+
### 6.2 0.27.0: Cognitive Runtime Layer
|
|
295
|
+
|
|
296
|
+
Theme: make context, memory, and skill use more intelligent and explainable.
|
|
297
|
+
|
|
298
|
+
Core work:
|
|
299
|
+
|
|
300
|
+
| Module | Outcome |
| --- | --- |
| Context Compiler | relevance-ranked, budgeted, explainable context packs |
| Memory Provider Runtime | gbrain, agentmemory, code memory, and local memory as provider choices |
| Skill Routing Engine | task-intent routing with evidence requirements and fallback decisions |
| Governance ROI | quantify token cost, evidence quality, and gate friction |
|
|
306
|
+
|
|
307
|
+
Implemented baseline in v0.27.0:
|
|
308
|
+
|
|
309
|
+
```bash
scale ai-os plan \
  --task-id TASK-123 \
  --task "Fix OAuth callback auth token handling and verify browser flow" \
  --level L \
  --files src/auth/oauth.ts,src/ui/callback.tsx \
  --budget 8000 \
  --json
```
|
|
318
|
+
|
|
319
|
+
The command returns one runtime plan containing:
|
|
320
|
+
|
|
321
|
+
- `governance`: progressive mode, risk signals, required behaviors
|
|
322
|
+
- `context`: Context Compiler ranking, included sections, omitted sections, token savings
|
|
323
|
+
- `memory`: provider order, selected providers, fallback status, recalled items, memory context pack
|
|
324
|
+
- `skillPlan`: detected intents plus executable skill/artifact/verification steps
|
|
325
|
+
- `adaptiveWorkflow`: risk-adaptive gates and exit criteria for the task
|
|
326
|
+
- `roi`: benefit and overhead modules for context, memory, skill routing, and governance
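
For callers that consume the `--json` output, the plan can be thought of as one object with six sections. In the sketch below only the six top-level keys come from the list above; every nested field name is an assumption, and the authoritative types are the ones shipped in `dist/runtime/AiOsRuntime.d.ts`, which this document does not show.

```typescript
// Top-level keys come from the list above; every nested field is an assumption.
interface AiOsPlanSketch {
  governance: { mode: string; riskSignals: string[]; requiredBehaviors: string[] };
  context: { included: string[]; omitted: string[]; estimatedTokenSavings: number };
  memory: { providerOrder: string[]; selected: string[]; usedFallback: boolean };
  skillPlan: { intents: string[]; steps: string[] };
  adaptiveWorkflow: { gates: string[]; exitCriteria: string[] };
  roi: { benefits: string[]; overheads: string[] };
}

// Report which sections are absent from a parsed plan, e.g. after JSON.parse.
function missingSections(plan: Partial<AiOsPlanSketch>): string[] {
  const keys: (keyof AiOsPlanSketch)[] = [
    "governance",
    "context",
    "memory",
    "skillPlan",
    "adaptiveWorkflow",
    "roi",
  ];
  return keys.filter((key) => plan[key] === undefined);
}
```

A check like `missingSections` is a cheap guard for scripts that depend on the plan shape staying stable across beta releases.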
|
|
327
|
+
|
|
328
|
+
Exit criteria:
|
|
329
|
+
|
|
330
|
+
- each context item has an inclusion reason: baseline implemented by `ContextCompiler`
|
|
331
|
+
- memory recall has provider, score, and evidence source: baseline implemented by Memory Provider Router
|
|
332
|
+
- skill recommendations include why, when, and required proof: baseline implemented by skill execution plans
|
|
333
|
+
- context pack generation reports token budget and omissions: baseline implemented by `context.pack.compiler`
|
|
334
|
+
|
|
335
|
+
### 6.3 0.27.x: One-Week AI OS Beta Closure
|
|
336
|
+
|
|
337
|
+
Theme: turn `ai-os plan` into a runnable beta loop.
|
|
338
|
+
|
|
339
|
+
Target timebox: one week.
|
|
340
|
+
|
|
341
|
+
Core work:
|
|
342
|
+
|
|
343
|
+
| Module | Outcome |
| --- | --- |
| `scale ai-os run` | execute the unified plan through workflow, context, memory, skill routing, and verification steps |
| Memory Provider Bridge | make gbrain, agentmemory, code memory, and local memory selectable through one provider contract |
| Context Compiler v2 | merge task intent, risk level, files, memory recall, and role into one explainable context pack |
| Skill Router v2 | create an execution graph for skills, MCP tools, CLIs, artifacts, and required evidence |
| Adaptive Workflow Profiles | choose light, standard, or strict gates from risk and changed-file signals |
| Failure Learning | convert failed gates, test failures, and missing evidence into lesson and rule candidates |
| AI OS Dashboard CLI | summarize gate health, memory hits, context budget, skill evidence, and ROI |
| Upgrade/Migration | migrate older `.scale` state and warn about incompatible local governance files |
| AI OS Doctor | check runtime directories, run history, dashboard health, and benchmark freshness before adoption or release |
| Bilingual DX | keep key CLI help, errors, README guidance, and tutorials readable in Chinese and English |
| Benchmark Pack | run fixed samples for token budget, recall, gate pass rate, and skill-routing evidence |
|
|
356
|
+
|
|
357
|
+
Exit criteria:
|
|
358
|
+
|
|
359
|
+
- `scale ai-os run` can complete at least one documentation task and one code task in dry-run or guarded execution mode
|
|
360
|
+
- execution output records context decisions, memory provider choices, skill decisions, gate results, and failure lessons
|
|
361
|
+
- benchmark output compares context token budget against a full-load baseline
|
|
362
|
+
- beta docs clearly state what is automated, what is proposed, and what still requires human approval
|
|
363
|
+
|
|
364
|
+
Current landing status:
|
|
365
|
+
|
|
366
|
+
- `scale ai-os run --dry-run` exists as the first beta slice.
|
|
367
|
+
- The dry run reuses `createAiOsPlan` and expands the plan into run steps, evidence requirements, and next actions, then persists a `.scale/ai-os/runs/<task-id>.json` report.
|
|
368
|
+
- `scale ai-os run --mode guarded --verify "<command>"` executes explicit verification commands through the safe command runner, records each command as runtime evidence, and blocks the run when verification fails.
|
|
369
|
+
- `scale ai-os dashboard` summarizes persisted run reports into ready/blocked counts, guarded verification health, pending evidence, failure learning candidates, and next recommendations.
|
|
370
|
+
- `scale ai-os benchmark` runs fixed beta scenarios and reports context token use, estimated savings, memory recall, skill steps, governance modes, and the current dashboard health snapshot.
|
|
371
|
+
- `scale ai-os migrate` creates or verifies the `.scale/ai-os` runtime directories and writes an idempotent migration report.
|
|
372
|
+
- `scale ai-os doctor --lang zh|en` checks AI OS runtime readiness without mutating the project and blocks adoption when required directories or dashboard health are broken.
|
|
373
|
+
- `scale upgrade check/plan` includes AI OS readiness, so existing projects see migration and doctor steps through the normal upgrade workflow.
|
|
374
|
+
- The beta loop does not yet create PRs or mutate source files; richer skill execution remains the next implementation slice.
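
The guarded-mode behavior described in this list (run explicit verification commands, record each as evidence, block on failure) can be sketched as follows. The `runner` callback stands in for the safe command runner and its signature is an assumption, as is the choice to treat a run with no verification commands as blocked.

```typescript
interface CommandResult {
  exitCode: number;
  output: string;
}

interface EvidenceRecord {
  command: string;
  passed: boolean;
  exitCode: number;
}

interface GuardedRun {
  status: "ready" | "blocked";
  evidence: EvidenceRecord[];
}

// `runner` stands in for the safe command runner; its signature is assumed.
function runGuarded(
  verifyCommands: string[],
  runner: (command: string) => CommandResult,
): GuardedRun {
  // Every command is recorded, including the ones that fail.
  const evidence: EvidenceRecord[] = verifyCommands.map((command) => {
    const result = runner(command);
    return { command, passed: result.exitCode === 0, exitCode: result.exitCode };
  });
  // Assumption: a guarded run with no verification at all is not ready.
  const allPassed = evidence.length > 0 && evidence.every((e) => e.passed);
  return { status: allPassed ? "ready" : "blocked", evidence };
}
```

Injecting the runner keeps the sketch testable without executing real commands; a real implementation would also need allowlisting and timeouts.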
|
|
375
|
+
|
|
376
|
+
Explicitly deferred:
|
|
377
|
+
|
|
378
|
+
- default automatic PR creation or merge without review
|
|
379
|
+
- deep dynamic dependency sandboxing beyond audit, lockfile diff, and high-risk pattern checks
|
|
380
|
+
- full VLM visual judgment beyond screenshot capture and interface placeholders
|
|
381
|
+
- claims of human-level long-term memory or fully autonomous engineering
|
|
382
|
+
|
|
383
|
+
### 6.4 0.29.0: Memory, Context, and Skill Intelligence
|
|
384
|
+
|
|
385
|
+
Theme: make the beta loop measurably smarter rather than only broader.
|
|
386
|
+
|
|
387
|
+
Core work:
|
|
388
|
+
|
|
389
|
+
| Module | Outcome |
| --- | --- |
| Memory Quality Scoring | score recall precision, contradiction risk, accepted memory rate, and stale-memory risk |
| Provider Fallback Policy | choose between gbrain, agentmemory, code memory, local memory, or no memory with an explicit reason |
| Context Compression | summarize low-risk context while preserving high-risk evidence verbatim |
| Skill Strategy Learning | learn preferred tools from successful evidence, failures, and user overrides |
| Workflow Eval Integration | turn benchmark results into release-gate evidence |
|
|
396
|
+
|
|
397
|
+
Exit criteria:
|
|
398
|
+
|
|
399
|
+
- memory recall has acceptance/rejection feedback
|
|
400
|
+
- context packs show savings, omissions, and evidence-loss warnings
|
|
401
|
+
- skill routing decisions can be compared against outcome quality
|
|
402
|
+
- release notes include measured deltas instead of aspirational percentages
|
|
403
|
+
|
|
404
|
+
### 6.5 0.30.0: Enterprise Governance and Upgrade Maturity
|
|
405
|
+
|
|
406
|
+
Theme: deepen adaptive governance beyond the v0.27.0 baseline.
|
|
407
|
+
|
|
408
|
+
Core work:
|
|
409
|
+
|
|
410
|
+
| Module | Outcome |
| --- | --- |
| Adaptive Workflow Router | production policy controls for dynamic gate profiles beyond the v0.27.0 planning output |
| Evaluator Intelligence | critique and uncertainty gates for architecture/root-cause work |
| Tool Strategy Planner | cost, retry, fallback, and evidence graph for tools |
| Evolution Shadow Promotion | lessons become rules only after validation |
|
|
416
|
+
|
|
417
|
+
Exit criteria:
|
|
418
|
+
|
|
419
|
+
- small tasks can stay lightweight with evidence
|
|
420
|
+
- risky tasks escalate automatically
|
|
421
|
+
- reasoning-heavy tasks get critique/evaluator gates
|
|
422
|
+
- evolution proposals can be traced to failure evidence and validation results
|
|
423
|
+
|
|
424
|
+
### 6.6 1.0.0 Beta: AI Engineering OS
|
|
425
|
+
|
|
426
|
+
Theme: integrate governance, memory, context, and tools into an operating layer.
|
|
427
|
+
|
|
428
|
+
Target capabilities:
|
|
429
|
+
|
|
430
|
+
- unified agent workspace policy
|
|
431
|
+
- provider-neutral memory and code intelligence
|
|
432
|
+
- cross-agent execution ledger
|
|
433
|
+
- adaptive workflow templates
|
|
434
|
+
- measurable token and quality reports
|
|
435
|
+
- ecosystem-safe skill and MCP lifecycle governance
|
|
436
|
+
|
|
437
|
+
Release criteria:
|
|
438
|
+
|
|
439
|
+
- install, upgrade, run, dashboard, benchmark, and migration flows work on clean projects
|
|
440
|
+
- at least three representative project types have documented smoke results
|
|
441
|
+
- failure learning produces reviewed rule candidates without silently hardening bad rules
|
|
442
|
+
- bilingual docs explain the core workflow without requiring maintainer context
|
|
443
|
+
- public claims are tied to `WORKFLOW_EVAL`, benchmark output, or release evidence
|
|
444
|
+
|
|
445
|
+
### 6.7 Long-Range Vision: 3-12 Months
|
|
446
|
+
|
|
447
|
+
This is the strategic north star, not the one-week beta promise.
|
|
448
|
+
|
|
449
|
+
| Time horizon | Target state | Evidence required before public claim |
| --- | --- | --- |
| 8-12 weeks | AI Engineering OS beta: usable end-to-end loop across planning, execution, verification, memory, and dashboard | repeatable demo projects and benchmark reports |
| 3-6 months | stable governance runtime: upgrades, adapters, memory providers, and eval gates are reliable in real repositories | release-to-release regression data and field reports |
| 6-12 months | industry-leading agent engineering layer: adaptive workflows, strategy memory, tool intelligence, and cross-agent governance mature together | comparative evals, sustained issue closure, external adoption evidence |
|
|
454
|
+
|
|
455
|
+
Long-range capability themes:
|
|
456
|
+
|
|
457
|
+
- Cognitive memory: working, episodic, semantic, procedural, and strategy memory with explicit source and freshness controls.
|
|
458
|
+
- Adaptive orchestration: workflows selected by risk, ownership, failure history, and tool reliability instead of one fixed path.
|
|
459
|
+
- Tool intelligence: skills, MCP, CLIs, browser automation, and agent adapters treated as governed capabilities with cost, evidence, and fallback policy.
|
|
460
|
+
- Evaluator intelligence: critique loops, uncertainty scoring, adversarial review, and evidence insufficiency verdicts for reasoning-heavy tasks.
|
|
461
|
+
- Governance economics: token cost, gate friction, verification quality, and maintenance overhead measured as first-class product metrics.
|
|
462
|
+
- Ecosystem governance: external skills, memory providers, adapters, and templates integrated through attribution, license, source pinning, and supply-chain checks.
|
|
463
|
+
|
|
464
|
+
Non-negotiable boundary:
|
|
465
|
+
|
|
466
|
+
> The long-range vision can guide architecture, but it must not be used as a release claim until the corresponding evidence exists.
|
|
467
|
+
|
|
468
|
+
## 7. Measurement Plan
|
|
469
|
+
|
|
470
|
+
Strategic claims must be tied to measurement.
|
|
471
|
+
|
|
472
|
+
| Claim | Required metric |
| --- | --- |
| Fewer fake completions | final-check failure rate before/after gates |
| Fewer skipped steps | FSM blocked transition count and successful recovery rate |
| Fewer blind retries | repeated-command detector hits and fix iteration count |
| Lower token use | context pack token count vs baseline full-context load |
| Better memory | recall precision, accepted memory rate, contradiction count |
| Better skill use | recommended skill acceptance rate and evidence completion rate |
| Better workflow quality | pass@1, average fix iterations, failure replay closure rate |
| Safer dependencies | dependency audit block count and reviewed baseline count |
|
|
482
|
+
|
|
483
|
+
Target ranges can be tracked internally, but public claims should use measured values from `WORKFLOW_EVAL`, runtime evidence, or release reports.
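
Two of the metrics in the table are simple enough to sketch directly. The function names and the `TaskOutcome` shape are assumptions, and pass@1 is taken here in its simplest form, the fraction of tasks that pass on the first attempt.

```typescript
// Fraction of tokens saved relative to a full-context baseline load.
function tokenSavingsRatio(packTokens: number, baselineTokens: number): number {
  if (baselineTokens <= 0) {
    throw new RangeError("baselineTokens must be positive");
  }
  return (baselineTokens - packTokens) / baselineTokens;
}

// Hypothetical per-task record taken from eval or run history.
interface TaskOutcome {
  passedFirstTry: boolean;
  fixIterations: number;
}

// Share of tasks that passed without any fix iteration (0 for an empty sample).
function passAt1(outcomes: TaskOutcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter((o) => o.passedFirstTry).length / outcomes.length;
}

// Mean number of fix iterations per task (0 for an empty sample).
function averageFixIterations(outcomes: TaskOutcome[]): number {
  if (outcomes.length === 0) return 0;
  const total = outcomes.reduce((sum, o) => sum + o.fixIterations, 0);
  return total / outcomes.length;
}
```

For a 4000-token pack against a 6200-token baseline, `tokenSavingsRatio` gives 2200 / 6200, roughly 35%. A figure like that is only a release claim once it comes from measured runs.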
|
|
484
|
+
|
|
485
|
+
## 8. Messaging Rules
|
|
486
|
+
|
|
487
|
+
Use:
|
|
488
|
+
|
|
489
|
+
- "Agent Governance Runtime"
|
|
490
|
+
- "AI Engineering OS direction"
|
|
491
|
+
- "runtime constraints instead of prompt-only discipline"
|
|
492
|
+
- "evidence-backed workflow gates"
|
|
493
|
+
- "provider-based memory and context orchestration"
|
|
494
|
+
|
|
495
|
+
Avoid:
|
|
496
|
+
|
|
497
|
+
- "fully autonomous engineer"
|
|
498
|
+
- "guaranteed 90% AI coding rate"
|
|
499
|
+
- "eliminates hallucination"
|
|
500
|
+
- "zero human governance"
|
|
501
|
+
- "universal memory"
|
|
502
|
+
- "all tools are safe by default"
|
|
503
|
+
|
|
504
|
+
The product message should be ambitious, but the engineering message must stay falsifiable.
|
|
505
|
+
|
|
506
|
+
## 9. Non-Goals
|
|
507
|
+
|
|
508
|
+
SCALE should not try to own every layer.
|
|
509
|
+
|
|
510
|
+
Non-goals:
|
|
511
|
+
|
|
512
|
+
- replacing all agent platforms
|
|
513
|
+
- building a full IDE
|
|
514
|
+
- becoming a generic automation shell
|
|
515
|
+
- implementing every memory backend internally
|
|
516
|
+
- copying external skills without attribution
|
|
517
|
+
- turning every task into heavyweight enterprise ceremony
|
|
518
|
+
|
|
519
|
+
The correct posture is:
|
|
520
|
+
|
|
521
|
+
> Govern agent engineering work, integrate external capability providers, and require evidence at the boundaries.
|
|
522
|
+
|
|
523
|
+
## 10. Documentation Placement
|
|
524
|
+
|
|
525
|
+
Recommended documentation split:
|
|
526
|
+
|
|
527
|
+
| Surface | Content |
| --- | --- |
| `README.md` / `README.en.md` | concise positioning, installation, core value, current capabilities |
| `docs/AI_ENGINEERING_OS_POSITIONING.md` | strategic category, gaps, roadmap, messaging rules |
| `docs/CONTEXT_BUDGET.md` | context budget and compiler mechanics |
| `docs/MEMORY_BRAIN.md` / `docs/MEMORY_FABRIC.md` | memory provider and recall behavior |
| `docs/SKILL_RADAR.md` / `docs/TOOL_ORCHESTRATION.md` | skill and tool routing behavior |
| `docs/WORKFLOW_EVAL.md` | measurable evidence and improvement claims |
|
|
535
|
+
|
|
536
|
+
README should not absorb this whole strategy. It should link here and keep the first screen user-focused.
|
|
537
|
+
|
|
538
|
+
## 11. Strategic Summary
|
|
539
|
+
|
|
540
|
+
SCALE's strongest current differentiator is not more prompts. It is a runtime governance model for AI engineering:
|
|
541
|
+
|
|
542
|
+
```text
Agent intent
  -> governed workflow state
  -> scoped context
  -> role/tool policy
  -> evidence-producing execution
  -> verification gates
  -> memory and evolution feedback
```
|
|
551
|
+
|
|
552
|
+
The next stage is to make this runtime more cognitive:
|
|
553
|
+
|
|
554
|
+
- compile context, do not just load it
|
|
555
|
+
- route memory, do not just store it
|
|
556
|
+
- plan skill use, do not just recommend it
|
|
557
|
+
- adapt workflow, do not just enforce one path
|
|
558
|
+
- validate evolution, do not just summarize lessons
|
|
559
|
+
|
|
560
|
+
If these are implemented with measurable evidence, SCALE can credibly move from "AI workflow engine" to "AI Engineering OS".
|
package/docs/CONTEXT_BUDGET.md
CHANGED
|
@@ -43,6 +43,40 @@ scale context pack \
|
|
|
43
43
|
--json
|
|
44
44
|
```
|
|
45
45
|
|
|
46
|
+
Build the unified AI OS runtime plan that embeds the context pack with memory, skill routing, adaptive workflow, and ROI:
|
|
47
|
+
|
|
48
|
+
```bash
|
|
49
|
+
scale ai-os plan \
|
|
50
|
+
--task-id TASK-123 \
|
|
51
|
+
--task "Review frontend route with browser evidence" \
|
|
52
|
+
--level L \
|
|
53
|
+
--files src/routes/upload.tsx \
|
|
54
|
+
--budget 8000 \
|
|
55
|
+
--json
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
The context pack now uses the baseline Context Compiler. Each candidate section is scored by category, task/file relevance, risk level, and budget fit. The JSON output includes compiler metadata so callers can explain why a section was loaded or omitted:
|
|
59
|
+
|
|
60
|
+
```json
|
|
61
|
+
{
|
|
62
|
+
"compiler": {
|
|
63
|
+
"strategy": "relevance-budget-v1",
|
|
64
|
+
"budget": 4000,
|
|
65
|
+
"totalCandidateTokens": 6200,
|
|
66
|
+
"estimatedTokenSavings": 2200,
|
|
67
|
+
"ranking": [
|
|
68
|
+
{
|
|
69
|
+
"id": "runtime-evidence",
|
|
70
|
+
"included": true,
|
|
71
|
+
"score": 292,
|
|
72
|
+
"matchedSignals": ["evidence", "high-risk-evidence"],
|
|
73
|
+
"reason": "Evidence is needed for completion and verification claims."
|
|
74
|
+
}
|
|
75
|
+
]
|
|
76
|
+
}
|
|
77
|
+
}
|
|
78
|
+
```
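As a rough illustration of the score-then-fit-to-budget step described above, the sketch below ranks candidates by a precomputed score and includes them greedily until the budget is used up. The `Candidate` shape, the `compileContext` name, the greedy rule, and the savings formula (total candidate tokens minus included tokens) are assumptions for illustration, not the package's actual `ContextCompiler` API. Only the `runtime-evidence` id and its score of 292 come from the example output; the other sections and all token counts are hypothetical.

```typescript
// Hypothetical shapes; the real ContextCompiler types may differ.
interface Candidate {
  id: string;
  tokens: number; // estimated token cost of the section
  score: number;  // assumed to be precomputed from category, relevance, and risk
}

interface RankedSection {
  id: string;
  score: number;
  included: boolean;
}

interface CompilerResult {
  budget: number;
  totalCandidateTokens: number;
  estimatedTokenSavings: number;
  ranking: RankedSection[];
}

function compileContext(candidates: Candidate[], budget: number): CompilerResult {
  // Highest score first; slice() keeps the caller's array untouched.
  const sorted = candidates.slice().sort((a, b) => b.score - a.score);
  let used = 0;
  let total = 0;
  const ranking: RankedSection[] = [];

  for (const c of sorted) {
    total += c.tokens;
    // Greedy budget fit (assumption): include a section only if it still fits.
    const included = used + c.tokens <= budget;
    if (included) {
      used += c.tokens;
    }
    ranking.push({ id: c.id, score: c.score, included });
  }

  return {
    budget,
    totalCandidateTokens: total,
    // Assumption: savings are the candidate tokens that were not loaded.
    estimatedTokenSavings: total - used,
    ranking,
  };
}

// Hypothetical candidates; only "runtime-evidence" and 292 appear in the docs.
const examplePlan = compileContext(
  [
    { id: "runtime-evidence", tokens: 1500, score: 292 },
    { id: "task-files", tokens: 2500, score: 200 },
    { id: "project-history", tokens: 2200, score: 100 },
  ],
  4000,
);
```

Keeping omitted sections in `ranking` with `included: false` is what lets a caller explain why a section was left out, not just which ones were loaded.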
|
|
79
|
+
|
|
46
80
|
Evaluate progressive governance mode:
|
|
47
81
|
|
|
48
82
|
```bash
|
|
@@ -109,5 +143,13 @@ This is not a replacement for verification. It only decides which governance beh
|
|
|
109
143
|
|
|
110
144
|
## Governance ROI
|
|
111
145
|
|
|
112
|
-
`scale governance roi` reports both benefit and overhead.
|
|
146
|
+
`scale governance roi` reports both benefit and overhead. In v0.27.0, `scale ai-os plan` also attaches ROI modules for:
|
|
147
|
+
|
|
148
|
+
- `context-budget`
|
|
149
|
+
- `context-compiler`
|
|
150
|
+
- `memory-provider-runtime`
|
|
151
|
+
- `skill-routing-engine`
|
|
152
|
+
- `progressive-governance`
|
|
153
|
+
|
|
154
|
+
Early ROI is still estimated from context budget, compiler savings, recall count, skill evidence steps, and risk signals. Later versions should replace estimates with measured eval data such as file reads saved, tool calls saved, fix iterations reduced, and human corrections avoided.
|
|
113
155
|
|
package/docs/DEPENDENCY_AUDIT.md
CHANGED
|
@@ -28,6 +28,28 @@ scale dependency audit --changed-packages left-pad,@scope/tool --json
|
|
|
28
28
|
|
|
29
29
|
The command exits non-zero when the active mode has blocking findings.
|
|
30
30
|
|
|
31
|
+
## Verification Command Safety
|
|
32
|
+
|
|
33
|
+
SCALE verification commands are security-sensitive because they are often run in CI.
|
|
34
|
+
The core verification paths (`verify-task`, phase verification, workflow eval attempts, and gate commands) execute configured commands without shell expansion by default.
|
|
35
|
+
|
|
36
|
+
Allowed by default:
|
|
37
|
+
|
|
38
|
+
```bash
|
|
39
|
+
npm run build
|
|
40
|
+
npm test -- --runInBand
|
|
41
|
+
node scripts/check.js --changed
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
Blocked by default:
|
|
45
|
+
|
|
46
|
+
```bash
|
|
47
|
+
npm test && curl https://example.com
|
|
48
|
+
node scripts/check.js | tee out.txt
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
Shell metacharacters such as `&&`, `|`, `;`, `<`, `>`, backticks, and unquoted `$` are rejected before execution. Use package scripts or checked-in helper scripts for composed commands. `SCALE_ALLOW_SHELL_COMMANDS=1` re-enables shell execution only for trusted local runs and must not be enabled for untrusted PR or user-controlled CI inputs.
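The rejection rule above can be sketched as a small pre-execution check that also splits the command into an argument vector, so it can be handed to a process spawner without a shell. This is a minimal sketch under stated assumptions, not the package's actual `SafeCommandRunner`: the `parseSafeCommand` name and result type are hypothetical, the listed metacharacters are rejected even inside quotes (a conservative choice), and `$` is accepted only inside single or double quotes, where it stays a literal character because no shell ever expands it.

```typescript
// Hypothetical result type; not the package's actual API.
type ParseResult =
  | { ok: true; argv: string[] }
  | { ok: false; reason: string };

// Assumption: these characters are rejected wherever they appear, quoted or not.
const REJECTED_CHARS = "&|;<>`";

function parseSafeCommand(command: string): ParseResult {
  const argv: string[] = [];
  let current = "";
  let quote: string | null = null; // the active quote character, if any
  let inToken = false;

  for (let i = 0; i < command.length; i++) {
    const ch = command.charAt(i);

    if (REJECTED_CHARS.indexOf(ch) !== -1) {
      return { ok: false, reason: `shell metacharacter "${ch}" is not allowed` };
    }
    if (quote !== null) {
      if (ch === quote) {
        quote = null; // closing quote; the token may continue
      } else {
        current += ch;
      }
      continue;
    }
    if (ch === '"' || ch === "'") {
      quote = ch;
      inToken = true; // an empty quoted string still counts as an argument
      continue;
    }
    if (ch === "$") {
      return { ok: false, reason: 'unquoted "$" is not allowed' };
    }
    if (ch === " " || ch === "\t") {
      if (inToken) {
        argv.push(current);
        current = "";
        inToken = false;
      }
      continue;
    }
    current += ch;
    inToken = true;
  }

  if (quote !== null) {
    return { ok: false, reason: "unterminated quote" };
  }
  if (inToken) {
    argv.push(current);
  }
  if (argv.length === 0) {
    return { ok: false, reason: "empty command" };
  }
  return { ok: true, argv };
}
```

With an argument vector in hand, a runner could pass `argv[0]` and the remaining arguments to a spawner with shell execution disabled, which is why composed commands need to live in package scripts or checked-in helper scripts instead.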
|
|
52
|
+
|
|
31
53
|
## G7 Integration
|
|
32
54
|
|
|
33
55
|
`SecurityGate` now emits two first-class evidence sources:
|
|
@@ -82,8 +104,15 @@ The first implementation detects:
|
|
|
82
104
|
- install lifecycle scripts
|
|
83
105
|
- executable bin scripts
|
|
84
106
|
- deprecated packages from lockfile metadata
|
|
107
|
+
- built-in ownership/provenance watchlist matches
|
|
85
108
|
- dynamic code execution: `eval`, `new Function`
|
|
86
109
|
- shell execution patterns
|
|
87
110
|
- suspicious network access patterns
|
|
88
111
|
|
|
112
|
+
The built-in ownership/provenance watchlist currently blocks exact versions that were flagged by external package behavior analysis:
|
|
113
|
+
|
|
114
|
+
- `content-type@2.0.0`
|
|
115
|
+
- `type-is@2.1.0`
|
|
116
|
+
- `type-js@2.1.0` (kept as a defensive alias for reports that use this package name)
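Exact-version matching against such a watchlist might look like the sketch below. The `WATCHLIST` table layout and the `isWatchlisted` helper are hypothetical names for illustration, not the `DependencyAuditor` implementation; the three entries are the ones listed above.

```typescript
// Hypothetical layout: package name -> exact versions that are blocked.
const WATCHLIST: { [name: string]: string[] } = {
  "content-type": ["2.0.0"],
  "type-is": ["2.1.0"],
  "type-js": ["2.1.0"], // defensive alias, as noted in the list above
};

function isWatchlisted(name: string, version: string): boolean {
  // hasOwnProperty avoids accidental hits on inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(WATCHLIST, name)) {
    return false;
  }
  // Exact string match only; no semver ranges are evaluated in this sketch.
  return WATCHLIST[name].indexOf(version) !== -1;
}
```

Matching exact versions keeps the check from blocking other releases of the same package, at the cost of needing a new entry for each flagged version.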
|
|
117
|
+
|
|
89
118
|
Future network-backed checks can add npm registry metadata and `npm audit --json` ingestion, but they should stay optional and evidence-backed.
|
package/docs/MEMORY_FABRIC.md
CHANGED
|
@@ -122,6 +122,7 @@ Commands:
|
|
|
122
122
|
scale memory provider init
|
|
123
123
|
scale memory provider status --json
|
|
124
124
|
scale memory provider recall "OAuth callback Redis state" --json
|
|
125
|
+
scale ai-os plan --task "Fix OAuth callback Redis state" --files src/auth/oauth.ts --json
|
|
125
126
|
```
|
|
126
127
|
|
|
127
128
|
Provider rules:
|
|
@@ -130,5 +131,6 @@ Provider rules:
|
|
|
130
131
|
- External providers are read-only by default. Writes require an explicit provider policy change.
|
|
131
132
|
- `scale-local` remains the fallback provider through Memory Brain and only promotes reviewed, evidence-backed memory.
|
|
132
133
|
- `memory pack` automatically includes a `provider-memory` section when provider recall returns relevant active memories.
|
|
134
|
+
- `ai-os plan` includes both the provider recall summary and the Memory Fabric context pack, so agents can route memory before planning without pretending external memory is always available.
|
|
133
135
|
|
|
134
136
|
This keeps agents flexible: they can ask the router for memory before planning, verification, review, or release, while SCALE still records which provider was used and why fallback was required.
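The fallback behavior described here can be sketched roughly as follows. The `MemoryProvider` interface, the `recallWithFallback` function, and the `fallbackReason` field are hypothetical and assume a synchronous provider that throws when it is unavailable; the real Memory Fabric router may be asynchronous and structured differently.

```typescript
// Hypothetical provider shape for illustration only.
interface MemoryProvider {
  name: string;
  recall(query: string): string[]; // assumed to throw when the provider is unavailable
}

interface RecallResult {
  provider: string;              // which provider actually answered
  memories: string[];
  fallbackReason: string | null; // null when no fallback was needed
}

function recallWithFallback(
  query: string,
  external: MemoryProvider | null,
  local: MemoryProvider,
): RecallResult {
  if (external === null) {
    return {
      provider: local.name,
      memories: local.recall(query),
      fallbackReason: "no external provider configured",
    };
  }
  try {
    return {
      provider: external.name,
      memories: external.recall(query),
      fallbackReason: null,
    };
  } catch (err) {
    // Record why the external provider was skipped instead of hiding the failure.
    const message = err instanceof Error ? err.message : String(err);
    return {
      provider: local.name,
      memories: local.recall(query),
      fallbackReason: message,
    };
  }
}
```

Returning the provider name and the fallback reason alongside the memories mirrors the idea that the plan should show which provider was used and why the local fallback was needed.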
|
package/docs/README.md
CHANGED
|
@@ -36,6 +36,7 @@
|
|
|
36
36
|
| [CODE_INTELLIGENCE.md](CODE_INTELLIGENCE.md) | Code intelligence and exploration ROI with CodeGraph, Graphify, and explicit fallback |
|
|
37
37
|
| [WORKFLOW_EVAL.md](WORKFLOW_EVAL.md) | Workflow Eval, pass@k metrics, Failure Replay, and improvement candidates |
|
|
38
38
|
| [SKILL_RADAR.md](SKILL_RADAR.md) | Skill Radar, capability confidence, evidence requirements, and supply-chain security checks |
|
|
39
|
+
| [AI_ENGINEERING_OS_POSITIONING.md](AI_ENGINEERING_OS_POSITIONING.md) | Agent Governance Runtime / AI Engineering OS direction, the `scale ai-os plan/run/dashboard/benchmark/migrate/doctor` runtime entry points, the one-week beta loop, and the 3-12 month long-term roadmap |
|
|
39
40
|
| [THIRD_PARTY_SKILLS.md](THIRD_PARTY_SKILLS.md) | Third-party skill acknowledgements, licensing boundaries, citation practices, and vendoring policy |
|
|
40
41
|
| [EXTERNAL_REFERENCES.md](EXTERNAL_REFERENCES.md) | Complete list of referenced external projects, skills, MCP, CLI, and adapters |
|
|
41
42
|
| [UPGRADE_MANAGEMENT.md](UPGRADE_MANAGEMENT.md) | Safe upgrade process for the SCALE CLI, governance pack, skills, MCP, and CLI tools |
|