stable-harness 0.0.7 → 0.0.9
- package/README.md +10 -0
- package/docs/0.1.0-p0-runtime-control-plane-plan.zh.md +171 -0
- package/docs/0.1.0-retry-policy.zh.md +87 -0
- package/docs/0.1.0-stable-runtime-development-roadmap.zh.md +393 -0
- package/docs/0.1.0-tool-guard-benchmark.zh.md +42 -0
- package/docs/adapter-contract.md +199 -0
- package/docs/architecture/backend-comparison.md +41 -0
- package/docs/architecture/runtime-events.md +263 -0
- package/docs/architecture/runtime-events.zh.md +248 -0
- package/docs/architecture/system-architecture.zh.md +435 -0
- package/docs/compatibility-matrix.md +139 -0
- package/docs/engineering-rules.md +111 -0
- package/docs/evaluation/0.1.0-bfcl-targeted-model-matrix.zh.md +1632 -0
- package/docs/evaluation/0.1.0-bfcl-targeted-review-matrix.zh.md +1952 -0
- package/docs/evaluation/0.1.0-bfcl-tool-guard.zh.md +1427 -0
- package/docs/granite-tool-calling-comparison.zh.md +206 -0
- package/docs/guides/getting-started.md +126 -0
- package/docs/guides/index.md +40 -0
- package/docs/guides/integration-guide.md +126 -0
- package/docs/guides/operator-runbook.md +153 -0
- package/docs/guides/workspace-authoring.md +212 -0
- package/docs/implementation-blueprint.md +233 -0
- package/docs/memory/0.1.0-memory-design.zh.md +719 -0
- package/docs/memory/0.1.0-step-09-deepagents-native-memory.zh.md +146 -0
- package/docs/memory/0.1.0-step-09-langmem-shaped-provider.zh.md +169 -0
- package/docs/memory/0.1.0-step-09-memory-adapter-projection.zh.md +123 -0
- package/docs/memory/0.1.0-step-09-memory-contract.zh.md +169 -0
- package/docs/memory/0.1.0-step-09-memory-governance-approval.zh.md +143 -0
- package/docs/memory/0.1.0-step-09-memory-lifecycle-hooks.zh.md +150 -0
- package/docs/memory/0.1.0-step-09-memory-maintenance-boundary.zh.md +118 -0
- package/docs/memory/0.1.0-step-09-memory-persistence-boundary.zh.md +118 -0
- package/docs/product/adoption-playbook.md +145 -0
- package/docs/product/market-positioning.md +137 -0
- package/docs/product-boundary.md +258 -0
- package/docs/protocols/http-runtime.md +37 -0
- package/docs/protocols/langgraph-compatible.md +107 -0
- package/docs/protocols/openai-compatible.md +121 -0
- package/docs/tooling/0.1.0-bettercall-tool-quality.zh.md +231 -0
- package/package.json +3 -1
package/README.md
CHANGED
|
@@ -206,6 +206,16 @@ This is constrained repair, not silent magic:
|
|
|
206
206
|
- LangGraph-compatible facade: [docs/protocols/langgraph-compatible.md](docs/protocols/langgraph-compatible.md)
|
|
207
207
|
- HTTP runtime protocol: [docs/protocols/http-runtime.md](docs/protocols/http-runtime.md)
|
|
208
208
|
|
|
209
|
+
## Documentation
|
|
210
|
+
|
|
211
|
+
- [Documentation index](docs/guides/index.md)
|
|
212
|
+
- [Getting started](docs/guides/getting-started.md)
|
|
213
|
+
- [Workspace authoring](docs/guides/workspace-authoring.md)
|
|
214
|
+
- [Integration guide](docs/guides/integration-guide.md)
|
|
215
|
+
- [Operator runbook](docs/guides/operator-runbook.md)
|
|
216
|
+
- [Adoption playbook](docs/product/adoption-playbook.md)
|
|
217
|
+
- [Market positioning](docs/product/market-positioning.md)
|
|
218
|
+
|
|
209
219
|
## Product Boundary
|
|
210
220
|
|
|
211
221
|
Read these before adding public runtime behavior:
|
|
@@ -0,0 +1,171 @@
|
|
|
1
|
+
# stable-harness P0 Runtime Control Plane Plan
|
|
2
|
+
|
|
3
|
+
## Goal
|
|
4
|
+
|
|
5
|
+
`stable-harness` must become the product runtime and operator control plane for
|
|
6
|
+
agent workspaces. It must not recreate upstream agent execution semantics.
|
|
7
|
+
|
|
8
|
+
This plan closes the highest-priority gaps against `agent-harness` by adding
|
|
9
|
+
typed stable runtime capabilities that are independently pluggable, testable,
|
|
10
|
+
and replaceable.
|
|
11
|
+
|
|
12
|
+
## Non-Goals
|
|
13
|
+
|
|
14
|
+
- Do not copy `agent-harness` internal module layout.
|
|
15
|
+
- Do not add EasyNet-specific routing, prompt, ticker, news, Kubernetes, QA, or
|
|
16
|
+
release heuristics to core runtime.
|
|
17
|
+
- Do not parse TODO text to create tool calls.
|
|
18
|
+
- Do not recreate DeepAgents planning, delegation, virtual filesystem, or
|
|
19
|
+
upstream tool-call semantics.
|
|
20
|
+
- Do not make compatibility code the native design target.
|
|
21
|
+
|
|
22
|
+
## Capability Classification
|
|
23
|
+
|
|
24
|
+
Useful `agent-harness` behavior must be reclassified before implementation:
|
|
25
|
+
|
|
26
|
+
- Upstream execution semantics: expose through backend adapter passthrough.
|
|
27
|
+
- Runtime/control-plane semantics: implement as stable typed capability.
|
|
28
|
+
- Downstream application logic: keep in workspace config, tools, or prompts.
|
|
29
|
+
- Historical workaround: keep in explicit compat path or delete.
|
|
30
|
+
|
|
31
|
+
## Milestone P0-1: Durable Runtime Stores
|
|
32
|
+
|
|
33
|
+
### Scope
|
|
34
|
+
|
|
35
|
+
Add the minimal stable persistence boundary for runtime records:
|
|
36
|
+
|
|
37
|
+
- `RuntimeStore` interface.
|
|
38
|
+
- In-memory implementation.
|
|
39
|
+
- JSON file implementation for local durable development.
|
|
40
|
+
- Runtime integration through dependency injection.
|
|
41
|
+
- Store-backed run, event, artifact, and state updates.
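The scope above can be sketched as follows. This is a minimal illustration, not the package's actual API: `RuntimeEvent`, the two `RuntimeStore` methods, and `SketchRuntime` are hypothetical names chosen for the example. It shows a store interface, an in-memory implementation, and a runtime that receives its store by injection and falls back to memory when none is given.

```typescript
// Hypothetical record and store shapes; names are illustrative only.
interface RuntimeEvent {
  requestId: string;
  type: string;
  at: number; // epoch milliseconds
}

interface RuntimeStore {
  appendEvent(event: RuntimeEvent): void;
  listEvents(requestId: string): RuntimeEvent[];
}

// In-memory implementation: records live only as long as the process does.
class InMemoryRuntimeStore implements RuntimeStore {
  private readonly events: RuntimeEvent[] = [];

  appendEvent(event: RuntimeEvent): void {
    this.events.push({ ...event });
  }

  listEvents(requestId: string): RuntimeEvent[] {
    return this.events.filter((e) => e.requestId === requestId);
  }
}

// The runtime depends on the interface only; the store is injected.
class SketchRuntime {
  private readonly store: RuntimeStore;

  constructor(store?: RuntimeStore) {
    // Default keeps behavior unchanged when no store is provided.
    this.store = store ?? new InMemoryRuntimeStore();
  }

  run(requestId: string): RuntimeEvent[] {
    this.store.appendEvent({ requestId, type: "request.started", at: Date.now() });
    this.store.appendEvent({ requestId, type: "request.completed", at: Date.now() });
    return this.store.listEvents(requestId);
  }
}
```

A JSON file store would implement the same interface, which is what would let a recreated runtime read back earlier records.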
|
|
42
|
+
|
|
43
|
+
### Acceptance
|
|
44
|
+
|
|
45
|
+
- Existing runtime behavior is unchanged when no store is provided.
|
|
46
|
+
- A file-backed runtime can survive runtime recreation.
|
|
47
|
+
- Events are appended through the store.
|
|
48
|
+
- Artifacts remain request-scoped.
|
|
49
|
+
- No source file exceeds project size limits.
|
|
50
|
+
|
|
51
|
+
### Tests
|
|
52
|
+
|
|
53
|
+
- Unit tests for in-memory store.
|
|
54
|
+
- Unit tests for JSON file store reload.
|
|
55
|
+
- Runtime request test with injected file store.
|
|
56
|
+
- `npm run check`.
|
|
57
|
+
- `npm run check:rules`.
|
|
58
|
+
- `npm test`.
|
|
59
|
+
|
|
60
|
+
## Milestone P0-2: Operator Inspection Contract
|
|
61
|
+
|
|
62
|
+
### Scope
|
|
63
|
+
|
|
64
|
+
Extend native inspection without copying legacy structures:
|
|
65
|
+
|
|
66
|
+
- Session summary.
|
|
67
|
+
- Request summary.
|
|
68
|
+
- Request detail.
|
|
69
|
+
- Runtime snapshot for workspace/agent/model/tool binding.
|
|
70
|
+
- Event timeline projection.
|
|
71
|
+
- Artifact listing projection.
|
|
72
|
+
|
|
73
|
+
### Acceptance
|
|
74
|
+
|
|
75
|
+
- Operators can inspect a request without reading raw run internals.
|
|
76
|
+
- Protocol client can fetch request/session projections.
|
|
77
|
+
- Existing `inspect()` remains backward-compatible.
|
|
78
|
+
|
|
79
|
+
### Tests
|
|
80
|
+
|
|
81
|
+
- Runtime inspection unit tests.
|
|
82
|
+
- HTTP/in-process protocol tests.
|
|
83
|
+
- Existing runtime and trace tests.
|
|
84
|
+
|
|
85
|
+
## Milestone P0-3: Queue and Recovery Core
|
|
86
|
+
|
|
87
|
+
### Scope
|
|
88
|
+
|
|
89
|
+
Add request lifecycle primitives:
|
|
90
|
+
|
|
91
|
+
- Durable queue record.
|
|
92
|
+
- Priority and queue key.
|
|
93
|
+
- Claim/lease/heartbeat.
|
|
94
|
+
- Cancel intent.
|
|
95
|
+
- Stuck request detection.
|
|
96
|
+
- Recovery intent record.
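A rough sketch of the claim/lease/heartbeat primitives listed above, under stated assumptions: `QueueRecord` and its fields, `claimNext`, `heartbeat`, and `isStuck` are hypothetical names, and time is passed in explicitly so the logic stays deterministic and testable.

```typescript
// Hypothetical queue record; field names are assumptions for illustration.
interface QueueRecord {
  requestId: string;
  priority: number;
  leaseOwner?: string;
  leaseExpiresAt?: number; // epoch milliseconds
  cancelRequested: boolean;
}

// Claim the highest-priority record whose lease is missing or expired.
function claimNext(
  queue: QueueRecord[],
  owner: string,
  now: number,
  leaseMs: number,
): QueueRecord | undefined {
  const candidates = queue
    .filter((r) => !r.cancelRequested)
    .filter((r) => r.leaseExpiresAt === undefined || r.leaseExpiresAt <= now)
    .sort((a, b) => b.priority - a.priority);
  const next = candidates[0];
  if (next === undefined) return undefined;
  next.leaseOwner = owner;
  next.leaseExpiresAt = now + leaseMs;
  return next;
}

// A heartbeat extends the lease, but only for the current owner.
function heartbeat(record: QueueRecord, owner: string, now: number, leaseMs: number): boolean {
  if (record.leaseOwner !== owner) return false;
  record.leaseExpiresAt = now + leaseMs;
  return true;
}

// A request counts as stuck when it was claimed and its lease has lapsed.
function isStuck(record: QueueRecord, now: number): boolean {
  return (
    record.leaseOwner !== undefined &&
    record.leaseExpiresAt !== undefined &&
    record.leaseExpiresAt <= now
  );
}
```

Every decision here reads typed record fields only, which matches the acceptance point that recovery comes from runtime state rather than prompt text.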
|
|
97
|
+
|
|
98
|
+
### Acceptance
|
|
99
|
+
|
|
100
|
+
- Queue logic is typed and domain-neutral.
|
|
101
|
+
- Recovery decisions come from runtime state, not prompt text.
|
|
102
|
+
- Running requests can be inspected and reconciled after restart.
|
|
103
|
+
|
|
104
|
+
### Tests
|
|
105
|
+
|
|
106
|
+
- Queue lease and expiration tests.
|
|
107
|
+
- Heartbeat/stuck detection tests.
|
|
108
|
+
- Cancel intent tests.
|
|
109
|
+
- Recovery intent persistence tests.
|
|
110
|
+
|
|
111
|
+
## Milestone P0-4: Artifacts, Evidence, and Replay Bundles
|
|
112
|
+
|
|
113
|
+
### Scope
|
|
114
|
+
|
|
115
|
+
Make evidence first-class:
|
|
116
|
+
|
|
117
|
+
- Durable artifact content store.
|
|
118
|
+
- Artifact listing and read APIs.
|
|
119
|
+
- Evaluation bundle export.
|
|
120
|
+
- Replay bundle validation.
|
|
121
|
+
- Hash/metadata checks for stored artifacts.
|
|
122
|
+
|
|
123
|
+
### Acceptance
|
|
124
|
+
|
|
125
|
+
- Request evidence can be exported from runtime records alone.
|
|
126
|
+
- Replay bundle contains runs, events, artifacts, and runtime metadata.
|
|
127
|
+
- Artifact APIs are independent from backend adapters.
|
|
128
|
+
|
|
129
|
+
### Tests
|
|
130
|
+
|
|
131
|
+
- Artifact create/list/read tests.
|
|
132
|
+
- Evaluation bundle export tests.
|
|
133
|
+
- Replay validation tests.
|
|
134
|
+
- Existing EasyNet native stable tests.
|
|
135
|
+
|
|
136
|
+
## Milestone P0-5: Native NL Migration Gate
|
|
137
|
+
|
|
138
|
+
### Scope
|
|
139
|
+
|
|
140
|
+
Move natural-language execution toward native stable runtime:
|
|
141
|
+
|
|
142
|
+
- Keep DeepAgents as upstream execution source of truth.
|
|
143
|
+
- Preserve trace/evidence visibility through stable events.
|
|
144
|
+
- Keep compatibility facade explicit.
|
|
145
|
+
- Remove native dependence on compatibility runner when upstream behavior is
|
|
146
|
+
stable enough.
|
|
147
|
+
|
|
148
|
+
### Acceptance
|
|
149
|
+
|
|
150
|
+
- EasyNet typed native gates pass.
|
|
151
|
+
- EasyNet natural-language matrix has a native stable path or documented
|
|
152
|
+
upstream blocker.
|
|
153
|
+
- No downstream-specific runtime heuristics are added.
|
|
154
|
+
|
|
155
|
+
### Tests
|
|
156
|
+
|
|
157
|
+
- Stable native CLI tests.
|
|
158
|
+
- EasyNet native stable tests.
|
|
159
|
+
- EasyNet full matrix as migration gate.
|
|
160
|
+
|
|
161
|
+
## Execution Rules
|
|
162
|
+
|
|
163
|
+
Each milestone must follow this loop:
|
|
164
|
+
|
|
165
|
+
1. Inspect existing code and tests.
|
|
166
|
+
2. Implement the smallest typed stable capability.
|
|
167
|
+
3. Add focused tests.
|
|
168
|
+
4. Run `npm run check`.
|
|
169
|
+
5. Run `npm run check:rules`.
|
|
170
|
+
6. Run `npm test`.
|
|
171
|
+
7. Only then continue to the next milestone.
|
|
@@ -0,0 +1,87 @@
|
|
|
1
|
+
# Retry Policy
|
|
2
|
+
|
|
3
|
+
`stable-harness` retry covers production stability only; it does not take over DeepAgents execution semantics.
|
|
4
|
+
|
|
5
|
+
## Boundaries
|
|
6
|
+
|
|
7
|
+
| Scenario | Owner | Behavior |
| --- | --- | --- |
| Model API timeout, rate limit, 5xx | runtime policy | Retry the same model call through LangChain `modelRetryMiddleware` |
| Tool backend timeout, transient network failure | runtime policy | Retry the same tool call through LangChain `toolRetryMiddleware` |
| Wrong tool argument type, wrong enum value, semantic error | tool gateway guard | Return `ToolMessage(status="error")` so the model can repair the call on its next turn |
| Whether the model calls the tool again after seeing the ToolMessage | DeepAgents/LangChain agent loop | No deterministic guarantee; needs a benchmark |
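The guard row in the table can be illustrated with a small sketch. Everything here is an assumption made for the example: `ArgSpec`, `GuardResult`, and `guardToolArgs` are hypothetical, and `GuardResult` only imitates the shape of an error-status tool message; it is not the LangChain `ToolMessage` class. The point is that validation failures become readable error content returned to the model, not exceptions and not retries.

```typescript
// Hypothetical schema and result shapes, for illustration only.
type ArgType = "string" | "number" | "boolean";

interface ArgSpec {
  type: ArgType;
  required: boolean;
  enum?: readonly string[];
}

interface GuardResult {
  status: "ok" | "error";
  content: string;
}

function guardToolArgs(
  schema: Record<string, ArgSpec>,
  args: Record<string, unknown>,
): GuardResult {
  const problems: string[] = [];
  for (const [name, spec] of Object.entries(schema)) {
    const value = args[name];
    if (value === undefined) {
      if (spec.required) problems.push(`missing required argument "${name}"`);
      continue;
    }
    if (typeof value !== spec.type) {
      problems.push(`argument "${name}" must be ${spec.type}, got ${typeof value}`);
    } else if (spec.enum !== undefined && !spec.enum.includes(String(value))) {
      problems.push(`argument "${name}" must be one of: ${spec.enum.join(", ")}`);
    }
  }
  // Arguments the schema does not declare are reported, not silently dropped.
  for (const name of Object.keys(args)) {
    if (!(name in schema)) problems.push(`unexpected argument "${name}"`);
  }
  return problems.length === 0
    ? { status: "ok", content: "arguments accepted" }
    : { status: "error", content: problems.join("; ") };
}
```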
|
|
13
|
+
|
|
14
|
+
## YAML
|
|
15
|
+
|
|
16
|
+
```yaml
apiVersion: stable-harness.dev/v1
kind: Runtime
metadata:
  name: production
spec:
  routing:
    defaultAgentId: orchestra
  retry:
    model:
      enabled: true
      maxRetries: 3
      initialDelayMs: 1000
      backoffFactor: 2
      jitter: true
      onFailure: continue
    tools:
      enabled: true
      tools:
        - web_search
        - fetch_url
      retryOn:
        - timeout
        - network
        - rateLimit
        - serverError
      maxRetries: 2
      initialDelayMs: 500
      backoffFactor: 2
      jitter: true
      onFailure: continue
```
|
|
48
|
+
|
|
49
|
+
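`retryOn` is a string whitelist owned by the stable runtime; it does not accept JavaScript functions. The default is equivalent to `timeout/network/rateLimit/serverError`. For `tools`, list only external I/O tools that can hit transient failures. Do not put schema or argument repair into the retry policy; argument repair is handled by `ToolArgumentGuard`, which produces a readable error and hands it back to the model.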
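To make the policy fields concrete, here is a minimal sketch of how such a policy could be interpreted. It is an assumption-laden illustration, not the LangChain middleware's implementation: `RetryPolicy`, `retryDelayMs`, and `shouldRetry` are hypothetical helpers, and the jitter range in particular is invented for the example.

```typescript
// Hypothetical policy shape mirroring the YAML field names.
interface RetryPolicy {
  maxRetries: number;
  initialDelayMs: number;
  backoffFactor: number;
  jitter: boolean;
  retryOn: readonly string[];
}

// Delay before retry number `attempt` (0-based):
//   initialDelayMs * backoffFactor^attempt
// With jitter, the delay is scaled by a factor in [0.5, 1.5); that range is
// an assumption of this sketch, not taken from the upstream middleware.
function retryDelayMs(
  policy: RetryPolicy,
  attempt: number,
  random: () => number = Math.random,
): number {
  const base = policy.initialDelayMs * Math.pow(policy.backoffFactor, attempt);
  return policy.jitter ? Math.round(base * (0.5 + random())) : base;
}

// Retry only whitelisted error kinds, and only while attempts remain.
function shouldRetry(policy: RetryPolicy, errorKind: string, attempt: number): boolean {
  return attempt < policy.maxRetries && policy.retryOn.includes(errorKind);
}
```

With the model settings from the YAML and jitter disabled, the three retries would wait 1000 ms, 2000 ms, and 4000 ms. An argument-validation error is not in the whitelist, so it is never retried and goes back to the model instead.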
|
|
50
|
+
|
|
51
|
+
## Sequence
|
|
52
|
+
|
|
53
|
+
```mermaid
|
|
54
|
+
sequenceDiagram
|
|
55
|
+
participant User
|
|
56
|
+
participant Runtime as stable-harness runtime
|
|
57
|
+
participant Adapter as DeepAgents adapter
|
|
58
|
+
participant LC as LangChain retry middleware
|
|
59
|
+
participant DA as DeepAgents agent loop
|
|
60
|
+
participant Tool as Tool gateway
|
|
61
|
+
participant Model as LLM
|
|
62
|
+
|
|
63
|
+
User->>Runtime: request
|
|
64
|
+
Runtime->>Adapter: run with retry policy
|
|
65
|
+
Adapter->>LC: install model/tool retry middleware
|
|
66
|
+
Adapter->>DA: createDeepAgent(params)
|
|
67
|
+
DA->>Model: model call
|
|
68
|
+
alt transient model failure
|
|
69
|
+
LC->>Model: retry model call
|
|
70
|
+
end
|
|
71
|
+
Model->>DA: tool call
|
|
72
|
+
DA->>Tool: invoke
|
|
73
|
+
alt transient tool failure
|
|
74
|
+
LC->>Tool: retry same tool call
|
|
75
|
+
else argument validation failure
|
|
76
|
+
Tool-->>DA: ToolMessage(status="error")
|
|
77
|
+
DA->>Model: feed error observation
|
|
78
|
+
Model->>DA: maybe corrected tool call
|
|
79
|
+
end
|
|
80
|
+
DA-->>Runtime: final result
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
## Current Tests
|
|
84
|
+
|
|
85
|
+
- `DeepAgents retry policy is translated to upstream middleware`: confirms that the adapter translates `Runtime.spec.retry` into the official LangChain middleware.
|
|
86
|
+
- `DeepAgents tool retry policy retries transient gateway failures`: real DeepAgents plus a fake tool-calling model; the gateway tool throws a transient error on the first call and the retry succeeds on the second.
|
|
87
|
+
- `npm run benchmark:retry-policy`: compares success rate, attempts, and elapsed time with the retry policy on and off.
|
|
@@ -0,0 +1,393 @@
|
|
|
1
|
+
# Stable Harness Runtime Refactoring Roadmap
|
|
2
|
+
|
|
3
|
+
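This document defines the step-by-step development checklist for the upcoming `stable-harness` refactoring. The goal is to build `stable-harness` into a clean runtime / operator control plane, not to replicate `agent-harness` or to write EasyNet business rules into the runtime.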
|
|
4
|
+
|
|
5
|
+
## Hard Execution Rules
|
|
6
|
+
|
|
7
|
+
- After each development step, the full real EasyNet validation must be run.
|
|
8
|
+
- EasyNet validation must connect to a real model and run the real workspace, real tools, and real data paths.
|
|
9
|
+
- After each development step, a new Chinese-language report must be added under `docs/`.
|
|
10
|
+
- Each step report must include:
|
|
11
|
+
- Refactoring goal
|
|
12
|
+
- Scope of code changes
|
|
13
|
+
- Runtime boundary assessment
|
|
14
|
+
- EasyNet real-test commands and results
|
|
15
|
+
- Failures, retries, risks, and open issues
|
|
16
|
+
- sequence diagram
|
|
17
|
+
- flow chart
|
|
18
|
+
- Writing EasyNet business rules into the `stable-harness` runtime just to make an EasyNet case pass is not allowed.
|
|
19
|
+
- `runtime/compat` and `compat/*` may serve only as migration paths; they must not carry new native runtime capabilities.
|
|
20
|
+
|
|
21
|
+
## Common Verification Gate for Every Step
|
|
22
|
+
|
|
23
|
+
Before finishing each step, run at least:
|
|
24
|
+
|
|
25
|
+
```bash
|
|
26
|
+
npm run check
|
|
27
|
+
npm run check:rules
|
|
28
|
+
npm test
|
|
29
|
+
```
|
|
30
|
+
|
|
31
|
+
In EasyNet, run:
|
|
32
|
+
|
|
33
|
+
```bash
|
|
34
|
+
npm test
|
|
35
|
+
npm run test:botbotgo:full
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
When needed, add targeted filtered reruns:
|
|
39
|
+
|
|
40
|
+
```bash
|
|
41
|
+
EASYNET_FULL_MATRIX_FILTER=<case_id> npm run test:botbotgo:full
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
The test record must state:
|
|
45
|
+
|
|
46
|
+
- Workspace used: `/Users/boqiangliang/project/easynet`
|
|
47
|
+
- Runtime used: `stable-harness -> file:../stable-harness`
|
|
48
|
+
- Whether a real model was connected
|
|
49
|
+
- Whether real tools were executed
|
|
50
|
+
- Whether there were untracked files, external service failures, model variance, or Kubernetes environment limits
|
|
51
|
+
|
|
52
|
+
## Development Step Checklist
|
|
53
|
+
|
|
54
|
+
### 1. RunStore / EventStore / ArtifactStore
|
|
55
|
+
|
|
56
|
+
Goal:
|
|
57
|
+
|
|
58
|
+
- Split the in-memory `Map` currently inside the core runtime into replaceable store interfaces.
|
|
59
|
+
- The runtime depends only on the store contract and is not bound directly to the in-memory implementation.
|
|
60
|
+
|
|
61
|
+
Deliverables:
|
|
62
|
+
|
|
63
|
+
- `RunStore`
|
|
64
|
+
- `EventStore`
|
|
65
|
+
- `ArtifactStore`
|
|
66
|
+
- in-memory store implementation
|
|
67
|
+
- core runtime uses the stores
|
|
68
|
+
- store-focused tests
|
|
69
|
+
|
|
70
|
+
Prohibited:
|
|
71
|
+
|
|
72
|
+
- Do not introduce SQLite as the default implementation in the first step.
|
|
73
|
+
- Do not copy the `agent-harness` persistence structure.
|
|
74
|
+
|
|
75
|
+
Acceptance:
|
|
76
|
+
|
|
77
|
+
- Runtime inspection can still return run/event/artifact.
|
|
78
|
+
- The full real EasyNet test passes.
|
|
79
|
+
|
|
80
|
+
### 2. Runtime Event Model Standardization
|
|
81
|
+
|
|
82
|
+
Goal:
|
|
83
|
+
|
|
84
|
+
- Unify tool events, delegation events, approval events, and artifact events into a stable event envelope.
|
|
85
|
+
- CLI and protocols consume the unified events rather than the compat runner's ad hoc deltas.
|
|
86
|
+
|
|
87
|
+
Deliverables:
|
|
88
|
+
|
|
89
|
+
- `RuntimeEventEnvelope`
|
|
90
|
+
- typed event payloads
|
|
91
|
+
- event projection helpers
|
|
92
|
+
- event store append/read tests
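One way to picture the deliverables above is a discriminated union for payloads wrapped in a common envelope. This is a sketch under stated assumptions: only the name `RuntimeEventEnvelope` comes from the roadmap; its fields, the payload kinds, and `describeEvent` are hypothetical.

```typescript
// Hypothetical typed payloads; the `kind` strings are illustrative.
type RuntimeEventPayload =
  | { kind: "tool.started"; toolName: string }
  | { kind: "tool.result"; toolName: string; ok: boolean }
  | { kind: "approval.requested"; reason: string }
  | { kind: "artifact.created"; artifactId: string };

// Common envelope: every event carries the same routing metadata.
interface RuntimeEventEnvelope {
  requestId: string;
  sequence: number;
  timestamp: string; // ISO 8601
  payload: RuntimeEventPayload;
}

// A projection helper: consumers switch on `kind` and get typed fields,
// so a CLI or protocol layer never inspects backend-specific deltas.
function describeEvent(event: RuntimeEventEnvelope): string {
  const p = event.payload;
  switch (p.kind) {
    case "tool.started":
      return `#${event.sequence} tool ${p.toolName} started`;
    case "tool.result":
      return `#${event.sequence} tool ${p.toolName} ${p.ok ? "succeeded" : "failed"}`;
    case "approval.requested":
      return `#${event.sequence} approval requested: ${p.reason}`;
    case "artifact.created":
      return `#${event.sequence} artifact ${p.artifactId} created`;
  }
}
```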
|
|
93
|
+
|
|
94
|
+
Prohibited:
|
|
95
|
+
|
|
96
|
+
- Do not parse TODO text to generate events.
|
|
97
|
+
- Do not write special cases for EasyNet specialist events.
|
|
98
|
+
|
|
99
|
+
Acceptance:
|
|
100
|
+
|
|
101
|
+
- EasyNet CLI trace still shows delegation, tool start/result, and TODO trace.
|
|
102
|
+
- The full real test passes.
|
|
103
|
+
|
|
104
|
+
### 3. Wire ToolGateway into the Runtime
|
|
105
|
+
|
|
106
|
+
Goal:
|
|
107
|
+
|
|
108
|
+
- Move native tool execution from direct invocation to `@stable-harness/tool-gateway`.
|
|
109
|
+
- The compat runner keeps direct invocation only during the migration phase.
|
|
110
|
+
|
|
111
|
+
Deliverables:
|
|
112
|
+
|
|
113
|
+
- runtime-level tool gateway injection
|
|
114
|
+
- tool invocation context
|
|
115
|
+
- tool start/result/error events
|
|
116
|
+
- schema validation hook
|
|
117
|
+
- focused gateway tests
|
|
118
|
+
|
|
119
|
+
Prohibited:
|
|
120
|
+
|
|
121
|
+
- Do not perform natural-language tool selection inside the gateway.
|
|
122
|
+
- Do not write finance/k8s/git/qa or similar tool rules into the gateway.
|
|
123
|
+
|
|
124
|
+
Acceptance:
|
|
125
|
+
|
|
126
|
+
- All EasyNet specialist tools still execute through real tools.
|
|
127
|
+
- The full real test passes.
|
|
128
|
+
|
|
129
|
+
### 4. Governance / Approval Queue
|
|
130
|
+
|
|
131
|
+
Goal:
|
|
132
|
+
|
|
133
|
+
- Turn approval, sandbox, and resource limit into independent runtime capabilities.
|
|
134
|
+
- Before a tool executes, a typed policy can route the call into the approval queue.
|
|
135
|
+
|
|
136
|
+
Deliverables:
|
|
137
|
+
|
|
138
|
+
- `ApprovalQueue`
|
|
139
|
+
- `GovernanceDecision`
|
|
140
|
+
- policy evaluation events
|
|
141
|
+
- allow/deny/resume lifecycle
|
|
142
|
+
- tests for allow / require approval / deny
|
|
143
|
+
|
|
144
|
+
Prohibited:
|
|
145
|
+
|
|
146
|
+
- Do not use prompt text to decide whether approval is required.
|
|
147
|
+
- Do not let the adapter implement its own approval lifecycle.
|
|
148
|
+
|
|
149
|
+
Acceptance:
|
|
150
|
+
|
|
151
|
+
- EasyNet has no approval blocking by default, and the full real test passes.
|
|
152
|
+
- New approval tests prove the capability can be disabled and replaced.
|
|
153
|
+
|
|
154
|
+
### 5. DeepAgents Native Path Alignment
|
|
155
|
+
|
|
156
|
+
Goal:
|
|
157
|
+
|
|
158
|
+
- Make the native DeepAgents adapter the default design target.
|
|
159
|
+
- Pass `createDeepAgent` parameters through to upstream primitives wherever possible.
|
|
160
|
+
|
|
161
|
+
Deliverables:
|
|
162
|
+
|
|
163
|
+
- model config passthrough
|
|
164
|
+
- tools passthrough or gateway bridge
|
|
165
|
+
- subagents passthrough
|
|
166
|
+
- memory / skills passthrough
|
|
167
|
+
- upstream event normalization
|
|
168
|
+
- DeepAgents capability audit report
|
|
169
|
+
|
|
170
|
+
Prohibited:
|
|
171
|
+
|
|
172
|
+
- Do not rebuild the DeepAgents middleware stack.
|
|
173
|
+
- Do not implement a second subagent planning language.
|
|
174
|
+
- Do not replay upstream custom tool calls.
|
|
175
|
+
|
|
176
|
+
Acceptance:
|
|
177
|
+
|
|
178
|
+
- Native path tests pass.
|
|
179
|
+
- The EasyNet migration path still passes in full.
|
|
180
|
+
|
|
181
|
+
### 6. Compat Runner Reduction
|
|
182
|
+
|
|
183
|
+
Goal:
|
|
184
|
+
|
|
185
|
+
- Keep `runtime/compat` migration-only.
|
|
186
|
+
- Move whatever can be migrated into a native capability or upstream passthrough.
|
|
187
|
+
|
|
188
|
+
Deliverables:
|
|
189
|
+
|
|
190
|
+
- compat usage inventory
|
|
191
|
+
- migration blockers list
|
|
192
|
+
- compat-only behavior tags
|
|
193
|
+
- removal plan
|
|
194
|
+
|
|
195
|
+
Prohibited:
|
|
196
|
+
|
|
197
|
+
- Do not add native runtime features to the compat runner.
|
|
198
|
+
- Do not expand the compat API into a product API.
|
|
199
|
+
|
|
200
|
+
Acceptance:
|
|
201
|
+
|
|
202
|
+
- The full real EasyNet test passes.
|
|
203
|
+
- The docs state where every remaining compat behavior will end up.
|
|
204
|
+
|
|
205
|
+
### 7. Protocol Surface
|
|
206
|
+
|
|
207
|
+
Goal:
|
|
208
|
+
|
|
209
|
+
- HTTP / in-process / future ACP / A2A / AG-UI all call the same runtime contract.
|
|
210
|
+
|
|
211
|
+
Deliverables:
|
|
212
|
+
|
|
213
|
+
- request API
|
|
214
|
+
- event stream API
|
|
215
|
+
- run inspection API
|
|
216
|
+
- approval API
|
|
217
|
+
- artifact API
|
|
218
|
+
- protocol tests
|
|
219
|
+
|
|
220
|
+
Prohibited:
|
|
221
|
+
|
|
222
|
+
- The protocol layer does not execute agents.
|
|
223
|
+
- The protocol layer contains no backend or workspace business logic.
|
|
224
|
+
|
|
225
|
+
Acceptance:
|
|
226
|
+
|
|
227
|
+
- Protocol tests pass.
|
|
228
|
+
- The full real EasyNet test passes.
|
|
229
|
+
|
|
230
|
+
### 8. Replay / Evaluation
|
|
231
|
+
|
|
232
|
+
Goal:
|
|
233
|
+
|
|
234
|
+
- Base replay / eval on runtime events and artifacts instead of rerunning prompt heuristics.
|
|
235
|
+
|
|
236
|
+
Deliverables:
|
|
237
|
+
|
|
238
|
+
- replay manifest
|
|
239
|
+
- evaluation fixture runner
|
|
240
|
+
- trace export
|
|
241
|
+
- artifact reference validation
|
|
242
|
+
|
|
243
|
+
Prohibited:
|
|
244
|
+
|
|
245
|
+
- Do not infer tool calls from natural-language output.
|
|
246
|
+
- Do not turn an assertion from a specific EasyNet case into a runtime rule.
|
|
247
|
+
|
|
248
|
+
Acceptance:
|
|
249
|
+
|
|
250
|
+
- Replay tests pass.
|
|
251
|
+
- The full real EasyNet test passes.
|
|
252
|
+
|
|
253
|
+
### 9. Memory Lifecycle
|
|
254
|
+
|
|
255
|
+
Goal:
|
|
256
|
+
|
|
257
|
+
- The runtime manages the memory lifecycle and does not replace backend-native memory semantics.
|
|
258
|
+
|
|
259
|
+
Deliverables:
|
|
260
|
+
|
|
261
|
+
- memory namespace contract
|
|
262
|
+
- recall coordination events
|
|
263
|
+
- import/export hooks
|
|
264
|
+
- compaction hooks
|
|
265
|
+
- tests with in-memory store
|
|
266
|
+
|
|
267
|
+
Prohibited:
|
|
268
|
+
|
|
269
|
+
- Do not fabricate a unified memory semantics that overrides DeepAgents native memory.
|
|
270
|
+
- Do not do business routing inside the memory capability.
|
|
271
|
+
|
|
272
|
+
Acceptance:
|
|
273
|
+
|
|
274
|
+
- Memory tests pass.
|
|
275
|
+
- The full real EasyNet test passes.
|
|
276
|
+
|
|
277
|
+
### 10. Native Stable Package Migration
|
|
278
|
+
|
|
279
|
+
Goal:
|
|
280
|
+
|
|
281
|
+
- Migrate EasyNet from the explicit compat facade to the stable native API.
|
|
282
|
+
|
|
283
|
+
Deliverables:
|
|
284
|
+
|
|
285
|
+
- EasyNet dependency migration plan
|
|
286
|
+
- native API usage examples
|
|
287
|
+
- compat fallback plan
|
|
288
|
+
- final migration test report
|
|
289
|
+
|
|
290
|
+
Prohibited:
|
|
291
|
+
|
|
292
|
+
- Do not weaken EasyNet contract tests for the sake of migration.
|
|
293
|
+
- Do not hide the JSON contract / specialist ownership / tool boundary.
|
|
294
|
+
|
|
295
|
+
Acceptance:
|
|
296
|
+
|
|
297
|
+
- EasyNet `npm test` passes.
|
|
298
|
+
- EasyNet `npm run test:botbotgo:full` passes.
|
|
299
|
+
- EasyNet no longer depends on compat-only APIs, or the final blockers are listed explicitly.
|
|
300
|
+
|
|
301
|
+
## Overall Sequence Diagram
|
|
302
|
+
|
|
303
|
+
```mermaid
|
|
304
|
+
sequenceDiagram
|
|
305
|
+
participant Dev as Developer
|
|
306
|
+
participant SH as stable-harness
|
|
307
|
+
participant EN as EasyNet
|
|
308
|
+
participant Model as Real Model
|
|
309
|
+
participant Tools as Real Tools/Data
|
|
310
|
+
participant Docs as docs/
|
|
311
|
+
|
|
312
|
+
Dev->>SH: Implement one runtime capability
|
|
313
|
+
SH->>SH: npm run check
|
|
314
|
+
SH->>SH: npm run check:rules
|
|
315
|
+
SH->>SH: npm test
|
|
316
|
+
Dev->>EN: npm test
|
|
317
|
+
EN->>Model: Real model calls
|
|
318
|
+
EN->>Tools: Real tool/data execution
|
|
319
|
+
Dev->>EN: npm run test:botbotgo:full
|
|
320
|
+
EN->>Model: Full matrix real model calls
|
|
321
|
+
EN->>Tools: Full matrix real tools/data
|
|
322
|
+
Dev->>Docs: Write Chinese step report
|
|
323
|
+
Docs->>Docs: Include sequence diagram and flow chart
|
|
324
|
+
Dev->>SH: Commit only after all gates pass
|
|
325
|
+
```
|
|
326
|
+
|
|
327
|
+
## Overall Flow Chart
|
|
328
|
+
|
|
329
|
+
```mermaid
|
|
330
|
+
flowchart TD
|
|
331
|
+
A["Pick next runtime capability"] --> B["Classify boundary"]
|
|
332
|
+
B --> C{"Who owns the behavior?"}
|
|
333
|
+
C -->|Upstream framework| D["Adapter passthrough"]
|
|
334
|
+
C -->|Runtime/control plane| E["Typed stable capability"]
|
|
335
|
+
C -->|Downstream app| F["Workspace config/tool/test"]
|
|
336
|
+
C -->|Historical workaround| G["Compat only or delete"]
|
|
337
|
+
D --> H["Implement narrow change"]
|
|
338
|
+
E --> H
|
|
339
|
+
F --> H
|
|
340
|
+
G --> H
|
|
341
|
+
H --> I["Run stable-harness checks/tests"]
|
|
342
|
+
I --> J["Run EasyNet npm test"]
|
|
343
|
+
J --> K["Run EasyNet full botbotgo matrix"]
|
|
344
|
+
K --> L{"All real gates pass?"}
|
|
345
|
+
L -->|No| M["Fix at correct boundary"]
|
|
346
|
+
M --> B
|
|
347
|
+
L -->|Yes| N["Write Chinese docs report"]
|
|
348
|
+
N --> O["Commit and push"]
|
|
349
|
+
```
|
|
350
|
+
|
|
351
|
+
## Per-Step Report Template
|
|
352
|
+
|
|
353
|
+
After each step is complete, add the following under `docs/`:
|
|
354
|
+
|
|
355
|
+
```text
|
|
356
|
+
0.1.0-step-<number>-<short-name>.zh.md
|
|
357
|
+
```
|
|
358
|
+
|
|
359
|
+
Template:
|
|
360
|
+
|
|
361
|
+
```markdown
|
|
362
|
+
# Step <number>: <name>
|
|
363
|
+
|
|
364
|
+
## Goal
|
|
365
|
+
|
|
366
|
+
## Scope of Changes
|
|
367
|
+
|
|
368
|
+
## Runtime Boundary Assessment
|
|
369
|
+
|
|
370
|
+
## Implementation Details
|
|
371
|
+
|
|
372
|
+
## EasyNet Real Tests
|
|
373
|
+
|
|
374
|
+
### Commands
|
|
375
|
+
|
|
376
|
+
### Results
|
|
377
|
+
|
|
378
|
+
### Model and Data Notes
|
|
379
|
+
|
|
380
|
+
## Risks and Open Issues
|
|
381
|
+
|
|
382
|
+
## Sequence Diagram
|
|
383
|
+
|
|
384
|
+
```mermaid
|
|
385
|
+
sequenceDiagram
|
|
386
|
+
```
|
|
387
|
+
|
|
388
|
+
## Flow Chart
|
|
389
|
+
|
|
390
|
+
```mermaid
|
|
391
|
+
flowchart TD
|
|
392
|
+
```
|
|
393
|
+
```
|
|
@@ -0,0 +1,42 @@
|
|
|
1
|
+
# 0.1.0 Tool Guard Benchmark
|
|
2
|
+
|
|
3
|
+
Generated at: 2026-05-07T00:40:17.180Z
|
|
4
|
+
|
|
5
|
+
## Test Setup
|
|
6
|
+
|
|
7
|
+
- Remote Ollama: `https://ollama-rtx-4070.easynet.world`
|
|
8
|
+
- Natural-case rounds per model: `10`; total natural cases: `50`
|
|
9
|
+
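- The injected-error matrix covers: unknown tool, wrong tool name, missing required argument, wrong type, wrong enum, extra arg, absolute path, semantically wrong ticker, unparseable arguments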
|
|
10
|
+
- This benchmark is a product-level fault-injection run plus a local BFCL-style subset; it is not an official BFCL score.
|
|
11
|
+
|
|
12
|
+
## Natural Tool Calls
|
|
13
|
+
|
|
14
|
+
| Model | Repair | Natural cases | Exact | Baseline Accepted | Bad Exec without Guard | Bad Exec with Guard | Final Accepted |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3:0.6b | off | 50 | 80% | 80% | 20% | 0% | 80% |
| qwen3:0.6b | on | 50 | 80% | 80% | 20% | 0% | 80% |
| qwen3.5:0.8b | off | 50 | 100% | 100% | 0% | 0% | 100% |
| qwen3.5:0.8b | on | 50 | 100% | 100% | 0% | 0% | 100% |
| qwen3.5:2b | off | 50 | 100% | 100% | 0% | 0% | 100% |
| qwen3.5:2b | on | 50 | 100% | 100% | 0% | 0% | 100% |
| granite4.1:3b | off | 50 | 100% | 100% | 0% | 0% | 100% |
| granite4.1:3b | on | 50 | 100% | 100% | 0% | 0% | 100% |
| qwen3.5:4b | off | 50 | 100% | 100% | 0% | 0% | 100% |
| qwen3.5:4b | on | 50 | 100% | 100% | 0% | 0% | 100% |
|
|
26
|
+
|
|
27
|
+
## Injected-Error Matrix
|
|
28
|
+
|
|
29
|
+
| Model | Injected errors blocked by Guard | Injected errors repaired | Error types covered |
| --- | --- | --- | --- |
| qwen3:0.6b | 100% | 66.7% | name, schema, type, semantic |
| qwen3.5:0.8b | 100% | 66.7% | name, schema, type, semantic |
| qwen3.5:2b | 100% | 100% | name, schema, type, semantic |
| granite4.1:3b | 100% | 100% | name, schema, type, semantic |
| qwen3.5:4b | 100% | 100% | name, schema, type, semantic |
|
|
36
|
+
|
|
37
|
+
## Conclusions
|
|
38
|
+
|
|
39
|
+
- The core benefit of the Guard is keeping bad tool calls out of the real execution layer; in this run, 100% of injected errors were blocked.
|
|
40
|
+
- In the natural output of `qwen3:0.6b`, 20% of calls were registered tool calls that would otherwise have executed incorrectly; with the Guard enabled, bad execution drops from 20% to 0%.
|
|
41
|
+
- `qwen3.5:2b`, `granite4.1:3b`, and `qwen3.5:4b` reach a 100% one-round repair success rate on injected errors. This conclusion applies only to this benchmark's injected-error matrix.
|
|
42
|
+
- `qwen3.5:0.8b` and larger already have a 100% baseline on this round's natural cases, so there is no observable accepted-rate uplift in the natural scenario.
|