stable-harness 0.0.7 → 0.0.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (39)
  1. package/README.md +10 -0
  2. package/docs/0.1.0-p0-runtime-control-plane-plan.zh.md +171 -0
  3. package/docs/0.1.0-retry-policy.zh.md +87 -0
  4. package/docs/0.1.0-stable-runtime-development-roadmap.zh.md +393 -0
  5. package/docs/0.1.0-tool-guard-benchmark.zh.md +42 -0
  6. package/docs/adapter-contract.md +199 -0
  7. package/docs/architecture/backend-comparison.md +41 -0
  8. package/docs/architecture/runtime-events.md +263 -0
  9. package/docs/architecture/runtime-events.zh.md +248 -0
  10. package/docs/architecture/system-architecture.zh.md +435 -0
  11. package/docs/compatibility-matrix.md +139 -0
  12. package/docs/engineering-rules.md +111 -0
  13. package/docs/evaluation/0.1.0-bfcl-targeted-model-matrix.zh.md +1632 -0
  14. package/docs/evaluation/0.1.0-bfcl-targeted-review-matrix.zh.md +1952 -0
  15. package/docs/evaluation/0.1.0-bfcl-tool-guard.zh.md +1427 -0
  16. package/docs/granite-tool-calling-comparison.zh.md +206 -0
  17. package/docs/guides/getting-started.md +126 -0
  18. package/docs/guides/index.md +40 -0
  19. package/docs/guides/integration-guide.md +126 -0
  20. package/docs/guides/operator-runbook.md +153 -0
  21. package/docs/guides/workspace-authoring.md +212 -0
  22. package/docs/implementation-blueprint.md +233 -0
  23. package/docs/memory/0.1.0-memory-design.zh.md +719 -0
  24. package/docs/memory/0.1.0-step-09-deepagents-native-memory.zh.md +146 -0
  25. package/docs/memory/0.1.0-step-09-langmem-shaped-provider.zh.md +169 -0
  26. package/docs/memory/0.1.0-step-09-memory-adapter-projection.zh.md +123 -0
  27. package/docs/memory/0.1.0-step-09-memory-contract.zh.md +169 -0
  28. package/docs/memory/0.1.0-step-09-memory-governance-approval.zh.md +143 -0
  29. package/docs/memory/0.1.0-step-09-memory-lifecycle-hooks.zh.md +150 -0
  30. package/docs/memory/0.1.0-step-09-memory-maintenance-boundary.zh.md +118 -0
  31. package/docs/memory/0.1.0-step-09-memory-persistence-boundary.zh.md +118 -0
  32. package/docs/product/adoption-playbook.md +145 -0
  33. package/docs/product/market-positioning.md +137 -0
  34. package/docs/product-boundary.md +258 -0
  35. package/docs/protocols/http-runtime.md +37 -0
  36. package/docs/protocols/langgraph-compatible.md +107 -0
  37. package/docs/protocols/openai-compatible.md +121 -0
  38. package/docs/tooling/0.1.0-bettercall-tool-quality.zh.md +231 -0
  39. package/package.json +3 -1
package/README.md CHANGED
@@ -206,6 +206,16 @@ This is constrained repair, not silent magic:
206
206
  - LangGraph-compatible facade: [docs/protocols/langgraph-compatible.md](docs/protocols/langgraph-compatible.md)
207
207
  - HTTP runtime protocol: [docs/protocols/http-runtime.md](docs/protocols/http-runtime.md)
208
208
 
209
+ ## Documentation
210
+
211
+ - [Documentation index](docs/guides/index.md)
212
+ - [Getting started](docs/guides/getting-started.md)
213
+ - [Workspace authoring](docs/guides/workspace-authoring.md)
214
+ - [Integration guide](docs/guides/integration-guide.md)
215
+ - [Operator runbook](docs/guides/operator-runbook.md)
216
+ - [Adoption playbook](docs/product/adoption-playbook.md)
217
+ - [Market positioning](docs/product/market-positioning.md)
218
+
209
219
  ## Product Boundary
210
220
 
211
221
  Read these before adding public runtime behavior:
package/docs/0.1.0-p0-runtime-control-plane-plan.zh.md ADDED
@@ -0,0 +1,171 @@
1
+ # stable-harness P0 Runtime Control Plane Plan
2
+
3
+ ## Goal
4
+
5
+ `stable-harness` must become the product runtime and operator control plane for
6
+ agent workspaces. It must not recreate upstream agent execution semantics.
7
+
8
+ This plan closes the highest-priority gaps against `agent-harness` by adding
9
+ typed stable runtime capabilities that are independently pluggable, testable,
10
+ and replaceable.
11
+
12
+ ## Non-Goals
13
+
14
+ - Do not copy `agent-harness` internal module layout.
15
+ - Do not add EasyNet-specific routing, prompt, ticker, news, Kubernetes, QA, or
16
+ release heuristics to core runtime.
17
+ - Do not parse TODO text to create tool calls.
18
+ - Do not recreate DeepAgents planning, delegation, virtual filesystem, or
19
+ upstream tool-call semantics.
20
+ - Do not make compatibility code the native design target.
21
+
22
+ ## Capability Classification
23
+
24
+ Useful `agent-harness` behavior must be reclassified before implementation:
25
+
26
+ - Upstream execution semantics: expose through backend adapter passthrough.
27
+ - Runtime/control-plane semantics: implement as stable typed capability.
28
+ - Downstream application logic: keep in workspace config, tools, or prompts.
29
+ - Historical workaround: keep in explicit compat path or delete.
30
+
31
+ ## Milestone P0-1: Durable Runtime Stores
32
+
33
+ ### Scope
34
+
35
+ Add the minimal stable persistence boundary for runtime records:
36
+
37
+ - `RuntimeStore` interface.
38
+ - In-memory implementation.
39
+ - JSON file implementation for local durable development.
40
+ - Runtime integration through dependency injection.
41
+ - Store-backed run, event, artifact, and state updates.
42
+
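As an illustration of the P0-1 scope above (not code from the package), here is a minimal TypeScript sketch of a store contract with an in-memory implementation and constructor-style injection. Only the name `RuntimeStore` comes from the plan; the record shapes, method names, and `createRuntime` are assumptions made for this example.

```typescript
// Hypothetical record shapes; the real stable-harness types are not shown in this plan.
interface RunRecord {
  runId: string;
  status: "queued" | "running" | "completed" | "failed";
}

interface EventRecord {
  runId: string;
  type: string;
  at: number; // epoch milliseconds
}

// Assumed minimal contract: runs are upserted, events are append-only.
interface RuntimeStore {
  putRun(run: RunRecord): void;
  getRun(runId: string): RunRecord | undefined;
  appendEvent(event: EventRecord): void;
  listEvents(runId: string): EventRecord[];
}

class InMemoryRuntimeStore implements RuntimeStore {
  private runs = new Map<string, RunRecord>();
  private events: EventRecord[] = [];

  putRun(run: RunRecord): void {
    this.runs.set(run.runId, { ...run });
  }
  getRun(runId: string): RunRecord | undefined {
    const run = this.runs.get(runId);
    return run ? { ...run } : undefined;
  }
  appendEvent(event: EventRecord): void {
    this.events.push({ ...event });
  }
  listEvents(runId: string): EventRecord[] {
    return this.events.filter((e) => e.runId === runId);
  }
}

// Dependency injection: the runtime falls back to an in-memory store when none is passed.
function createRuntime(store: RuntimeStore = new InMemoryRuntimeStore()) {
  return {
    store,
    start(runId: string): void {
      store.putRun({ runId, status: "running" });
      store.appendEvent({ runId, type: "run.started", at: Date.now() });
    },
  };
}
```

A JSON file implementation would satisfy the same interface, which is what lets a recreated runtime read back the records written by an earlier instance.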
43
+ ### Acceptance
44
+
45
+ - Existing runtime behavior is unchanged when no store is provided.
46
+ - A file-backed runtime can survive runtime recreation.
47
+ - Events are appended through the store.
48
+ - Artifacts remain request-scoped.
49
+ - No source file exceeds project size limits.
50
+
51
+ ### Tests
52
+
53
+ - Unit tests for in-memory store.
54
+ - Unit tests for JSON file store reload.
55
+ - Runtime request test with injected file store.
56
+ - `npm run check`.
57
+ - `npm run check:rules`.
58
+ - `npm test`.
59
+
60
+ ## Milestone P0-2: Operator Inspection Contract
61
+
62
+ ### Scope
63
+
64
+ Extend native inspection without copying legacy structures:
65
+
66
+ - Session summary.
67
+ - Request summary.
68
+ - Request detail.
69
+ - Runtime snapshot for workspace/agent/model/tool binding.
70
+ - Event timeline projection.
71
+ - Artifact listing projection.
72
+
73
+ ### Acceptance
74
+
75
+ - Operators can inspect a request without reading raw run internals.
76
+ - Protocol clients can fetch request/session projections.
77
+ - Existing `inspect()` remains backward-compatible.
78
+
79
+ ### Tests
80
+
81
+ - Runtime inspection unit tests.
82
+ - HTTP/in-process protocol tests.
83
+ - Existing runtime and trace tests.
84
+
85
+ ## Milestone P0-3: Queue and Recovery Core
86
+
87
+ ### Scope
88
+
89
+ Add request lifecycle primitives:
90
+
91
+ - Durable queue record.
92
+ - Priority and queue key.
93
+ - Claim/lease/heartbeat.
94
+ - Cancel intent.
95
+ - Stuck request detection.
96
+ - Recovery intent record.
97
+
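The claim/lease/heartbeat and stuck-detection primitives listed above can be sketched as pure functions over a queue record. This is a hedged illustration, not the package's implementation: the `QueueRecord` fields, the 30-second lease, and the function names are all assumptions.

```typescript
// Hypothetical queue record; field names are assumptions for illustration.
interface QueueRecord {
  requestId: string;
  priority: number;
  queueKey: string;
  leaseOwner?: string;
  leaseExpiresAt?: number; // epoch milliseconds
  cancelRequested: boolean;
}

const LEASE_MS = 30000; // assumed lease duration

// Claim succeeds only if the record is unleased or its lease has expired,
// and no cancel intent has been recorded.
function claim(rec: QueueRecord, worker: string, now: number): boolean {
  const leased = rec.leaseOwner !== undefined && (rec.leaseExpiresAt ?? 0) > now;
  if (leased || rec.cancelRequested) return false;
  rec.leaseOwner = worker;
  rec.leaseExpiresAt = now + LEASE_MS;
  return true;
}

// Heartbeat extends the lease, but only for the current owner of a live lease.
function heartbeat(rec: QueueRecord, worker: string, now: number): boolean {
  if (rec.leaseOwner !== worker || (rec.leaseExpiresAt ?? 0) <= now) return false;
  rec.leaseExpiresAt = now + LEASE_MS;
  return true;
}

// A request is "stuck" when it has an owner whose lease expired without a heartbeat.
function isStuck(rec: QueueRecord, now: number): boolean {
  return rec.leaseOwner !== undefined && (rec.leaseExpiresAt ?? 0) <= now;
}
```

Because every decision reads only typed fields and a clock value, the same checks can run after a restart against persisted records, which matches the acceptance rule that recovery decisions come from runtime state rather than prompt text.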
98
+ ### Acceptance
99
+
100
+ - Queue logic is typed and domain-neutral.
101
+ - Recovery decisions come from runtime state, not prompt text.
102
+ - Running requests can be inspected and reconciled after restart.
103
+
104
+ ### Tests
105
+
106
+ - Queue lease and expiration tests.
107
+ - Heartbeat/stuck detection tests.
108
+ - Cancel intent tests.
109
+ - Recovery intent persistence tests.
110
+
111
+ ## Milestone P0-4: Artifacts, Evidence, and Replay Bundles
112
+
113
+ ### Scope
114
+
115
+ Make evidence first-class:
116
+
117
+ - Durable artifact content store.
118
+ - Artifact listing and read APIs.
119
+ - Evaluation bundle export.
120
+ - Replay bundle validation.
121
+ - Hash/metadata checks for stored artifacts.
122
+
123
+ ### Acceptance
124
+
125
+ - Request evidence can be exported from runtime records alone.
126
+ - Replay bundle contains runs, events, artifacts, and runtime metadata.
127
+ - Artifact APIs are independent from backend adapters.
128
+
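To make the replay bundle acceptance criteria concrete, the following sketch checks that a bundle is self-consistent: every event and artifact points at a known run, and every artifact reference resolves. The `ReplayBundle` shape and `validateReplayBundle` are hypothetical; the plan only states that a bundle contains runs, events, artifacts, and runtime metadata.

```typescript
// Hypothetical bundle shape based on the acceptance list above.
interface ReplayBundle {
  runs: { runId: string }[];
  events: { runId: string; type: string; artifactId?: string }[];
  artifacts: { artifactId: string; runId: string }[];
  runtime: { version: string };
}

// Returns a list of problems; an empty list means the bundle is self-consistent.
function validateReplayBundle(bundle: ReplayBundle): string[] {
  const problems: string[] = [];
  const runIds = new Set(bundle.runs.map((r) => r.runId));
  const artifactIds = new Set(bundle.artifacts.map((a) => a.artifactId));

  for (const e of bundle.events) {
    if (!runIds.has(e.runId)) {
      problems.push(`event ${e.type} references unknown run ${e.runId}`);
    }
    if (e.artifactId !== undefined && !artifactIds.has(e.artifactId)) {
      problems.push(`event ${e.type} references missing artifact ${e.artifactId}`);
    }
  }
  for (const a of bundle.artifacts) {
    if (!runIds.has(a.runId)) {
      problems.push(`artifact ${a.artifactId} references unknown run ${a.runId}`);
    }
  }
  if (bundle.runtime.version === "") {
    problems.push("runtime metadata is missing a version");
  }
  return problems;
}
```

The check uses runtime records only and never calls a backend adapter, in line with the requirement that artifact APIs stay independent from adapters.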
129
+ ### Tests
130
+
131
+ - Artifact create/list/read tests.
132
+ - Evaluation bundle export tests.
133
+ - Replay validation tests.
134
+ - Existing EasyNet native stable tests.
135
+
136
+ ## Milestone P0-5: Native NL Migration Gate
137
+
138
+ ### Scope
139
+
140
+ Move natural-language execution toward native stable runtime:
141
+
142
+ - Keep DeepAgents as upstream execution source of truth.
143
+ - Preserve trace/evidence visibility through stable events.
144
+ - Keep compatibility facade explicit.
145
+ - Remove native dependence on compatibility runner when upstream behavior is
146
+ stable enough.
147
+
148
+ ### Acceptance
149
+
150
+ - EasyNet typed native gates pass.
151
+ - EasyNet natural-language matrix has a native stable path or documented
152
+ upstream blocker.
153
+ - No downstream-specific runtime heuristics are added.
154
+
155
+ ### Tests
156
+
157
+ - Stable native CLI tests.
158
+ - EasyNet native stable tests.
159
+ - EasyNet full matrix as migration gate.
160
+
161
+ ## Execution Rules
162
+
163
+ Each milestone must follow this loop:
164
+
165
+ 1. Inspect existing code and tests.
166
+ 2. Implement the smallest typed stable capability.
167
+ 3. Add focused tests.
168
+ 4. Run `npm run check`.
169
+ 5. Run `npm run check:rules`.
170
+ 6. Run `npm test`.
171
+ 7. Only then continue to the next milestone.
package/docs/0.1.0-retry-policy.zh.md ADDED
@@ -0,0 +1,87 @@
1
+ # Retry Policy
2
+
3
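+ Retry in `stable-harness` covers production stability only; it does not take over DeepAgents execution semantics.
4
+
5
+ ## Boundaries
6
+
7
+ | Scenario | Owner | Behavior |
8
+ | --- | --- | --- |
9
+ | Model API timeout, rate limit, 5xx | runtime policy | Retry the same model call through LangChain `modelRetryMiddleware` |
10
+ | Tool backend timeout, transient network failure | runtime policy | Retry the same tool call through LangChain `toolRetryMiddleware` |
11
+ | Tool argument type error, enum error, semantic error | tool gateway guard | Return `ToolMessage(status="error")` so the model can repair itself on the next turn |
12
+ | Whether the model calls the tool again after seeing the ToolMessage | DeepAgents/LangChain agent loop | No deterministic guarantee; needs a benchmark |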
13
+
14
+ ## YAML
15
+
16
+ ```yaml
17
+ apiVersion: stable-harness.dev/v1
18
+ kind: Runtime
19
+ metadata:
20
+   name: production
21
+ spec:
22
+   routing:
23
+     defaultAgentId: orchestra
24
+   retry:
25
+     model:
26
+       enabled: true
27
+       maxRetries: 3
28
+       initialDelayMs: 1000
29
+       backoffFactor: 2
30
+       jitter: true
31
+       onFailure: continue
32
+     tools:
33
+       enabled: true
34
+       tools:
35
+         - web_search
36
+         - fetch_url
37
+       retryOn:
38
+         - timeout
39
+         - network
40
+         - rateLimit
41
+         - serverError
42
+       maxRetries: 2
43
+       initialDelayMs: 500
44
+       backoffFactor: 2
45
+       jitter: true
46
+       onFailure: continue
47
+ ```
48
+
49
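+ `retryOn` is a string whitelist owned by the stable runtime; it does not accept JavaScript functions. The default is equivalent to `timeout/network/rateLimit/serverError`. For `tools`, list only external I/O tools that can hit transient failures. Do not put schema or argument repair into the retry policy; argument repair is handled by `ToolArgumentGuard`, which produces a readable error and hands it back to the model.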
50
+
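To show how the retry fields in the YAML above interact, here is a small sketch that computes a plain exponential backoff schedule and applies the `retryOn` whitelist. It assumes the delay is `initialDelayMs * backoffFactor^(attempt - 1)` and ignores `jitter`; the actual LangChain middleware may compute delays differently, and `RetryPolicy`, `backoffSchedule`, and `shouldRetry` are names invented for this example.

```typescript
// Field names mirror the YAML above; the reason strings mirror `retryOn`.
type RetryReason = "timeout" | "network" | "rateLimit" | "serverError";

interface RetryPolicy {
  maxRetries: number;
  initialDelayMs: number;
  backoffFactor: number;
  retryOn: RetryReason[];
}

// Delay before retry number `attempt` (1-based), without jitter (assumed formula).
function delayForAttempt(policy: RetryPolicy, attempt: number): number {
  return policy.initialDelayMs * Math.pow(policy.backoffFactor, attempt - 1);
}

// Full schedule of delays for one call under the policy.
function backoffSchedule(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= policy.maxRetries; attempt++) {
    delays.push(delayForAttempt(policy, attempt));
  }
  return delays;
}

// Only whitelisted, transient failure classes are retried; anything else
// (for example an argument validation error) is not a retry candidate.
function shouldRetry(policy: RetryPolicy, reason: string, attempt: number): boolean {
  return attempt <= policy.maxRetries && (policy.retryOn as string[]).includes(reason);
}
```

Under these assumptions the `model` policy waits 1000, 2000, and 4000 ms before its three retries, and the `tools` policy waits 500 and 1000 ms before its two.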
51
+ ## Sequence
52
+
53
+ ```mermaid
54
+ sequenceDiagram
55
+ participant User
56
+ participant Runtime as stable-harness runtime
57
+ participant Adapter as DeepAgents adapter
58
+ participant LC as LangChain retry middleware
59
+ participant DA as DeepAgents agent loop
60
+ participant Tool as Tool gateway
61
+ participant Model as LLM
62
+
63
+ User->>Runtime: request
64
+ Runtime->>Adapter: run with retry policy
65
+ Adapter->>LC: install model/tool retry middleware
66
+ Adapter->>DA: createDeepAgent(params)
67
+ DA->>Model: model call
68
+ alt transient model failure
69
+ LC->>Model: retry model call
70
+ end
71
+ Model->>DA: tool call
72
+ DA->>Tool: invoke
73
+ alt transient tool failure
74
+ LC->>Tool: retry same tool call
75
+ else argument validation failure
76
+ Tool-->>DA: ToolMessage(status="error")
77
+ DA->>Model: feed error observation
78
+ Model->>DA: maybe corrected tool call
79
+ end
80
+ DA-->>Runtime: final result
81
+ ```
82
+
83
+ ## Current Tests
84
+
85
+ - `DeepAgents retry policy is translated to upstream middleware`: confirms that `Runtime.spec.retry` is translated by the adapter into the official LangChain middleware.
86
+ - `DeepAgents tool retry policy retries transient gateway failures`: real DeepAgents plus a fake tool-calling model; the first gateway tool call throws a transient error and the retry succeeds.
87
+ - `npm run benchmark:retry-policy`: compares success rate, attempts, and latency with the retry policy on and off.
package/docs/0.1.0-stable-runtime-development-roadmap.zh.md ADDED
@@ -0,0 +1,393 @@
1
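+ # Stable Harness Runtime Refactoring Roadmap
2
+
3
+ This document defines the step-by-step development checklist for the upcoming `stable-harness` refactoring. The goal is to build `stable-harness` into a clean runtime / operator control plane, not to replicate `agent-harness` or to write EasyNet business rules into the runtime.
4
+
5
+ ## Hard Execution Rules
6
+
7
+ - After every development step, run the complete real EasyNet verification.
8
+ - EasyNet verification must connect to a real model and run the real workspace, real tools, and real data paths.
9
+ - After every development step, add a Chinese-language report under `docs/`.
10
+ - Every step report must include:
11
+   - Refactoring goal
12
+   - Scope of code changes
13
+   - Runtime boundary assessment
14
+   - EasyNet real-test commands and results
15
+   - Failures, retries, risks, and open issues
16
+   - sequence diagram
17
+   - flow chart
18
+ - Do not write EasyNet business rules into the `stable-harness` runtime just to make an EasyNet case pass.
19
+ - `runtime/compat` and `compat/*` may serve only as a migration path; they must not carry new native runtime capabilities.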
20
+
21
+ ## Uniform Verification Gate for Every Step
22
+
23
+ Before finishing each step, run at least:
24
+
25
+ ```bash
26
+ npm run check
27
+ npm run check:rules
28
+ npm test
29
+ ```
30
+
31
+ In EasyNet, run:
32
+
33
+ ```bash
34
+ npm test
35
+ npm run test:botbotgo:full
36
+ ```
37
+
38
+ If needed, add a targeted, filtered retry:
39
+
40
+ ```bash
41
+ EASYNET_FULL_MATRIX_FILTER=<case_id> npm run test:botbotgo:full
42
+ ```
43
+
44
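+ The test record must state:
45
+
46
+ - Workspace used: `/Users/boqiangliang/project/easynet`
47
+ - Runtime used: `stable-harness -> file:../stable-harness`
48
+ - Whether a real model was connected
49
+ - Whether real tools were executed
50
+ - Whether there were untracked files, external service failures, model variance, or Kubernetes environment limitations
51
+
52
+ ## Development Step Checklist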
53
+
54
+ ### 1. RunStore / EventStore / ArtifactStore
55
+
56
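+ Goals:
57
+
58
+ - Split the in-memory `Map` currently inside the core runtime into replaceable store interfaces.
59
+ - The runtime depends only on the store contract and is not bound directly to the in-memory implementation.
60
+
61
+ Deliverables:
62
+
63
+ - `RunStore`
64
+ - `EventStore`
65
+ - `ArtifactStore`
66
+ - in-memory store implementation
67
+ - core runtime uses the stores
68
+ - store-focused tests
69
+
70
+ Prohibited:
71
+
72
+ - Do not introduce SQLite as the default implementation in the first step.
73
+ - Do not copy the persistence structure of `agent-harness`.
74
+
75
+ Acceptance:
76
+
77
+ - Runtime inspection can still return run/event/artifact.
78
+ - The complete real EasyNet test passes.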
79
+
80
+ ### 2. Runtime Event Model Standardization
81
+
82
+ Goals:
83
+
84
+ - Unify tool events, delegation events, approval events, and artifact events into a stable event envelope.
85
+ - The CLI and protocols consume the unified events rather than the compat runner's ad hoc deltas.
86
+
87
+ Deliverables:
88
+
89
+ - `RuntimeEventEnvelope`
90
+ - typed event payloads
91
+ - event projection helpers
92
+ - event store append/read tests
93
+
94
+ Prohibited:
95
+
96
+ - Do not parse TODO text to generate events.
97
+ - Do not add special cases for EasyNet specialist events.
98
+
99
+ Acceptance:
100
+
101
+ - The EasyNet CLI trace still shows delegation, tool start/result, and the TODO trace.
102
+ - The complete real test passes.
103
+
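One way to picture the `RuntimeEventEnvelope` and "event projection helpers" deliverables of step 2 is a discriminated union wrapped in shared metadata. The sketch below is an assumption-laden illustration: only the name `RuntimeEventEnvelope` comes from the roadmap, while the payload kinds, field names, and `projectTimeline` are hypothetical.

```typescript
// Hypothetical payload kinds; the roadmap names the event families but not their fields.
type RuntimeEventPayload =
  | { kind: "tool.start"; toolName: string }
  | { kind: "tool.result"; toolName: string; ok: boolean }
  | { kind: "delegation"; toAgentId: string }
  | { kind: "approval"; decision: "allow" | "deny" | "pending" }
  | { kind: "artifact"; artifactId: string };

// Assumed envelope: shared metadata plus exactly one typed payload.
interface RuntimeEventEnvelope {
  eventId: string;
  runId: string;
  sequence: number; // increases monotonically within a run
  at: string; // ISO 8601 timestamp
  payload: RuntimeEventPayload;
}

// Exhaustive switch: adding a payload kind without handling it is a compile error.
function describePayload(p: RuntimeEventPayload): string {
  switch (p.kind) {
    case "tool.start":
      return `tool.start ${p.toolName}`;
    case "tool.result":
      return `tool.result ${p.toolName} ${p.ok ? "ok" : "error"}`;
    case "delegation":
      return `delegation -> ${p.toAgentId}`;
    case "approval":
      return `approval ${p.decision}`;
    case "artifact":
      return `artifact ${p.artifactId}`;
  }
}

// Projection helper: one timeline line per event, ordered by sequence.
function projectTimeline(events: RuntimeEventEnvelope[]): string[] {
  return events
    .slice()
    .sort((a, b) => a.sequence - b.sequence)
    .map((e) => `${e.sequence} ${describePayload(e.payload)}`);
}
```

A CLI trace and a protocol event stream could both be built on a projection like this, so neither needs to read compat runner deltas directly.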
104
+ ### 3. Connect ToolGateway to the Runtime
105
+
106
+ Goals:
107
+
108
+ - Move native tool execution from direct invocation to `@stable-harness/tool-gateway`.
109
+ - The compat runner keeps direct invocation only during the migration phase.
110
+
111
+ Deliverables:
112
+
113
+ - runtime-level tool gateway injection
114
+ - tool invocation context
115
+ - tool start/result/error events
116
+ - schema validation hook
117
+ - focused gateway tests
118
+
119
+ Prohibited:
120
+
121
+ - Do not perform natural-language tool selection inside the gateway.
122
+ - Do not write finance/k8s/git/qa or similar tool rules into the gateway.
123
+
124
+ Acceptance:
125
+
126
+ - All EasyNet specialist tools still run through real tool execution.
127
+ - The complete real test passes.
128
+
129
+ ### 4. Governance / Approval Queue
130
+
131
+ Goals:
132
+
133
+ - Turn approval, sandbox, and resource limits into independent runtime capabilities.
134
+ - Before tool execution, a typed policy can route the call into the approval queue.
135
+
136
+ Deliverables:
137
+
138
+ - `ApprovalQueue`
139
+ - `GovernanceDecision`
140
+ - policy evaluation events
141
+ - allow/deny/resume lifecycle
142
+ - tests for allow / require approval / deny
143
+
144
+ Prohibited:
145
+
146
+ - Do not use prompt text to decide whether approval is required.
147
+ - Do not let the adapter implement the approval lifecycle itself.
148
+
149
+ Acceptance:
150
+
151
+ - EasyNet is not blocked by approval by default, and the complete real test passes.
152
+ - New approval tests demonstrate that the capability can be disabled and replaced.
153
+
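The allow / require approval / deny lifecycle from step 4 can be sketched with a typed policy and a small queue. The names `GovernanceDecision` and `ApprovalQueue` come from the deliverables list; their shapes here, along with `ToolPolicy`, `ApprovalItem`, and `evaluatePolicy`, are assumptions for illustration only.

```typescript
// Decisions are derived from typed policy data, never from prompt text.
type GovernanceDecision = "allow" | "requireApproval" | "deny";

// Hypothetical policy shape: explicit tool name lists.
interface ToolPolicy {
  denyTools: string[];
  approvalTools: string[];
}

interface ApprovalItem {
  approvalId: string;
  toolName: string;
  status: "pending" | "approved" | "denied";
}

// Deny takes precedence over approval; everything else is allowed.
function evaluatePolicy(policy: ToolPolicy, toolName: string): GovernanceDecision {
  if (policy.denyTools.includes(toolName)) return "deny";
  if (policy.approvalTools.includes(toolName)) return "requireApproval";
  return "allow";
}

class ApprovalQueue {
  private items: ApprovalItem[] = [];

  // Park a tool call until an operator resolves it.
  enqueue(toolName: string): ApprovalItem {
    const item: ApprovalItem = {
      approvalId: `approval-${this.items.length + 1}`,
      toolName,
      status: "pending",
    };
    this.items.push(item);
    return item;
  }

  // Resolve a pending item exactly once; returns false if it is unknown or already resolved.
  resolve(approvalId: string, approved: boolean): boolean {
    const item = this.items.find((i) => i.approvalId === approvalId && i.status === "pending");
    if (!item) return false;
    item.status = approved ? "approved" : "denied";
    return true;
  }

  pending(): ApprovalItem[] {
    return this.items.filter((i) => i.status === "pending");
  }
}
```

With empty `denyTools` and `approvalTools` lists every call evaluates to `allow`, which corresponds to the acceptance criterion that EasyNet is not blocked by approval by default.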
154
+ ### 5. DeepAgents Native Path Alignment
155
+
156
+ Goals:
157
+
158
+ - Make the native DeepAgents adapter the default design target.
159
+ - Pass `createDeepAgent` parameters through to upstream primitives wherever possible.
160
+
161
+ Deliverables:
162
+
163
+ - model config passthrough
164
+ - tools passthrough or gateway bridge
165
+ - subagents passthrough
166
+ - memory / skills passthrough
167
+ - upstream event normalization
168
+ - DeepAgents capability audit report
169
+
170
+ Prohibited:
171
+
172
+ - Do not rebuild the DeepAgents middleware stack.
173
+ - Do not implement a second subagent planning language.
174
+ - Do not replay upstream custom tool calls.
175
+
176
+ Acceptance:
177
+
178
+ - Native path tests pass.
179
+ - The EasyNet migration path still passes in full.
180
+
181
+ ### 6. Shrink the Compat Runner
182
+
183
+ Goals:
184
+
185
+ - Keep `runtime/compat` migration-only.
186
+ - Move whatever can be moved out into a native capability or upstream passthrough.
187
+
188
+ Deliverables:
189
+
190
+ - compat usage inventory
191
+ - migration blockers list
192
+ - compat-only behavior tags
193
+ - removal plan
194
+
195
+ Prohibited:
196
+
197
+ - Do not add native runtime features to the compat runner.
198
+ - Do not extend the compat API into a product API.
199
+
200
+ Acceptance:
201
+
202
+ - The complete real EasyNet test passes.
203
+ - The docs state the destination of every remaining compat behavior.
204
+
205
+ ### 7. Protocol Surface
206
+
207
+ Goals:
208
+
209
+ - HTTP / in-process / future ACP / A2A / AG-UI all call the same runtime contract.
210
+
211
+ Deliverables:
212
+
213
+ - request API
214
+ - event stream API
215
+ - run inspection API
216
+ - approval API
217
+ - artifact API
218
+ - protocol tests
219
+
220
+ Prohibited:
221
+
222
+ - The protocol layer does not execute agents.
223
+ - The protocol layer contains no backend or workspace business logic.
224
+
225
+ Acceptance:
226
+
227
+ - Protocol tests pass.
228
+ - The complete real EasyNet test passes.
229
+
230
+ ### 8. Replay / Evaluation
231
+
232
+ Goals:
233
+
234
+ - Base replay / eval on runtime events and artifacts instead of re-running prompt heuristics.
235
+
236
+ Deliverables:
237
+
238
+ - replay manifest
239
+ - evaluation fixture runner
240
+ - trace export
241
+ - artifact reference validation
242
+
243
+ Prohibited:
244
+
245
+ - Do not infer tool calls from natural-language output.
246
+ - Do not turn the assertions of a specific EasyNet case into runtime rules.
247
+
248
+ Acceptance:
249
+
250
+ - Replay tests pass.
251
+ - The complete real EasyNet test passes.
252
+
253
+ ### 9. Memory Lifecycle
254
+
255
+ Goals:
256
+
257
+ - The runtime manages the memory lifecycle without replacing backend-native memory semantics.
258
+
259
+ Deliverables:
260
+
261
+ - memory namespace contract
262
+ - recall coordination events
263
+ - import/export hooks
264
+ - compaction hooks
265
+ - tests with in-memory store
266
+
267
+ Prohibited:
268
+
269
+ - Do not fake a unified memory semantics that overrides DeepAgents native memory.
270
+ - Do not do business routing inside the memory capability.
271
+
272
+ Acceptance:
273
+
274
+ - Memory tests pass.
275
+ - The complete real EasyNet test passes.
276
+
277
+ ### 10. Native Stable Package Migration
278
+
279
+ Goals:
280
+
281
+ - Migrate EasyNet from the explicit compat facade to the stable native API.
282
+
283
+ Deliverables:
284
+
285
+ - EasyNet dependency migration plan
286
+ - native API usage examples
287
+ - compat fallback plan
288
+ - final migration test report
289
+
290
+ Prohibited:
291
+
292
+ - Do not weaken EasyNet contract tests for the sake of migration.
293
+ - Do not hide the JSON contract / specialist ownership / tool boundary.
294
+
295
+ Acceptance:
296
+
297
+ - EasyNet `npm test` passes.
298
+ - EasyNet `npm run test:botbotgo:full` passes.
299
+ - EasyNet no longer depends on compat-only APIs, or the final blockers are listed explicitly.
300
+
301
+ ## Overall Sequence Diagram
302
+
303
+ ```mermaid
304
+ sequenceDiagram
305
+ participant Dev as Developer
306
+ participant SH as stable-harness
307
+ participant EN as EasyNet
308
+ participant Model as Real Model
309
+ participant Tools as Real Tools/Data
310
+ participant Docs as docs/
311
+
312
+ Dev->>SH: Implement one runtime capability
313
+ SH->>SH: npm run check
314
+ SH->>SH: npm run check:rules
315
+ SH->>SH: npm test
316
+ Dev->>EN: npm test
317
+ EN->>Model: Real model calls
318
+ EN->>Tools: Real tool/data execution
319
+ Dev->>EN: npm run test:botbotgo:full
320
+ EN->>Model: Full matrix real model calls
321
+ EN->>Tools: Full matrix real tools/data
322
+ Dev->>Docs: Write Chinese step report
323
+ Docs->>Docs: Include sequence diagram and flow chart
324
+ Dev->>SH: Commit only after all gates pass
325
+ ```
326
+
327
+ ## Overall Flow Chart
328
+
329
+ ```mermaid
330
+ flowchart TD
331
+ A["Pick next runtime capability"] --> B["Classify boundary"]
332
+ B --> C{"Who owns the behavior?"}
333
+ C -->|Upstream framework| D["Adapter passthrough"]
334
+ C -->|Runtime/control plane| E["Typed stable capability"]
335
+ C -->|Downstream app| F["Workspace config/tool/test"]
336
+ C -->|Historical workaround| G["Compat only or delete"]
337
+ D --> H["Implement narrow change"]
338
+ E --> H
339
+ F --> H
340
+ G --> H
341
+ H --> I["Run stable-harness checks/tests"]
342
+ I --> J["Run EasyNet npm test"]
343
+ J --> K["Run EasyNet full botbotgo matrix"]
344
+ K --> L{"All real gates pass?"}
345
+ L -->|No| M["Fix at correct boundary"]
346
+ M --> B
347
+ L -->|Yes| N["Write Chinese docs report"]
348
+ N --> O["Commit and push"]
349
+ ```
350
+
351
+ ## Per-Step Report Template
352
+
353
+ After each step is complete, add the following under `docs/`:
354
+
355
+ ```text
356
+ 0.1.0-step-<number>-<short-name>.zh.md
357
+ ```
358
+
359
+ Template:
360
+
361
+ ````markdown
362
+ # Step <number>: <name>
363
+
364
+ ## Goal
365
+
366
+ ## Scope of Changes
367
+
368
+ ## Runtime Boundary Assessment
369
+
370
+ ## Implementation Details
371
+
372
+ ## EasyNet Real Tests
373
+
374
+ ### Commands
375
+
376
+ ### Results
377
+
378
+ ### Model and Data Notes
379
+
380
+ ## Risks and Open Issues
381
+
382
+ ## Sequence Diagram
383
+
384
+ ```mermaid
385
+ sequenceDiagram
386
+ ```
387
+
388
+ ## Flow Chart
389
+
390
+ ```mermaid
391
+ flowchart TD
392
+ ```
393
+ ````
package/docs/0.1.0-tool-guard-benchmark.zh.md ADDED
@@ -0,0 +1,42 @@
1
+ # 0.1.0 Tool Guard Benchmark
2
+
3
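+ Generated at: 2026-05-07T00:40:17.180Z
4
+
5
+ ## Test Setup
6
+
7
+ - Remote Ollama: `https://ollama-rtx-4070.easynet.world`
8
+ - Natural-case rounds per model: `10`; total natural cases: `50`
9
+ - The injected-error matrix covers: unknown tool, wrong tool name, missing required argument, wrong type, invalid enum value, extra arg, absolute path, semantically wrong ticker, and unparseable arguments
10
+ - This benchmark is a product-level fault-injection run plus a local BFCL-style subset; it is not an official BFCL score.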
11
+
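Several of the injected error classes above (unknown tool, missing required argument, wrong type, invalid enum value, extra arg) can be caught by a purely structural check before a tool call reaches the execution layer. The sketch below shows that idea under simplified assumptions; `ToolSpec`, `ArgSpec`, and `guardToolCall` are hypothetical and are not the package's `ToolArgumentGuard`.

```typescript
// Hypothetical, simplified tool schema; not the stable-harness or JSON Schema format.
type ArgType = "string" | "number" | "boolean";

interface ArgSpec {
  type: ArgType;
  required?: boolean;
  enum?: string[];
}

interface ToolSpec {
  name: string;
  args: { [argName: string]: ArgSpec };
}

interface GuardResult {
  ok: boolean;
  errors: string[]; // readable messages that could be handed back to the model
}

function guardToolCall(
  registry: ToolSpec[],
  toolName: string,
  args: { [argName: string]: unknown },
): GuardResult {
  const tool = registry.find((t) => t.name === toolName);
  if (tool === undefined) return { ok: false, errors: [`unknown tool: ${toolName}`] };

  const errors: string[] = [];
  for (const name of Object.keys(tool.args)) {
    const spec = tool.args[name];
    if (spec === undefined) continue;
    const value = args[name];
    if (value === undefined) {
      if (spec.required) errors.push(`missing required argument: ${name}`);
      continue;
    }
    if (typeof value !== spec.type) {
      errors.push(`argument ${name} must be ${spec.type}`);
    } else if (spec.enum !== undefined && !spec.enum.includes(String(value))) {
      errors.push(`argument ${name} must be one of: ${spec.enum.join(", ")}`);
    }
  }
  // Reject arguments the schema does not declare.
  for (const name of Object.keys(args)) {
    if (!(name in tool.args)) errors.push(`unexpected argument: ${name}`);
  }
  return { ok: errors.length === 0, errors };
}
```

Semantic checks such as a wrong ticker or an absolute path need tool-specific knowledge and are outside what this structural sketch covers.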
12
+ ## Natural Tool Calls
13
+
14
+ | Model | Repair | Natural cases | Exact | Baseline Accepted | Bad Exec without Guard | Bad Exec with Guard | Final Accepted |
15
+ | --- | --- | --- | --- | --- | --- | --- | --- |
16
+ | qwen3:0.6b | off | 50 | 80% | 80% | 20% | 0% | 80% |
17
+ | qwen3:0.6b | on | 50 | 80% | 80% | 20% | 0% | 80% |
18
+ | qwen3.5:0.8b | off | 50 | 100% | 100% | 0% | 0% | 100% |
19
+ | qwen3.5:0.8b | on | 50 | 100% | 100% | 0% | 0% | 100% |
20
+ | qwen3.5:2b | off | 50 | 100% | 100% | 0% | 0% | 100% |
21
+ | qwen3.5:2b | on | 50 | 100% | 100% | 0% | 0% | 100% |
22
+ | granite4.1:3b | off | 50 | 100% | 100% | 0% | 0% | 100% |
23
+ | granite4.1:3b | on | 50 | 100% | 100% | 0% | 0% | 100% |
24
+ | qwen3.5:4b | off | 50 | 100% | 100% | 0% | 0% | 100% |
25
+ | qwen3.5:4b | on | 50 | 100% | 100% | 0% | 0% | 100% |
26
+
27
+ ## Injected-Error Matrix
28
+
29
+ | Model | Injected errors blocked by Guard | Injected errors repaired | Error types covered |
30
+ | --- | --- | --- | --- |
31
+ | qwen3:0.6b | 100% | 66.7% | name, schema, type, semantic |
32
+ | qwen3.5:0.8b | 100% | 66.7% | name, schema, type, semantic |
33
+ | qwen3.5:2b | 100% | 100% | name, schema, type, semantic |
34
+ | granite4.1:3b | 100% | 100% | name, schema, type, semantic |
35
+ | qwen3.5:4b | 100% | 100% | name, schema, type, semantic |
36
+
37
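+ ## Conclusions
38
+
39
+ - The core benefit of the Guard is that it keeps bad tool calls out of the real execution layer; in this run, 100% of injected errors were blocked.
40
+ - In the natural output of `qwen3:0.6b`, 20% of cases were registered tool calls that would otherwise have executed incorrectly; with the Guard enabled, bad execution dropped from 20% to 0%.
41
+ - `qwen3.5:2b`, `granite4.1:3b`, and `qwen3.5:4b` reached a 100% single-round repair success rate on injected errors. This conclusion applies only to this benchmark's injected-error matrix.
42
+ - For `qwen3.5:0.8b` and larger, the baseline was already 100% on this round's natural cases, so there is no observable accepted-rate uplift in the natural scenario.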