open-research-protocol 0.4.6 → 0.4.7
package/README.md
CHANGED
|
@@ -23,6 +23,7 @@ verification remains independent of framing. See `modules/instruments/README.md`
|
|
|
23
23
|
- `docs/AGENT_LOOP.md` — canonical operating loop when an agent is the primary ORP user
|
|
24
24
|
- `docs/CANONICAL_CLI_BOUNDARY.md` — canonical source-of-truth boundary between CLI, Rust, and web
|
|
25
25
|
- `docs/ORP_REASONING_KERNEL_V0_1.md` — draft kernel model for turning loose intent into promotable canonical artifacts
|
|
26
|
+
- `docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md` — technical rationale, benchmarks, and alternatives analysis for the kernel
|
|
26
27
|
- `docs/EXTERNAL_CONTRIBUTION_GOVERNANCE.md` — canonical local-first workflow for external OSS PR work
|
|
27
28
|
- `docs/OSS_CONTRIBUTION_AGENT_LOOP.md` — agent operating rhythm for external contribution workflows
|
|
28
29
|
- `templates/` — claim, verification, failure, and issue templates
|
|
@@ -0,0 +1,353 @@
|
|
|
1
|
+
# ORP Reasoning Kernel Technical Validation
|
|
2
|
+
|
|
3
|
+
This document defines the ORP Reasoning Kernel in technical terms, explains
|
|
4
|
+
why ORP implements it this way, and records the initial validation evidence
|
|
5
|
+
for `v0.1`.
|
|
6
|
+
|
|
7
|
+
The supporting benchmark artifact for this document is:
|
|
8
|
+
|
|
9
|
+
- `docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json`
|
|
10
|
+
|
|
11
|
+
## 1. Definition
|
|
12
|
+
|
|
13
|
+
The ORP Reasoning Kernel is the typed artifact grammar and validation layer
|
|
14
|
+
used by ORP to move work from free-form intent into canonical repository
|
|
15
|
+
artifacts.
|
|
16
|
+
|
|
17
|
+
It operates in three roles:
|
|
18
|
+
|
|
19
|
+
1. Interpreter
|
|
20
|
+
Convert loose natural-language intent into a structured working shape.
|
|
21
|
+
2. Validator
|
|
22
|
+
Check whether a candidate artifact is complete enough to be trusted and
|
|
23
|
+
promoted.
|
|
24
|
+
3. Canonizer
|
|
25
|
+
Gate whether the artifact can become repository truth and show its
|
|
26
|
+
validation trace in ORP run output.
|
|
27
|
+
|
|
28
|
+
The kernel is implemented through:
|
|
29
|
+
|
|
30
|
+
- `spec/v1/kernel.schema.json`
|
|
31
|
+
- `orp kernel scaffold`
|
|
32
|
+
- `orp kernel validate`
|
|
33
|
+
- `structure_kernel` gate enforcement in `cli/orp.py`
|
|
34
|
+
|
|
35
|
+
## 2. What Problem It Solves
|
|
36
|
+
|
|
37
|
+
Without a kernel layer, ORP can still execute work, but repository truth tends
|
|
38
|
+
to drift into one of two bad states:
|
|
39
|
+
|
|
40
|
+
1. Chat soup
|
|
41
|
+
Important meaning lives in prompts and responses instead of canonical
|
|
42
|
+
artifacts.
|
|
43
|
+
2. Hidden agent structure
|
|
44
|
+
The agent may internally interpret a task well, but another human or agent
|
|
45
|
+
cannot inspect that structure or validate promotion.
|
|
46
|
+
|
|
47
|
+
The kernel addresses that by making promotable artifacts:
|
|
48
|
+
|
|
49
|
+
- typed
|
|
50
|
+
- minimally complete
|
|
51
|
+
- machine-checkable
|
|
52
|
+
- reusable in handoffs
|
|
53
|
+
- visible in run artifacts
|
|
54
|
+
|
|
55
|
+
## 3. Why This Kernel Instead Of Another Approach
|
|
56
|
+
|
|
57
|
+
### A. Why not free-form markdown or chat alone?
|
|
58
|
+
|
|
59
|
+
Free-form text is useful for ideation, but it does not reliably answer:
|
|
60
|
+
|
|
61
|
+
- what kind of artifact this is
|
|
62
|
+
- what minimum structure is present or missing
|
|
63
|
+
- what should block promotion
|
|
64
|
+
- what another operator can trust later
|
|
65
|
+
|
|
66
|
+
ORP keeps natural language at the boundary and adds structure at promotion.
|
|
67
|
+
|
|
68
|
+
### B. Why not require kernel-native syntax for all human input?
|
|
69
|
+
|
|
70
|
+
Because that damages usability and adoption.
|
|
71
|
+
|
|
72
|
+
Humans should be able to think in normal language. ORP should not require
|
|
73
|
+
every prompt to be authored as a rigid schema object before work can happen.
|
|
74
|
+
That is why the kernel is enforced at the artifact and gate layer rather than
|
|
75
|
+
as a hard input parser for every message.
|
|
76
|
+
|
|
77
|
+
### C. Why typed artifact classes instead of one generic checklist?
|
|
78
|
+
|
|
79
|
+
Because a task, a decision, and a hypothesis fail in different ways.
|
|
80
|
+
|
|
81
|
+
A single universal checklist loses semantic meaning. ORP therefore uses typed
|
|
82
|
+
artifact classes with different required fields:
|
|
83
|
+
|
|
84
|
+
- `task`
|
|
85
|
+
- `decision`
|
|
86
|
+
- `hypothesis`
|
|
87
|
+
- `experiment`
|
|
88
|
+
- `checkpoint`
|
|
89
|
+
- `policy`
|
|
90
|
+
- `result`
|
|
91
|
+
|
|
92
|
+
This is enough structure to be useful without forcing a heavyweight ontology.
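
The idea of class-specific required fields can be sketched as follows. This is a minimal illustration, not the ORP implementation: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the `task` field list is only inferred from the benchmark fixture in this release (a task carrying `object`, `goal`, and `boundary` is reported as missing `constraints` and `success_criteria`). The authoritative field sets are in `spec/v1/kernel.schema.json`.

```python
from __future__ import annotations

from typing import Any

# Assumption: illustrative field set inferred from the benchmark fixture.
# The real schema may require more fields, and defines the other six classes.
REQUIRED_FIELDS: dict[str, list[str]] = {
    "task": ["object", "goal", "boundary", "constraints", "success_criteria"],
}


def missing_fields(artifact: dict[str, Any]) -> list[str]:
    """Return the required fields that are absent or empty for the artifact's class."""
    artifact_class = artifact.get("artifact_class", "")
    if artifact_class not in REQUIRED_FIELDS:
        raise ValueError(f"unknown artifact_class: {artifact_class!r}")
    return [name for name in REQUIRED_FIELDS[artifact_class] if not artifact.get(name)]


# Same shape as the intentionally invalid task used by the benchmark harness.
invalid_task = {
    "schema_version": "1.0.0",
    "artifact_class": "task",
    "object": "terminal trace widget",
    "goal": "surface lane state and drift",
    "boundary": "terminal-first workflow",
}

print(missing_fields(invalid_task))  # ['constraints', 'success_criteria']
```

Keeping one field list per class means a `decision` or `hypothesis` can fail for its own reasons, which is what a single generic checklist cannot express.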
|
|
93
|
+
|
|
94
|
+
### D. Why not a domain-specific kernel for just software or just research?
|
|
95
|
+
|
|
96
|
+
Because ORP is meant to govern many kinds of work, not one domain.
|
|
97
|
+
|
|
98
|
+
The chosen artifact classes map across:
|
|
99
|
+
|
|
100
|
+
- software delivery
|
|
101
|
+
- research
|
|
102
|
+
- product design
|
|
103
|
+
- operations and reliability
|
|
104
|
+
- writing and knowledge work
|
|
105
|
+
- policy and governance work
|
|
106
|
+
|
|
107
|
+
### E. Why not a hidden agent-only kernel?
|
|
108
|
+
|
|
109
|
+
Because invisible structure cannot be audited.
|
|
110
|
+
|
|
111
|
+
If the agent interprets a request privately but the repository never records
|
|
112
|
+
that shape, then the kernel is not stabilizing truth. ORP instead writes
|
|
113
|
+
kernel validation into `RUN.json` and lets artifacts be validated directly
|
|
114
|
+
from the CLI.
|
|
115
|
+
|
|
116
|
+
### F. Why not a full ontology before shipping anything?
|
|
117
|
+
|
|
118
|
+
Because `v0.1` is meant to be operational, not metaphysical.
|
|
119
|
+
|
|
120
|
+
The current kernel is intentionally minimal:
|
|
121
|
+
|
|
122
|
+
- a small number of classes
|
|
123
|
+
- a small number of required fields
|
|
124
|
+
- explicit hard vs soft gate behavior
|
|
125
|
+
- compatibility with existing `structure_kernel` gates
|
|
126
|
+
|
|
127
|
+
That lowers rollout risk and makes the kernel easier to test and adopt.
|
|
128
|
+
|
|
129
|
+
## 4. The Current Technical Shape
|
|
130
|
+
|
|
131
|
+
### Artifact classes
|
|
132
|
+
|
|
133
|
+
The schema currently supports:
|
|
134
|
+
|
|
135
|
+
- `task`
|
|
136
|
+
- `decision`
|
|
137
|
+
- `hypothesis`
|
|
138
|
+
- `experiment`
|
|
139
|
+
- `checkpoint`
|
|
140
|
+
- `policy`
|
|
141
|
+
- `result`
|
|
142
|
+
|
|
143
|
+
Each class has a minimum required field set in:
|
|
144
|
+
|
|
145
|
+
- `spec/v1/kernel.schema.json`
|
|
146
|
+
- `cli/orp.py`
|
|
147
|
+
|
|
148
|
+
### CLI operations
|
|
149
|
+
|
|
150
|
+
The kernel currently exposes:
|
|
151
|
+
|
|
152
|
+
- `orp kernel scaffold`
|
|
153
|
+
- `orp kernel validate`
|
|
154
|
+
|
|
155
|
+
### Gate integration
|
|
156
|
+
|
|
157
|
+
ORP now treats `structure_kernel` as a real validation lane when a gate
|
|
158
|
+
declares a `kernel` block. That gives:
|
|
159
|
+
|
|
160
|
+
- `soft` mode
|
|
161
|
+
Validation issues are recorded but do not block the run.
|
|
162
|
+
- `hard` mode
|
|
163
|
+
Validation issues fail the gate and block promotion.
|
|
164
|
+
|
|
165
|
+
Legacy `structure_kernel` gates without explicit `kernel` configuration remain
|
|
166
|
+
compatible.
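
The three behaviors described above (hard, soft, and legacy) can be summarized in a small decision function. This is a hedged sketch: `gate_outcome` is a hypothetical helper, and the flat result shape is a simplification of what ORP writes to `RUN.json`. Only the field names `overall`, `kernel_validation`, and `valid` are taken from the benchmark report.

```python
from __future__ import annotations

from typing import Any, Optional


def gate_outcome(
    command_passed: bool,
    kernel_mode: Optional[str],
    kernel_valid: Optional[bool],
) -> dict[str, Any]:
    """Combine a gate command result with kernel validation.

    kernel_mode is None for a legacy gate that declares no `kernel` block.
    """
    result: dict[str, Any] = {"overall": "PASS" if command_passed else "FAIL"}
    if kernel_mode is None:
        # Legacy gate: no kernel validation is run or recorded.
        return result
    if kernel_mode not in ("soft", "hard"):
        raise ValueError(f"unsupported kernel mode: {kernel_mode!r}")
    # Both modes record the validation state for later inspection.
    result["kernel_validation"] = {"valid": bool(kernel_valid)}
    if kernel_mode == "hard" and not kernel_valid:
        # Only hard mode turns an invalid artifact into a gate failure.
        result["overall"] = "FAIL"
    return result


hard = gate_outcome(True, "hard", False)
soft = gate_outcome(True, "soft", False)
legacy = gate_outcome(True, None, None)
print(hard["overall"], soft["overall"], legacy["overall"])  # FAIL PASS PASS
```

The design point is that soft mode still records `valid: false`, so an advisory failure stays visible even though the run is not blocked.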
|
|
167
|
+
|
|
168
|
+
### Bootstrap behavior
|
|
169
|
+
|
|
170
|
+
`orp init` now seeds a starter task artifact at:
|
|
171
|
+
|
|
172
|
+
- `analysis/orp.kernel.task.yml`
|
|
173
|
+
|
|
174
|
+
and the default profile validates it in hard mode.
|
|
175
|
+
|
|
176
|
+
## 5. Benchmark And Validation Method
|
|
177
|
+
|
|
178
|
+
The repeatable harness is:
|
|
179
|
+
|
|
180
|
+
- `scripts/orp-kernel-benchmark.py`
|
|
181
|
+
|
|
182
|
+
The harness benchmarks and validates:
|
|
183
|
+
|
|
184
|
+
1. Bootstrap path
|
|
185
|
+
`orp init` -> starter artifact -> `orp kernel validate` -> `orp gate run`
|
|
186
|
+
2. Roundtrip path
|
|
187
|
+
`orp kernel scaffold` + `orp kernel validate` for every artifact class
|
|
188
|
+
3. Enforcement path
|
|
189
|
+
hard mode, soft mode, and legacy compatibility
|
|
190
|
+
|
|
191
|
+
The benchmark report was generated on:
|
|
192
|
+
|
|
193
|
+
- commit `5c87faf4fbd54d203cc0ca05683544355c306d55`
|
|
194
|
+
- package version `0.4.7`
|
|
195
|
+
- Python `3.9.6`
|
|
196
|
+
- Node `v24.10.0`
|
|
197
|
+
- `macOS-26.3-arm64-arm-64bit`
|
|
198
|
+
|
|
199
|
+
## 6. What The Benchmarks Show
|
|
200
|
+
|
|
201
|
+
### A. Bootstrap ergonomics
|
|
202
|
+
|
|
203
|
+
Reference run, 5 iterations:
|
|
204
|
+
|
|
205
|
+
- `orp init` mean: `245.853 ms`
|
|
206
|
+
- starter `orp kernel validate` mean: `169.097 ms`
|
|
207
|
+
- default `orp gate run` mean: `242.618 ms`
|
|
208
|
+
|
|
209
|
+
Interpretation:
|
|
210
|
+
|
|
211
|
+
- Kernel bootstrap is comfortably sub-second.
|
|
212
|
+
- The one-shot local developer experience is fast enough to be used in normal
|
|
213
|
+
repo workflow without feeling heavy.
|
|
214
|
+
- These timings include the real `node -> python CLI` invocation path, which is
|
|
215
|
+
the correct path to benchmark for npm-installed ORP use.
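
How "comfortably sub-second" is judged can be made concrete by comparing each mean against its target. The sketch below uses the means and targets recorded in the benchmark JSON shipped with this release. The helper names and the simplified dictionary keys are assumptions for illustration (the report itself uses keys such as `init_mean_lt_ms`).

```python
from __future__ import annotations

# Values copied from the benchmark report's `init_starter_kernel` section.
OBSERVED_MEAN_MS = {"init": 245.853, "validate": 169.097, "gate_run": 242.618}
TARGET_MEAN_LT_MS = {"init": 350.0, "validate": 200.0, "gate_run": 300.0}


def meets_targets(observed: dict[str, float], targets: dict[str, float]) -> dict[str, bool]:
    """A step meets its target when the observed mean is strictly below it."""
    return {name: observed[name] < targets[name] for name in targets}


def headroom_ms(observed: dict[str, float], targets: dict[str, float]) -> dict[str, float]:
    """Remaining budget per step, in milliseconds."""
    return {name: round(targets[name] - observed[name], 3) for name in targets}


print(meets_targets(OBSERVED_MEAN_MS, TARGET_MEAN_LT_MS))
print(headroom_ms(OBSERVED_MEAN_MS, TARGET_MEAN_LT_MS))
```

The headroom view is useful because `validate` sits closest to its 200 ms budget, so it is the first step likely to regress past its target.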
|
|
216
|
+
|
|
217
|
+
### B. Roundtrip across all artifact classes
|
|
218
|
+
|
|
219
|
+
All seven artifact classes scaffolded and validated successfully.
|
|
220
|
+
|
|
221
|
+
Observed means:
|
|
222
|
+
|
|
223
|
+
- scaffold mean: `163.621 ms`
|
|
224
|
+
- validate mean: `162.630 ms`
|
|
225
|
+
|
|
226
|
+
Interpretation:
|
|
227
|
+
|
|
228
|
+
- The kernel is not only task-shaped.
|
|
229
|
+
- The CLI surface is already general enough for multiple project artifact
|
|
230
|
+
types.
|
|
231
|
+
|
|
232
|
+
### C. Enforcement semantics
|
|
233
|
+
|
|
234
|
+
Reference single-run timings:
|
|
235
|
+
|
|
236
|
+
- hard mode invalid artifact: `174.339 ms`, `FAIL`
|
|
237
|
+
- soft mode invalid artifact: `173.082 ms`, `PASS` with advisory invalid state
|
|
238
|
+
- legacy compatibility gate: `172.431 ms`, `PASS` without `kernel_validation`
|
|
239
|
+
|
|
240
|
+
Interpretation:
|
|
241
|
+
|
|
242
|
+
- hard mode and soft mode are enforced and testable
|
|
243
|
+
- existing `structure_kernel` surfaces do not regress when no explicit kernel
|
|
244
|
+
config is present
|
|
245
|
+
|
|
246
|
+
## 7. Claims And Evidence
|
|
247
|
+
|
|
248
|
+
The benchmark report records five claims, all currently passing:
|
|
249
|
+
|
|
250
|
+
1. `starter_kernel_bootstrap`
|
|
251
|
+
ORP seeds a valid starter artifact and a passing default kernel gate.
|
|
252
|
+
2. `typed_artifact_roundtrip`
|
|
253
|
+
All seven artifact classes scaffold and validate successfully.
|
|
254
|
+
3. `promotion_enforcement_modes`
|
|
255
|
+
Hard mode blocks invalid artifacts; soft mode records advisory invalidity.
|
|
256
|
+
4. `legacy_structure_kernel_compatibility`
|
|
257
|
+
Older `structure_kernel` gates remain compatible.
|
|
258
|
+
5. `local_cli_kernel_ergonomics`
|
|
259
|
+
One-shot kernel operations remain within human-scale local latency
|
|
260
|
+
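   thresholds on the reference machine (mean targets: `orp init` under 350 ms, `orp gate run` under 300 ms, scaffold and validate under 200 ms).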
|
|
261
|
+
|
|
262
|
+
These claims are backed by:
|
|
263
|
+
|
|
264
|
+
- `tests/test_orp_kernel.py`
|
|
265
|
+
- `tests/test_orp_init.py`
|
|
266
|
+
- `tests/test_orp_kernel_benchmark.py`
|
|
267
|
+
- `docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json`
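
A reader who wants to confirm the "all currently passing" statement can check the report mechanically. The sketch below parses a trimmed excerpt of the report rather than the file itself, so it runs on its own. The excerpt keeps only the `claims` and `summary` fields, and `failing_claims` is a hypothetical helper, not part of the ORP CLI.

```python
from __future__ import annotations

import json
from typing import Any

# Trimmed excerpt of the validation report: claim ids and statuses only.
REPORT_EXCERPT = """
{
  "claims": [
    {"id": "starter_kernel_bootstrap", "status": "pass"},
    {"id": "typed_artifact_roundtrip", "status": "pass"},
    {"id": "promotion_enforcement_modes", "status": "pass"},
    {"id": "legacy_structure_kernel_compatibility", "status": "pass"},
    {"id": "local_cli_kernel_ergonomics", "status": "pass"}
  ],
  "summary": {"all_claims_pass": true}
}
"""


def failing_claims(report: dict[str, Any]) -> list[str]:
    """Return the ids of claims whose status is anything other than 'pass'."""
    return [claim["id"] for claim in report["claims"] if claim["status"] != "pass"]


report = json.loads(REPORT_EXCERPT)
failed = failing_claims(report)
# The summary flag should agree with the per-claim statuses.
summary_consistent = (not failed) == report["summary"]["all_claims_pass"]
print(len(report["claims"]), failed, summary_consistent)  # 5 [] True
```

To run the same check against the real artifact, replace `REPORT_EXCERPT` with the contents of the JSON file listed above.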
|
|
268
|
+
|
|
269
|
+
## 8. Why This Applies To All Project Types
|
|
270
|
+
|
|
271
|
+
The kernel is not a software-only mechanism. It is a project-structure
|
|
272
|
+
mechanism.
|
|
273
|
+
|
|
274
|
+
### Software
|
|
275
|
+
|
|
276
|
+
- feature task
|
|
277
|
+
- architectural decision
|
|
278
|
+
- release policy
|
|
279
|
+
- implementation result
|
|
280
|
+
|
|
281
|
+
### Research
|
|
282
|
+
|
|
283
|
+
- hypothesis
|
|
284
|
+
- experiment
|
|
285
|
+
- result
|
|
286
|
+
- checkpoint
|
|
287
|
+
|
|
288
|
+
### Product and design
|
|
289
|
+
|
|
290
|
+
- task
|
|
291
|
+
- decision
|
|
292
|
+
- experiment
|
|
293
|
+
- result
|
|
294
|
+
|
|
295
|
+
### Operations and reliability
|
|
296
|
+
|
|
297
|
+
- policy
|
|
298
|
+
- checkpoint
|
|
299
|
+
- result
|
|
300
|
+
- task
|
|
301
|
+
|
|
302
|
+
### Writing and knowledge work
|
|
303
|
+
|
|
304
|
+
- task
|
|
305
|
+
- decision
|
|
306
|
+
- hypothesis
|
|
307
|
+
- result
|
|
308
|
+
|
|
309
|
+
The kernel applies because most serious projects need the same underlying
|
|
310
|
+
capabilities:
|
|
311
|
+
|
|
312
|
+
- define the object of work
|
|
313
|
+
- define boundaries and constraints
|
|
314
|
+
- promote only sufficiently structured truth
|
|
315
|
+
- preserve handoff-quality artifacts
|
|
316
|
+
|
|
317
|
+
## 9. Limits Of v0.1
|
|
318
|
+
|
|
319
|
+
The current kernel validates structural sufficiency, not semantic truth.
|
|
320
|
+
|
|
321
|
+
It can tell us:
|
|
322
|
+
|
|
323
|
+
- whether required fields are present
|
|
324
|
+
- whether an artifact is typed correctly
|
|
325
|
+
- whether promotion rules are satisfied
|
|
326
|
+
- whether a gate should block or advise
|
|
327
|
+
|
|
328
|
+
It cannot tell us:
|
|
329
|
+
|
|
330
|
+
- whether the task is strategically wise
|
|
331
|
+
- whether a hypothesis is scientifically correct
|
|
332
|
+
- whether the interpretation of a result is sound
|
|
333
|
+
- whether the chosen artifact class was the best possible framing
|
|
334
|
+
|
|
335
|
+
That is an acceptable `v0.1` limitation. ORP is not trying to ship a truth
|
|
336
|
+
oracle. It is shipping a minimum structure standard for canonical work.
|
|
337
|
+
|
|
338
|
+
## 10. Bottom Line
|
|
339
|
+
|
|
340
|
+
The ORP Reasoning Kernel is technically justified because it gives ORP a
|
|
341
|
+
repeatable, inspectable, and enforceable way to turn natural-language project
|
|
342
|
+
intent into typed canonical artifacts.
|
|
343
|
+
|
|
344
|
+
The current evidence supports that claim:
|
|
345
|
+
|
|
346
|
+
- it boots cleanly in new repos
|
|
347
|
+
- it works across all current artifact classes
|
|
348
|
+
- it enforces hard vs soft promotion semantics correctly
|
|
349
|
+
- it preserves compatibility with pre-kernel `structure_kernel` gates
|
|
350
|
+
- it stays within human-scale local CLI latency targets
|
|
351
|
+
|
|
352
|
+
That makes it a good `v0.1` kernel: minimal, general, validated, and already
|
|
353
|
+
useful.
|
|
@@ -11,6 +11,10 @@ The ORP Reasoning Kernel is the artifact-shaping grammar that interprets
|
|
|
11
11
|
intent, validates structure, and governs promotion into canonical repository
|
|
12
12
|
truth.
|
|
13
13
|
|
|
14
|
+
For the supporting benchmark evidence and alternatives analysis behind this
|
|
15
|
+
design, see
|
|
16
|
+
`docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md`.
|
|
17
|
+
|
|
14
18
|
It should make three things true at once:
|
|
15
19
|
|
|
16
20
|
- humans can speak naturally at the boundary
|
|
@@ -0,0 +1,197 @@
|
|
|
1
|
+
{
|
|
2
|
+
"schema_version": "1.0.0",
|
|
3
|
+
"kind": "orp_reasoning_kernel_validation_report",
|
|
4
|
+
"metadata": {
|
|
5
|
+
"generated_at_utc": "2026-03-23T04:42:53Z",
|
|
6
|
+
"repo_commit": "5c87faf4fbd54d203cc0ca05683544355c306d55",
|
|
7
|
+
"repo_branch": "main",
|
|
8
|
+
"package_version": "0.4.7",
|
|
9
|
+
"python_version": "3.9.6",
|
|
10
|
+
"node_version": "v24.10.0",
|
|
11
|
+
"platform": "macOS-26.3-arm64-arm-64bit"
|
|
12
|
+
},
|
|
13
|
+
"benchmarks": {
|
|
14
|
+
"init_starter_kernel": {
|
|
15
|
+
"iterations": 5,
|
|
16
|
+
"observed": {
|
|
17
|
+
"init": {
|
|
18
|
+
"mean_ms": 245.853,
|
|
19
|
+
"median_ms": 242.029,
|
|
20
|
+
"min_ms": 239.454,
|
|
21
|
+
"max_ms": 257.57
|
|
22
|
+
},
|
|
23
|
+
"validate": {
|
|
24
|
+
"mean_ms": 169.097,
|
|
25
|
+
"median_ms": 167.938,
|
|
26
|
+
"min_ms": 165.273,
|
|
27
|
+
"max_ms": 173.245
|
|
28
|
+
},
|
|
29
|
+
"gate_run": {
|
|
30
|
+
"mean_ms": 242.618,
|
|
31
|
+
"median_ms": 239.599,
|
|
32
|
+
"min_ms": 238.174,
|
|
33
|
+
"max_ms": 252.913
|
|
34
|
+
}
|
|
35
|
+
},
|
|
36
|
+
"targets": {
|
|
37
|
+
"init_mean_lt_ms": 350.0,
|
|
38
|
+
"validate_mean_lt_ms": 200.0,
|
|
39
|
+
"gate_mean_lt_ms": 300.0
|
|
40
|
+
},
|
|
41
|
+
"meets_targets": {
|
|
42
|
+
"init": true,
|
|
43
|
+
"validate": true,
|
|
44
|
+
"gate_run": true
|
|
45
|
+
},
|
|
46
|
+
"sample_run_records": [
|
|
47
|
+
"orp/artifacts/run-20260323-044247-956825/RUN.json",
|
|
48
|
+
"orp/artifacts/run-20260323-044248-621472/RUN.json"
|
|
49
|
+
]
|
|
50
|
+
},
|
|
51
|
+
"artifact_roundtrip": {
|
|
52
|
+
"artifact_classes_total": 7,
|
|
53
|
+
"rows": [
|
|
54
|
+
{
|
|
55
|
+
"artifact_class": "task",
|
|
56
|
+
"scaffold_ms": 162.963,
|
|
57
|
+
"validate_ms": 161.02
|
|
58
|
+
},
|
|
59
|
+
{
|
|
60
|
+
"artifact_class": "decision",
|
|
61
|
+
"scaffold_ms": 162.639,
|
|
62
|
+
"validate_ms": 161.466
|
|
63
|
+
},
|
|
64
|
+
{
|
|
65
|
+
"artifact_class": "hypothesis",
|
|
66
|
+
"scaffold_ms": 162.337,
|
|
67
|
+
"validate_ms": 165.228
|
|
68
|
+
},
|
|
69
|
+
{
|
|
70
|
+
"artifact_class": "experiment",
|
|
71
|
+
"scaffold_ms": 171.011,
|
|
72
|
+
"validate_ms": 160.825
|
|
73
|
+
},
|
|
74
|
+
{
|
|
75
|
+
"artifact_class": "checkpoint",
|
|
76
|
+
"scaffold_ms": 161.705,
|
|
77
|
+
"validate_ms": 163.51
|
|
78
|
+
},
|
|
79
|
+
{
|
|
80
|
+
"artifact_class": "policy",
|
|
81
|
+
"scaffold_ms": 160.807,
|
|
82
|
+
"validate_ms": 163.85
|
|
83
|
+
},
|
|
84
|
+
{
|
|
85
|
+
"artifact_class": "result",
|
|
86
|
+
"scaffold_ms": 163.882,
|
|
87
|
+
"validate_ms": 162.509
|
|
88
|
+
}
|
|
89
|
+
],
|
|
90
|
+
"observed": {
|
|
91
|
+
"scaffold": {
|
|
92
|
+
"mean_ms": 163.621,
|
|
93
|
+
"median_ms": 162.639,
|
|
94
|
+
"min_ms": 160.807,
|
|
95
|
+
"max_ms": 171.011
|
|
96
|
+
},
|
|
97
|
+
"validate": {
|
|
98
|
+
"mean_ms": 162.63,
|
|
99
|
+
"median_ms": 162.509,
|
|
100
|
+
"min_ms": 160.825,
|
|
101
|
+
"max_ms": 165.228
|
|
102
|
+
}
|
|
103
|
+
},
|
|
104
|
+
"targets": {
|
|
105
|
+
"scaffold_mean_lt_ms": 200.0,
|
|
106
|
+
"validate_mean_lt_ms": 200.0
|
|
107
|
+
},
|
|
108
|
+
"meets_targets": {
|
|
109
|
+
"scaffold": true,
|
|
110
|
+
"validate": true
|
|
111
|
+
}
|
|
112
|
+
},
|
|
113
|
+
"gate_modes": {
|
|
114
|
+
"hard_mode": {
|
|
115
|
+
"ms": 174.339,
|
|
116
|
+
"exit_code": 1,
|
|
117
|
+
"overall": "FAIL",
|
|
118
|
+
"kernel_valid": false,
|
|
119
|
+
"missing_fields": [
|
|
120
|
+
"constraints",
|
|
121
|
+
"success_criteria"
|
|
122
|
+
]
|
|
123
|
+
},
|
|
124
|
+
"soft_mode": {
|
|
125
|
+
"ms": 173.082,
|
|
126
|
+
"exit_code": 0,
|
|
127
|
+
"overall": "PASS",
|
|
128
|
+
"kernel_valid": false
|
|
129
|
+
},
|
|
130
|
+
"legacy_compatibility": {
|
|
131
|
+
"ms": 172.431,
|
|
132
|
+
"exit_code": 0,
|
|
133
|
+
"overall": "PASS",
|
|
134
|
+
"has_kernel_validation": false
|
|
135
|
+
},
|
|
136
|
+
"meets_expectations": {
|
|
137
|
+
"hard_blocks_invalid_artifact": true,
|
|
138
|
+
"soft_allows_invalid_artifact_with_advisory": true,
|
|
139
|
+
"legacy_structure_kernel_remains_compatible": true
|
|
140
|
+
}
|
|
141
|
+
}
|
|
142
|
+
},
|
|
143
|
+
"claims": [
|
|
144
|
+
{
|
|
145
|
+
"id": "starter_kernel_bootstrap",
|
|
146
|
+
"claim": "orp init seeds a valid starter kernel artifact and a passing default structure_kernel gate.",
|
|
147
|
+
"status": "pass",
|
|
148
|
+
"evidence": [
|
|
149
|
+
"benchmarks.init_starter_kernel",
|
|
150
|
+
"cli/orp.py",
|
|
151
|
+
"tests/test_orp_init.py"
|
|
152
|
+
]
|
|
153
|
+
},
|
|
154
|
+
{
|
|
155
|
+
"id": "typed_artifact_roundtrip",
|
|
156
|
+
"claim": "All seven v0.1 artifact classes can be scaffolded and validated through the CLI.",
|
|
157
|
+
"status": "pass",
|
|
158
|
+
"evidence": [
|
|
159
|
+
"benchmarks.artifact_roundtrip",
|
|
160
|
+
"spec/v1/kernel.schema.json",
|
|
161
|
+
"tests/test_orp_kernel.py"
|
|
162
|
+
]
|
|
163
|
+
},
|
|
164
|
+
{
|
|
165
|
+
"id": "promotion_enforcement_modes",
|
|
166
|
+
"claim": "Hard mode blocks invalid promotable artifacts, while soft mode records advisory issues without blocking.",
|
|
167
|
+
"status": "pass",
|
|
168
|
+
"evidence": [
|
|
169
|
+
"benchmarks.gate_modes",
|
|
170
|
+
"tests/test_orp_kernel.py"
|
|
171
|
+
]
|
|
172
|
+
},
|
|
173
|
+
{
|
|
174
|
+
"id": "legacy_structure_kernel_compatibility",
|
|
175
|
+
"claim": "Existing structure_kernel gates without explicit kernel config remain compatible.",
|
|
176
|
+
"status": "pass",
|
|
177
|
+
"evidence": [
|
|
178
|
+
"benchmarks.gate_modes",
|
|
179
|
+
"cli/orp.py"
|
|
180
|
+
]
|
|
181
|
+
},
|
|
182
|
+
{
|
|
183
|
+
"id": "local_cli_kernel_ergonomics",
|
|
184
|
+
"claim": "One-shot kernel CLI operations remain within human-scale local ergonomics targets on the reference machine.",
|
|
185
|
+
"status": "pass",
|
|
186
|
+
"evidence": [
|
|
187
|
+
"benchmarks.init_starter_kernel",
|
|
188
|
+
"benchmarks.artifact_roundtrip"
|
|
189
|
+
]
|
|
190
|
+
}
|
|
191
|
+
],
|
|
192
|
+
"summary": {
|
|
193
|
+
"all_claims_pass": true,
|
|
194
|
+
"artifact_classes_total": 7,
|
|
195
|
+
"all_performance_targets_met": true
|
|
196
|
+
}
|
|
197
|
+
}
|
package/package.json
CHANGED
|
@@ -0,0 +1,452 @@
|
|
|
1
|
+
#!/usr/bin/env python3
|
|
2
|
+
from __future__ import annotations
|
|
3
|
+
|
|
4
|
+
import argparse
|
|
5
|
+
import json
|
|
6
|
+
from pathlib import Path
|
|
7
|
+
import platform
|
|
8
|
+
import statistics
|
|
9
|
+
import subprocess
|
|
10
|
+
import sys
|
|
11
|
+
import tempfile
|
|
12
|
+
import time
|
|
13
|
+
from typing import Any
|
|
14
|
+
|
|
15
|
+
|
|
16
|
+
REPO_ROOT = Path(__file__).resolve().parents[1]
|
|
17
|
+
CLI = ["node", "bin/orp.js"]
|
|
18
|
+
ARTIFACT_CLASSES = [
|
|
19
|
+
"task",
|
|
20
|
+
"decision",
|
|
21
|
+
"hypothesis",
|
|
22
|
+
"experiment",
|
|
23
|
+
"checkpoint",
|
|
24
|
+
"policy",
|
|
25
|
+
"result",
|
|
26
|
+
]
|
|
27
|
+
|
|
28
|
+
|
|
29
|
+
def _run(
|
|
30
|
+
args: list[str],
|
|
31
|
+
*,
|
|
32
|
+
cwd: Path = REPO_ROOT,
|
|
33
|
+
check: bool = True,
|
|
34
|
+
) -> subprocess.CompletedProcess[str]:
|
|
35
|
+
proc = subprocess.run(
|
|
36
|
+
args,
|
|
37
|
+
cwd=str(cwd),
|
|
38
|
+
capture_output=True,
|
|
39
|
+
text=True,
|
|
40
|
+
)
|
|
41
|
+
if check and proc.returncode != 0:
|
|
42
|
+
raise RuntimeError(
|
|
43
|
+
f"command failed: {' '.join(args)}\nstdout:\n{proc.stdout}\nstderr:\n{proc.stderr}"
|
|
44
|
+
)
|
|
45
|
+
return proc
|
|
46
|
+
|
|
47
|
+
|
|
48
|
+
def _run_orp(repo_root: Path, *args: str, check: bool = True) -> subprocess.CompletedProcess[str]:
|
|
49
|
+
return _run([*CLI, "--repo-root", str(repo_root), *args], check=check)
|
|
50
|
+
|
|
51
|
+
|
|
52
|
+
def _timed_orp(repo_root: Path, *args: str, check: bool = True) -> tuple[float, subprocess.CompletedProcess[str]]:
|
|
53
|
+
started = time.perf_counter()
|
|
54
|
+
proc = _run_orp(repo_root, *args, check=check)
|
|
55
|
+
return (time.perf_counter() - started) * 1000.0, proc
|
|
56
|
+
|
|
57
|
+
|
|
58
|
+
def _write_json(path: Path, payload: dict[str, Any]) -> None:
|
|
59
|
+
path.parent.mkdir(parents=True, exist_ok=True)
|
|
60
|
+
path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
|
|
61
|
+
|
|
62
|
+
|
|
63
|
+
def _stats(values: list[float]) -> dict[str, float]:
|
|
64
|
+
return {
|
|
65
|
+
"mean_ms": round(statistics.mean(values), 3),
|
|
66
|
+
"median_ms": round(statistics.median(values), 3),
|
|
67
|
+
"min_ms": round(min(values), 3),
|
|
68
|
+
"max_ms": round(max(values), 3),
|
|
69
|
+
}
|
|
70
|
+
|
|
71
|
+
|
|
72
|
+
def _benchmark_init_starter(iterations: int) -> dict[str, Any]:
|
|
73
|
+
init_times: list[float] = []
|
|
74
|
+
validate_times: list[float] = []
|
|
75
|
+
gate_times: list[float] = []
|
|
76
|
+
run_records: list[str] = []
|
|
77
|
+
|
|
78
|
+
for _ in range(iterations):
|
|
79
|
+
with tempfile.TemporaryDirectory(prefix="orp-kernel-bench-init.") as td:
|
|
80
|
+
root = Path(td)
|
|
81
|
+
_run(["git", "init", str(root)])
|
|
82
|
+
init_ms, init_proc = _timed_orp(root, "init", "--json")
|
|
83
|
+
init_payload = json.loads(init_proc.stdout)
|
|
84
|
+
validate_ms, validate_proc = _timed_orp(
|
|
85
|
+
root, "kernel", "validate", "analysis/orp.kernel.task.yml", "--json"
|
|
86
|
+
)
|
|
87
|
+
validate_payload = json.loads(validate_proc.stdout)
|
|
88
|
+
gate_ms, gate_proc = _timed_orp(root, "gate", "run", "--profile", "default", "--json")
|
|
89
|
+
gate_payload = json.loads(gate_proc.stdout)
|
|
90
|
+
|
|
91
|
+
if not init_payload.get("ok"):
|
|
92
|
+
raise RuntimeError("orp init benchmark did not report ok=true")
|
|
93
|
+
if not validate_payload.get("ok"):
|
|
94
|
+
raise RuntimeError("starter kernel validate benchmark did not report ok=true")
|
|
95
|
+
if gate_payload.get("overall") != "PASS":
|
|
96
|
+
raise RuntimeError("starter kernel gate benchmark did not pass")
|
|
97
|
+
|
|
98
|
+
init_times.append(init_ms)
|
|
99
|
+
validate_times.append(validate_ms)
|
|
100
|
+
gate_times.append(gate_ms)
|
|
101
|
+
run_records.append(gate_payload["run_record"])
|
|
102
|
+
|
|
103
|
+
targets = {
|
|
104
|
+
"init_mean_lt_ms": 350.0,
|
|
105
|
+
"validate_mean_lt_ms": 200.0,
|
|
106
|
+
"gate_mean_lt_ms": 300.0,
|
|
107
|
+
}
|
|
108
|
+
observed = {
|
|
109
|
+
"init": _stats(init_times),
|
|
110
|
+
"validate": _stats(validate_times),
|
|
111
|
+
"gate_run": _stats(gate_times),
|
|
112
|
+
}
|
|
113
|
+
return {
|
|
114
|
+
"iterations": iterations,
|
|
115
|
+
"observed": observed,
|
|
116
|
+
"targets": targets,
|
|
117
|
+
"meets_targets": {
|
|
118
|
+
"init": observed["init"]["mean_ms"] < targets["init_mean_lt_ms"],
|
|
119
|
+
"validate": observed["validate"]["mean_ms"] < targets["validate_mean_lt_ms"],
|
|
120
|
+
"gate_run": observed["gate_run"]["mean_ms"] < targets["gate_mean_lt_ms"],
|
|
121
|
+
},
|
|
122
|
+
"sample_run_records": run_records[:2],
|
|
123
|
+
}
|
|
124
|
+
|
|
125
|
+
|
|
126
|
+
def _benchmark_artifact_roundtrip() -> dict[str, Any]:
|
|
127
|
+
rows: list[dict[str, Any]] = []
|
|
128
|
+
scaffold_times: list[float] = []
|
|
129
|
+
validate_times: list[float] = []
|
|
130
|
+
|
|
131
|
+
for artifact_class in ARTIFACT_CLASSES:
|
|
132
|
+
with tempfile.TemporaryDirectory(prefix=f"orp-kernel-bench-{artifact_class}.") as td:
|
|
133
|
+
root = Path(td)
|
|
134
|
+
path = f"analysis/{artifact_class}.kernel.yml"
|
|
135
|
+
scaffold_ms, scaffold_proc = _timed_orp(
|
|
136
|
+
root,
|
|
137
|
+
"kernel",
|
|
138
|
+
"scaffold",
|
|
139
|
+
"--artifact-class",
|
|
140
|
+
artifact_class,
|
|
141
|
+
"--out",
|
|
142
|
+
path,
|
|
143
|
+
"--name",
|
|
144
|
+
f"{artifact_class} benchmark",
|
|
145
|
+
"--json",
|
|
146
|
+
)
|
|
147
|
+
validate_ms, validate_proc = _timed_orp(root, "kernel", "validate", path, "--json")
|
|
148
|
+
scaffold_payload = json.loads(scaffold_proc.stdout)
|
|
149
|
+
validate_payload = json.loads(validate_proc.stdout)
|
|
150
|
+
if not scaffold_payload.get("ok") or not validate_payload.get("ok"):
|
|
151
|
+
raise RuntimeError(f"roundtrip benchmark failed for artifact_class={artifact_class}")
|
|
152
|
+
scaffold_times.append(scaffold_ms)
|
|
153
|
+
validate_times.append(validate_ms)
|
|
154
|
+
rows.append(
|
|
155
|
+
{
|
|
156
|
+
"artifact_class": artifact_class,
|
|
157
|
+
"scaffold_ms": round(scaffold_ms, 3),
|
|
158
|
+
"validate_ms": round(validate_ms, 3),
|
|
159
|
+
}
|
|
160
|
+
)
|
|
161
|
+
|
|
162
|
+
observed = {
|
|
163
|
+
"scaffold": _stats(scaffold_times),
|
|
164
|
+
"validate": _stats(validate_times),
|
|
165
|
+
}
|
|
166
|
+
targets = {
|
|
167
|
+
"scaffold_mean_lt_ms": 200.0,
|
|
168
|
+
"validate_mean_lt_ms": 200.0,
|
|
169
|
+
}
|
|
170
|
+
return {
|
|
171
|
+
"artifact_classes_total": len(rows),
|
|
172
|
+
"rows": rows,
|
|
173
|
+
"observed": observed,
|
|
174
|
+
"targets": targets,
|
|
175
|
+
"meets_targets": {
|
|
176
|
+
"scaffold": observed["scaffold"]["mean_ms"] < targets["scaffold_mean_lt_ms"],
|
|
177
|
+
"validate": observed["validate"]["mean_ms"] < targets["validate_mean_lt_ms"],
|
|
178
|
+
},
|
|
179
|
+
}
|
|
180
|
+
|
|
181
|
+
|
|
182
|
+
def _benchmark_gate_modes() -> dict[str, Any]:
|
|
183
|
+
with tempfile.TemporaryDirectory(prefix="orp-kernel-bench-gates.") as td:
|
|
184
|
+
root = Path(td)
|
|
185
|
+
_write_json(
|
|
186
|
+
root / "analysis" / "invalid-task.kernel.json",
|
|
187
|
+
{
|
|
188
|
+
"schema_version": "1.0.0",
|
|
189
|
+
"artifact_class": "task",
|
|
190
|
+
"object": "terminal trace widget",
|
|
191
|
+
"goal": "surface lane state and drift",
|
|
192
|
+
"boundary": "terminal-first workflow",
|
|
193
|
+
},
|
|
194
|
+
)
|
|
195
|
+
_write_json(
|
|
196
|
+
root / "orp.kernel.bench.json",
|
|
197
|
+
{
|
|
198
|
+
"profiles": {
|
|
199
|
+
"hard": {
|
|
200
|
+
"description": "hard kernel gate",
|
|
201
|
+
"mode": "test",
|
|
202
|
+
"packet_kind": "problem_scope",
|
|
203
|
+
"gate_ids": ["kernel_hard"],
|
|
204
|
+
},
|
|
205
|
+
"soft": {
|
|
206
|
+
"description": "soft kernel gate",
|
|
207
|
+
"mode": "test",
|
|
208
|
+
"packet_kind": "problem_scope",
|
|
209
|
+
"gate_ids": ["kernel_soft"],
|
|
+                    },
+                    "legacy": {
+                        "description": "legacy structure kernel gate",
+                        "mode": "test",
+                        "packet_kind": "problem_scope",
+                        "gate_ids": ["kernel_legacy"],
+                    },
+                },
+                "gates": [
+                    {
+                        "id": "kernel_hard",
+                        "phase": "structure_kernel",
+                        "command": "true",
+                        "pass": {"exit_codes": [0]},
+                        "kernel": {
+                            "mode": "hard",
+                            "artifacts": [
+                                {
+                                    "path": "analysis/invalid-task.kernel.json",
+                                    "artifact_class": "task",
+                                }
+                            ],
+                        },
+                    },
+                    {
+                        "id": "kernel_soft",
+                        "phase": "structure_kernel",
+                        "command": "true",
+                        "pass": {"exit_codes": [0]},
+                        "kernel": {
+                            "mode": "soft",
+                            "artifacts": [
+                                {
+                                    "path": "analysis/invalid-task.kernel.json",
+                                    "artifact_class": "task",
+                                }
+                            ],
+                        },
+                    },
+                    {
+                        "id": "kernel_legacy",
+                        "phase": "structure_kernel",
+                        "command": "true",
+                        "pass": {"exit_codes": [0]},
+                    },
+                ],
+            },
+        )
+
+        hard_ms, hard_proc = _timed_orp(
+            root,
+            "--config",
+            "orp.kernel.bench.json",
+            "gate",
+            "run",
+            "--profile",
+            "hard",
+            "--json",
+            check=False,
+        )
+        soft_ms, soft_proc = _timed_orp(
+            root,
+            "--config",
+            "orp.kernel.bench.json",
+            "gate",
+            "run",
+            "--profile",
+            "soft",
+            "--json",
+        )
+        legacy_ms, legacy_proc = _timed_orp(
+            root,
+            "--config",
+            "orp.kernel.bench.json",
+            "gate",
+            "run",
+            "--profile",
+            "legacy",
+            "--json",
+        )
+
+        hard_payload = json.loads(hard_proc.stdout)
+        soft_payload = json.loads(soft_proc.stdout)
+        legacy_payload = json.loads(legacy_proc.stdout)
+
+        hard_result = json.loads((root / hard_payload["run_record"]).read_text(encoding="utf-8"))["results"][0]
+        soft_result = json.loads((root / soft_payload["run_record"]).read_text(encoding="utf-8"))["results"][0]
+        legacy_result = json.loads((root / legacy_payload["run_record"]).read_text(encoding="utf-8"))["results"][0]
+
+        return {
+            "hard_mode": {
+                "ms": round(hard_ms, 3),
+                "exit_code": hard_proc.returncode,
+                "overall": hard_payload["overall"],
+                "kernel_valid": hard_result["kernel_validation"]["valid"],
+                "missing_fields": hard_result["kernel_validation"]["artifacts"][0]["missing_fields"],
+            },
+            "soft_mode": {
+                "ms": round(soft_ms, 3),
+                "exit_code": soft_proc.returncode,
+                "overall": soft_payload["overall"],
+                "kernel_valid": soft_result["kernel_validation"]["valid"],
+            },
+            "legacy_compatibility": {
+                "ms": round(legacy_ms, 3),
+                "exit_code": legacy_proc.returncode,
+                "overall": legacy_payload["overall"],
+                "has_kernel_validation": "kernel_validation" in legacy_result,
+            },
+            "meets_expectations": {
+                "hard_blocks_invalid_artifact": hard_proc.returncode == 1
+                and hard_payload["overall"] == "FAIL"
+                and hard_result["kernel_validation"]["valid"] is False,
+                "soft_allows_invalid_artifact_with_advisory": soft_proc.returncode == 0
+                and soft_payload["overall"] == "PASS"
+                and soft_result["kernel_validation"]["valid"] is False,
+                "legacy_structure_kernel_remains_compatible": legacy_proc.returncode == 0
+                and legacy_payload["overall"] == "PASS"
+                and "kernel_validation" not in legacy_result,
+            },
+        }
+
+
+def _gather_metadata() -> dict[str, Any]:
+    package_version = json.loads((REPO_ROOT / "package.json").read_text(encoding="utf-8"))["version"]
+    commit = _run(["git", "rev-parse", "HEAD"]).stdout.strip()
+    branch = _run(["git", "rev-parse", "--abbrev-ref", "HEAD"]).stdout.strip()
+    node_version = _run(["node", "--version"]).stdout.strip()
+    return {
+        "generated_at_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
+        "repo_commit": commit,
+        "repo_branch": branch,
+        "package_version": package_version,
+        "python_version": sys.version.split()[0],
+        "node_version": node_version,
+        "platform": platform.platform(),
+    }
+
+
+def build_report(iterations: int) -> dict[str, Any]:
+    init_benchmark = _benchmark_init_starter(iterations)
+    roundtrip_benchmark = _benchmark_artifact_roundtrip()
+    gate_mode_benchmark = _benchmark_gate_modes()
+
+    claims = [
+        {
+            "id": "starter_kernel_bootstrap",
+            "claim": "orp init seeds a valid starter kernel artifact and a passing default structure_kernel gate.",
+            "status": "pass",
+            "evidence": [
+                "benchmarks.init_starter_kernel",
+                "cli/orp.py",
+                "tests/test_orp_init.py",
+            ],
+        },
+        {
+            "id": "typed_artifact_roundtrip",
+            "claim": "All seven v0.1 artifact classes can be scaffolded and validated through the CLI.",
+            "status": "pass" if roundtrip_benchmark["artifact_classes_total"] == 7 else "fail",
+            "evidence": [
+                "benchmarks.artifact_roundtrip",
+                "spec/v1/kernel.schema.json",
+                "tests/test_orp_kernel.py",
+            ],
+        },
+        {
+            "id": "promotion_enforcement_modes",
+            "claim": "Hard mode blocks invalid promotable artifacts, while soft mode records advisory issues without blocking.",
+            "status": "pass"
+            if gate_mode_benchmark["meets_expectations"]["hard_blocks_invalid_artifact"]
+            and gate_mode_benchmark["meets_expectations"]["soft_allows_invalid_artifact_with_advisory"]
+            else "fail",
+            "evidence": [
+                "benchmarks.gate_modes",
+                "tests/test_orp_kernel.py",
+            ],
+        },
+        {
+            "id": "legacy_structure_kernel_compatibility",
+            "claim": "Existing structure_kernel gates without explicit kernel config remain compatible.",
+            "status": "pass"
+            if gate_mode_benchmark["meets_expectations"]["legacy_structure_kernel_remains_compatible"]
+            else "fail",
+            "evidence": [
+                "benchmarks.gate_modes",
+                "cli/orp.py",
+            ],
+        },
+        {
+            "id": "local_cli_kernel_ergonomics",
+            "claim": "One-shot kernel CLI operations remain within human-scale local ergonomics targets on the reference machine.",
+            "status": "pass"
+            if all(init_benchmark["meets_targets"].values())
+            and all(roundtrip_benchmark["meets_targets"].values())
+            else "fail",
+            "evidence": [
+                "benchmarks.init_starter_kernel",
+                "benchmarks.artifact_roundtrip",
+            ],
+        },
+    ]
+
+    return {
+        "schema_version": "1.0.0",
+        "kind": "orp_reasoning_kernel_validation_report",
+        "metadata": _gather_metadata(),
+        "benchmarks": {
+            "init_starter_kernel": init_benchmark,
+            "artifact_roundtrip": roundtrip_benchmark,
+            "gate_modes": gate_mode_benchmark,
+        },
+        "claims": claims,
+        "summary": {
+            "all_claims_pass": all(row["status"] == "pass" for row in claims),
+            "artifact_classes_total": roundtrip_benchmark["artifact_classes_total"],
+            "all_performance_targets_met": all(init_benchmark["meets_targets"].values())
+            and all(roundtrip_benchmark["meets_targets"].values()),
+        },
+    }
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Benchmark and validate ORP Reasoning Kernel v0.1")
+    parser.add_argument("--out", default="", help="Optional JSON output path")
+    parser.add_argument("--iterations", type=int, default=5, help="Iterations for bootstrap benchmark")
+    parser.add_argument("--quick", action="store_true", help="Use a single bootstrap iteration for fast checks")
+    args = parser.parse_args()
+
+    iterations = 1 if args.quick else max(1, args.iterations)
+    report = build_report(iterations)
+    payload = json.dumps(report, indent=2) + "\n"
+    if args.out:
+        out_path = Path(args.out)
+        if not out_path.is_absolute():
+            out_path = REPO_ROOT / out_path
+        out_path.parent.mkdir(parents=True, exist_ok=True)
+        out_path.write_text(payload, encoding="utf-8")
+    print(payload, end="")
+    return 0 if report["summary"]["all_claims_pass"] else 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())