open-research-protocol 0.4.6 → 0.4.7

package/README.md CHANGED
@@ -23,6 +23,7 @@ verification remains independent of framing. See `modules/instruments/README.md`
23
23
  - `docs/AGENT_LOOP.md` — canonical operating loop when an agent is the primary ORP user
24
24
  - `docs/CANONICAL_CLI_BOUNDARY.md` — canonical source-of-truth boundary between CLI, Rust, and web
25
25
  - `docs/ORP_REASONING_KERNEL_V0_1.md` — draft kernel model for turning loose intent into promotable canonical artifacts
26
+ - `docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md` — technical rationale, benchmarks, and alternatives analysis for the kernel
26
27
  - `docs/EXTERNAL_CONTRIBUTION_GOVERNANCE.md` — canonical local-first workflow for external OSS PR work
27
28
  - `docs/OSS_CONTRIBUTION_AGENT_LOOP.md` — agent operating rhythm for external contribution workflows
28
29
  - `templates/` — claim, verification, failure, and issue templates
@@ -0,0 +1,353 @@
1
+ # ORP Reasoning Kernel Technical Validation
2
+
3
+ This document defines the ORP Reasoning Kernel in technical terms, explains
4
+ why ORP implements it this way, and records the initial validation evidence
5
+ for `v0.1`.
6
+
7
+ The supporting benchmark artifact for this document is:
8
+
9
+ - [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json)
10
+
11
+ ## 1. Definition
12
+
13
+ The ORP Reasoning Kernel is the typed artifact grammar and validation layer
14
+ used by ORP to move work from free-form intent into canonical repository
15
+ artifacts.
16
+
17
+ It operates in three roles:
18
+
19
+ 1. Interpreter
20
+ Convert loose natural-language intent into a structured working shape.
21
+ 2. Validator
22
+ Check whether a candidate artifact is complete enough to be trusted and
23
+ promoted.
24
+ 3. Canonizer
25
+ Gate whether the artifact can become repository truth and show its
26
+ validation trace in ORP run output.
27
+
28
+ The kernel is implemented through:
29
+
30
+ - [spec/v1/kernel.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel.schema.json)
31
+ - `orp kernel scaffold`
32
+ - `orp kernel validate`
33
+ - `structure_kernel` gate enforcement in [cli/orp.py](/Volumes/Code_2TB/code/orp/cli/orp.py)
34
+
35
+ ## 2. What Problem It Solves
36
+
37
+ Without a kernel layer, ORP can still execute work, but repository truth tends
38
+ to drift into one of two bad states:
39
+
40
+ 1. Chat soup
41
+ Important meaning lives in prompts and responses instead of canonical
42
+ artifacts.
43
+ 2. Hidden agent structure
44
+ The agent may internally interpret a task well, but another human or agent
45
+ cannot inspect that structure or validate promotion.
46
+
47
+ The kernel addresses that by making promotable artifacts:
48
+
49
+ - typed
50
+ - minimally complete
51
+ - machine-checkable
52
+ - reusable in handoffs
53
+ - visible in run artifacts
54
+
55
+ ## 3. Why This Kernel Instead Of Another Approach
56
+
57
+ ### A. Why not free-form markdown or chat alone?
58
+
59
+ Free-form text is useful for ideation, but it does not reliably answer:
60
+
61
+ - what kind of artifact this is
62
+ - what minimum structure is present or missing
63
+ - what should block promotion
64
+ - what another operator can trust later
65
+
66
+ ORP keeps natural language at the boundary and adds structure at promotion.
67
+
68
+ ### B. Why not require kernel-native syntax for all human input?
69
+
70
+ Because that damages usability and adoption.
71
+
72
+ Humans should be able to think in normal language. ORP should not require
73
+ every prompt to be authored as a rigid schema object before work can happen.
74
+ That is why the kernel is enforced at the artifact and gate layer rather than
75
+ as a hard input parser for every message.
76
+
77
+ ### C. Why typed artifact classes instead of one generic checklist?
78
+
79
+ Because a task, a decision, and a hypothesis fail in different ways.
80
+
81
+ A single universal checklist loses semantic meaning. ORP therefore uses typed
82
+ artifact classes with different required fields:
83
+
84
+ - `task`
85
+ - `decision`
86
+ - `hypothesis`
87
+ - `experiment`
88
+ - `checkpoint`
89
+ - `policy`
90
+ - `result`
91
+
92
+ This is enough structure to be useful without forcing a heavyweight ontology.
93
+
94
+ ### D. Why not a domain-specific kernel for just software or just research?
95
+
96
+ Because ORP is meant to govern many kinds of work, not one domain.
97
+
98
+ The chosen artifact classes map across:
99
+
100
+ - software delivery
101
+ - research
102
+ - product design
103
+ - operations and reliability
104
+ - writing and knowledge work
105
+ - policy and governance work
106
+
107
+ ### E. Why not a hidden agent-only kernel?
108
+
109
+ Because invisible structure cannot be audited.
110
+
111
+ If the agent interprets a request privately but the repository never records
112
+ that shape, then the kernel is not stabilizing truth. ORP instead writes
113
+ kernel validation into `RUN.json` and lets artifacts be validated directly
114
+ from the CLI.
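
As an illustration of what "writes kernel validation into `RUN.json`" makes possible, here is a minimal sketch of reading that trace back. The record below is a trimmed, hypothetical stand-in: it includes only the keys the benchmark harness reads (`results`, `kernel_validation`, `valid`, `artifacts`, `missing_fields`), and `summarize_kernel_validation` is an illustrative helper name, not part of ORP.

```python
import json

# Trimmed, hypothetical RUN.json content. Only the keys read by the benchmark
# harness are shown; a real record may carry additional fields.
RUN_RECORD_TEXT = """
{
  "results": [
    {
      "kernel_validation": {
        "valid": false,
        "artifacts": [
          {"missing_fields": ["constraints", "success_criteria"]}
        ]
      }
    }
  ]
}
"""


def summarize_kernel_validation(run_record: dict) -> list:
    """Return one summary row per gate result that carries kernel validation."""
    rows = []
    for index, result in enumerate(run_record.get("results", [])):
        validation = result.get("kernel_validation")
        if validation is None:
            # Legacy structure_kernel gates record no kernel_validation block.
            continue
        missing = [
            field
            for artifact in validation.get("artifacts", [])
            for field in artifact.get("missing_fields", [])
        ]
        rows.append(
            {"result_index": index, "valid": validation["valid"], "missing_fields": missing}
        )
    return rows


rows = summarize_kernel_validation(json.loads(RUN_RECORD_TEXT))
print(rows)
```

Because the trace is plain JSON, another human or agent can audit a promotion decision without access to the conversation that produced it.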
115
+
116
+ ### F. Why not a full ontology before shipping anything?
117
+
118
+ Because `v0.1` is meant to be operational, not metaphysical.
119
+
120
+ The current kernel is intentionally minimal:
121
+
122
+ - a small number of classes
123
+ - a small number of required fields
124
+ - explicit hard vs soft gate behavior
125
+ - compatibility with existing `structure_kernel` gates
126
+
127
+ That lowers rollout risk and makes the kernel easier to test and adopt.
128
+
129
+ ## 4. The Current Technical Shape
130
+
131
+ ### Artifact classes
132
+
133
+ The schema currently supports:
134
+
135
+ - `task`
136
+ - `decision`
137
+ - `hypothesis`
138
+ - `experiment`
139
+ - `checkpoint`
140
+ - `policy`
141
+ - `result`
142
+
143
+ Each class has a minimum required field set in:
144
+
145
+ - [kernel.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel.schema.json)
146
+ - [cli/orp.py](/Volumes/Code_2TB/code/orp/cli/orp.py)
147
+
148
+ ### CLI operations
149
+
150
+ The kernel currently exposes:
151
+
152
+ - `orp kernel scaffold`
153
+ - `orp kernel validate`
154
+
155
+ ### Gate integration
156
+
157
+ ORP now treats `structure_kernel` as a real validation lane when a gate
158
+ declares a `kernel` block. That gives:
159
+
160
+ - `soft` mode
161
+ Validation issues are recorded but do not block the run.
162
+ - `hard` mode
163
+ Validation issues fail the gate and block promotion.
164
+
165
+ Legacy `structure_kernel` gates without explicit `kernel` configuration remain
166
+ compatible.
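
A gate configuration along these lines might look like the sketch below. The field names (`phase`, `command`, `pass`, `kernel`, `mode`, `artifacts`) are copied from the config that `scripts/orp-kernel-benchmark.py` writes; the helper `make_kernel_gate` and the artifact path are hypothetical, and the full config schema may allow more than is shown.

```python
import json


def make_kernel_gate(gate_id: str, mode: str, artifact_path: str, artifact_class: str) -> dict:
    """Build a structure_kernel gate entry with an explicit kernel block (illustrative helper)."""
    if mode not in ("hard", "soft"):
        raise ValueError(f"unsupported kernel mode: {mode!r}")
    return {
        "id": gate_id,
        "phase": "structure_kernel",
        "command": "true",
        "pass": {"exit_codes": [0]},
        "kernel": {
            "mode": mode,
            "artifacts": [{"path": artifact_path, "artifact_class": artifact_class}],
        },
    }


# A legacy gate simply omits the kernel block.
legacy_gate = {
    "id": "kernel_legacy",
    "phase": "structure_kernel",
    "command": "true",
    "pass": {"exit_codes": [0]},
}

gates = [
    make_kernel_gate("kernel_hard", "hard", "analysis/example-task.kernel.json", "task"),
    make_kernel_gate("kernel_soft", "soft", "analysis/example-task.kernel.json", "task"),
    legacy_gate,
]
print(json.dumps({"gates": gates}, indent=2))
```

The only difference between the hard and soft gates is the `mode` value, which keeps the choice between blocking and advising visible in the config.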
167
+
168
+ ### Bootstrap behavior
169
+
170
+ `orp init` now seeds a starter task artifact at:
171
+
172
+ - `analysis/orp.kernel.task.yml`
173
+
174
+ and the default profile validates it in hard mode.
175
+
176
+ ## 5. Benchmark And Validation Method
177
+
178
+ The repeatable harness is:
179
+
180
+ - [scripts/orp-kernel-benchmark.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-benchmark.py)
181
+
182
+ The harness benchmarks and validates:
183
+
184
+ 1. Bootstrap path
185
+ `orp init` -> starter artifact -> `orp kernel validate` -> `orp gate run`
186
+ 2. Roundtrip path
187
+ `orp kernel scaffold` + `orp kernel validate` for every artifact class
188
+ 3. Enforcement path
189
+ hard mode, soft mode, and legacy compatibility
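
The timing approach behind these paths can be sketched as follows: wall-clock each subprocess call with `time.perf_counter()` and summarize the samples. This mirrors the shape of the harness helpers, but it is a simplified sketch; it times a trivial Python subprocess as a stand-in for the `node bin/orp.js` invocation, and the names `timed_run` and `summarize` are illustrative.

```python
import statistics
import subprocess
import sys
import time


def timed_run(args: list) -> tuple:
    """Run a command and return (elapsed milliseconds, completed process)."""
    started = time.perf_counter()
    proc = subprocess.run(args, capture_output=True, text=True)
    return (time.perf_counter() - started) * 1000.0, proc


def summarize(values: list) -> dict:
    """Reduce timing samples to the four statistics recorded in the report."""
    return {
        "mean_ms": round(statistics.mean(values), 3),
        "median_ms": round(statistics.median(values), 3),
        "min_ms": round(min(values), 3),
        "max_ms": round(max(values), 3),
    }


samples = []
for _ in range(3):
    # Stand-in command; the real harness invokes the ORP CLI here.
    elapsed_ms, proc = timed_run([sys.executable, "-c", "print('ok')"])
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr)
    samples.append(elapsed_ms)

summary = summarize(samples)
print(summary)
```

Timing the whole subprocess call means the result includes interpreter startup, which is what a user running the CLI once would experience.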
190
+
191
+ The benchmark report was generated on:
192
+
193
+ - commit `5c87faf4fbd54d203cc0ca05683544355c306d55`
194
+ - package version `0.4.7`
195
+ - Python `3.9.6`
196
+ - Node `v24.10.0`
197
+ - `macOS-26.3-arm64-arm-64bit`
198
+
199
+ ## 6. What The Benchmarks Show
200
+
201
+ ### A. Bootstrap ergonomics
202
+
203
+ Reference run, 5 iterations:
204
+
205
+ - `orp init` mean: `245.853 ms`
206
+ - starter `orp kernel validate` mean: `169.097 ms`
207
+ - default `orp gate run` mean: `242.618 ms`
208
+
209
+ Interpretation:
210
+
211
+ - Kernel bootstrap is comfortably sub-second.
212
+ - The one-shot local developer experience is fast enough to be used in normal
213
+ repo workflow without feeling heavy.
214
+ - These timings include the real `node -> python CLI` invocation path, which is
215
+ the correct path to benchmark for npm-installed ORP use.
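
The pass/fail reading of these timings comes from comparing each mean against a fixed target. A minimal sketch of that comparison, using the observed means and targets recorded in the benchmark report for this release (the mapping `TARGET_KEYS` and the variable names are illustrative):

```python
# Observed means and targets as recorded under
# benchmarks.init_starter_kernel in the validation report.
observed_mean_ms = {"init": 245.853, "validate": 169.097, "gate_run": 242.618}
targets = {
    "init_mean_lt_ms": 350.0,
    "validate_mean_lt_ms": 200.0,
    "gate_mean_lt_ms": 300.0,
}

# Which target key applies to which measurement.
TARGET_KEYS = {
    "init": "init_mean_lt_ms",
    "validate": "validate_mean_lt_ms",
    "gate_run": "gate_mean_lt_ms",
}

meets_targets = {
    name: observed_mean_ms[name] < targets[key] for name, key in TARGET_KEYS.items()
}

for name, key in TARGET_KEYS.items():
    headroom = targets[key] - observed_mean_ms[name]
    print(f"{name}: {observed_mean_ms[name]} ms, target < {targets[key]} ms, headroom {headroom:.3f} ms")
```

The comparison is strict (`<`), so a mean exactly equal to its target would count as a miss.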
216
+
217
+ ### B. Roundtrip across all artifact classes
218
+
219
+ All seven artifact classes successfully scaffolded and validated.
220
+
221
+ Observed means:
222
+
223
+ - scaffold mean: `163.621 ms`
224
+ - validate mean: `162.63 ms`
225
+
226
+ Interpretation:
227
+
228
+ - The kernel is not only task-shaped.
229
+ - The CLI surface is already general enough for multiple project artifact
230
+ types.
231
+
232
+ ### C. Enforcement semantics
233
+
234
+ Reference single-run timings:
235
+
236
+ - hard mode invalid artifact: `174.339 ms`, `FAIL`
237
+ - soft mode invalid artifact: `173.082 ms`, `PASS` with advisory invalid state
238
+ - legacy compatibility gate: `172.431 ms`, `PASS` without `kernel_validation`
239
+
240
+ Interpretation:
241
+
242
+ - hard mode and soft mode are enforced and testable
243
+ - existing `structure_kernel` surfaces do not regress when no explicit kernel
244
+ config is present
245
+
246
+ ## 7. Claims And Evidence
247
+
248
+ The benchmark report records five claims, all currently passing:
249
+
250
+ 1. `starter_kernel_bootstrap`
251
+ ORP seeds a valid starter artifact and a passing default kernel gate.
252
+ 2. `typed_artifact_roundtrip`
253
+ All seven artifact classes scaffold and validate successfully.
254
+ 3. `promotion_enforcement_modes`
255
+ Hard mode blocks invalid artifacts; soft mode records advisory invalidity.
256
+ 4. `legacy_structure_kernel_compatibility`
257
+ Older `structure_kernel` gates remain compatible.
258
+ 5. `local_cli_kernel_ergonomics`
259
+ One-shot kernel operations stay under the harness latency targets on the
260
+ reference machine: mean under 350 ms for `orp init`, 300 ms for `orp gate run`, and 200 ms for scaffold and validate.
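
A reader who wants to re-check the claim list can do so directly from the report. The sketch below parses a trimmed excerpt that follows the report's `claims` and `summary` layout; the excerpt is abbreviated for illustration (the `claim` and `evidence` fields are omitted), and the variable names are assumptions.

```python
import json

# Abbreviated excerpt following the layout of the validation report.
REPORT_EXCERPT = """
{
  "claims": [
    {"id": "starter_kernel_bootstrap", "status": "pass"},
    {"id": "typed_artifact_roundtrip", "status": "pass"},
    {"id": "promotion_enforcement_modes", "status": "pass"},
    {"id": "legacy_structure_kernel_compatibility", "status": "pass"},
    {"id": "local_cli_kernel_ergonomics", "status": "pass"}
  ],
  "summary": {"all_claims_pass": true}
}
"""

report = json.loads(REPORT_EXCERPT)

# Recompute the summary flag from the individual claims rather than trusting it.
failing = [claim["id"] for claim in report["claims"] if claim["status"] != "pass"]
recomputed_all_pass = not failing

if recomputed_all_pass != report["summary"]["all_claims_pass"]:
    raise ValueError("summary.all_claims_pass disagrees with the claim rows")

print(f"{len(report['claims'])} claims checked, failing: {failing}")
```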
261
+
262
+ These claims are backed by:
263
+
264
+ - [tests/test_orp_kernel.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel.py)
265
+ - [tests/test_orp_init.py](/Volumes/Code_2TB/code/orp/tests/test_orp_init.py)
266
+ - [tests/test_orp_kernel_benchmark.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_benchmark.py)
267
+ - [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json)
268
+
269
+ ## 8. Why This Applies To All Project Types
270
+
271
+ The kernel is not a software-only mechanism. It is a project-structure
272
+ mechanism.
273
+
274
+ ### Software
275
+
276
+ - feature task
277
+ - architectural decision
278
+ - release policy
279
+ - implementation result
280
+
281
+ ### Research
282
+
283
+ - hypothesis
284
+ - experiment
285
+ - result
286
+ - checkpoint
287
+
288
+ ### Product and design
289
+
290
+ - task
291
+ - decision
292
+ - experiment
293
+ - result
294
+
295
+ ### Operations and reliability
296
+
297
+ - policy
298
+ - checkpoint
299
+ - result
300
+ - task
301
+
302
+ ### Writing and knowledge work
303
+
304
+ - task
305
+ - decision
306
+ - hypothesis
307
+ - result
308
+
309
+ The kernel applies because most serious projects need the same underlying
310
+ capabilities:
311
+
312
+ - define the object of work
313
+ - define boundaries and constraints
314
+ - promote only sufficiently structured truth
315
+ - preserve handoff-quality artifacts
316
+
317
+ ## 9. Limits Of v0.1
318
+
319
+ The current kernel validates structural sufficiency, not semantic truth.
320
+
321
+ It can tell us:
322
+
323
+ - whether required fields are present
324
+ - whether an artifact is typed correctly
325
+ - whether promotion rules are satisfied
326
+ - whether a gate should block or advise
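
To make "structural sufficiency" concrete, here is a minimal sketch of a required-field check and the block-or-advise decision. The actual required-field sets live in `spec/v1/kernel.schema.json` and are not reproduced here; `ASSUMED_TASK_REQUIRED` is an assumption, chosen so the example agrees with the `missing_fields` recorded in the benchmark report. The function names are illustrative and this is not the ORP implementation.

```python
# Assumed required fields for a `task` artifact (NOT copied from the schema).
ASSUMED_TASK_REQUIRED = ["object", "goal", "boundary", "constraints", "success_criteria"]


def missing_fields(artifact: dict, required: list) -> list:
    """Return required field names that are absent or empty, in declared order."""
    return [name for name in required if artifact.get(name) in (None, "", [])]


def gate_outcome(mode: str, missing: list) -> str:
    """Hard mode fails on any missing field; soft mode passes and only advises."""
    if mode == "hard" and missing:
        return "FAIL"
    return "PASS"


# The deliberately incomplete task artifact used by the benchmark harness.
invalid_task = {
    "schema_version": "1.0.0",
    "artifact_class": "task",
    "object": "terminal trace widget",
    "goal": "surface lane state and drift",
    "boundary": "terminal-first workflow",
}

missing = missing_fields(invalid_task, ASSUMED_TASK_REQUIRED)
print("missing:", missing)
print("hard:", gate_outcome("hard", missing))
print("soft:", gate_outcome("soft", missing))
```

Nothing in this check inspects whether the goal is sensible, which is the limit described in this section.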
327
+
328
+ It cannot tell us:
329
+
330
+ - whether the task is strategically wise
331
+ - whether a hypothesis is scientifically correct
332
+ - whether the interpretation of a result is sound
333
+ - whether the chosen artifact class was the best possible framing
334
+
335
+ That is an acceptable `v0.1` limitation. ORP is not trying to ship a truth
336
+ oracle. It is shipping a minimum structure standard for canonical work.
337
+
338
+ ## 10. Bottom Line
339
+
340
+ The ORP Reasoning Kernel is technically justified because it gives ORP a
341
+ repeatable, inspectable, and enforceable way to turn natural-language project
342
+ intent into typed canonical artifacts.
343
+
344
+ The current evidence supports that claim:
345
+
346
+ - it boots cleanly in new repos
347
+ - it works across all current artifact classes
348
+ - it enforces hard vs soft promotion semantics correctly
349
+ - it preserves compatibility with pre-kernel `structure_kernel` gates
350
+ - it stays within human-scale local CLI latency targets
351
+
352
+ That makes it a good `v0.1` kernel: minimal, general, validated, and already
353
+ useful.
@@ -11,6 +11,10 @@ The ORP Reasoning Kernel is the artifact-shaping grammar that interprets
11
11
  intent, validates structure, and governs promotion into canonical repository
12
12
  truth.
13
13
 
14
+ For the supporting benchmark evidence and alternatives analysis behind this
15
+ design, see
16
+ [docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md).
17
+
14
18
  It should make three things true at once:
15
19
 
16
20
  - humans can speak naturally at the boundary
@@ -0,0 +1,197 @@
1
+ {
2
+ "schema_version": "1.0.0",
3
+ "kind": "orp_reasoning_kernel_validation_report",
4
+ "metadata": {
5
+ "generated_at_utc": "2026-03-23T04:42:53Z",
6
+ "repo_commit": "5c87faf4fbd54d203cc0ca05683544355c306d55",
7
+ "repo_branch": "main",
8
+ "package_version": "0.4.7",
9
+ "python_version": "3.9.6",
10
+ "node_version": "v24.10.0",
11
+ "platform": "macOS-26.3-arm64-arm-64bit"
12
+ },
13
+ "benchmarks": {
14
+ "init_starter_kernel": {
15
+ "iterations": 5,
16
+ "observed": {
17
+ "init": {
18
+ "mean_ms": 245.853,
19
+ "median_ms": 242.029,
20
+ "min_ms": 239.454,
21
+ "max_ms": 257.57
22
+ },
23
+ "validate": {
24
+ "mean_ms": 169.097,
25
+ "median_ms": 167.938,
26
+ "min_ms": 165.273,
27
+ "max_ms": 173.245
28
+ },
29
+ "gate_run": {
30
+ "mean_ms": 242.618,
31
+ "median_ms": 239.599,
32
+ "min_ms": 238.174,
33
+ "max_ms": 252.913
34
+ }
35
+ },
36
+ "targets": {
37
+ "init_mean_lt_ms": 350.0,
38
+ "validate_mean_lt_ms": 200.0,
39
+ "gate_mean_lt_ms": 300.0
40
+ },
41
+ "meets_targets": {
42
+ "init": true,
43
+ "validate": true,
44
+ "gate_run": true
45
+ },
46
+ "sample_run_records": [
47
+ "orp/artifacts/run-20260323-044247-956825/RUN.json",
48
+ "orp/artifacts/run-20260323-044248-621472/RUN.json"
49
+ ]
50
+ },
51
+ "artifact_roundtrip": {
52
+ "artifact_classes_total": 7,
53
+ "rows": [
54
+ {
55
+ "artifact_class": "task",
56
+ "scaffold_ms": 162.963,
57
+ "validate_ms": 161.02
58
+ },
59
+ {
60
+ "artifact_class": "decision",
61
+ "scaffold_ms": 162.639,
62
+ "validate_ms": 161.466
63
+ },
64
+ {
65
+ "artifact_class": "hypothesis",
66
+ "scaffold_ms": 162.337,
67
+ "validate_ms": 165.228
68
+ },
69
+ {
70
+ "artifact_class": "experiment",
71
+ "scaffold_ms": 171.011,
72
+ "validate_ms": 160.825
73
+ },
74
+ {
75
+ "artifact_class": "checkpoint",
76
+ "scaffold_ms": 161.705,
77
+ "validate_ms": 163.51
78
+ },
79
+ {
80
+ "artifact_class": "policy",
81
+ "scaffold_ms": 160.807,
82
+ "validate_ms": 163.85
83
+ },
84
+ {
85
+ "artifact_class": "result",
86
+ "scaffold_ms": 163.882,
87
+ "validate_ms": 162.509
88
+ }
89
+ ],
90
+ "observed": {
91
+ "scaffold": {
92
+ "mean_ms": 163.621,
93
+ "median_ms": 162.639,
94
+ "min_ms": 160.807,
95
+ "max_ms": 171.011
96
+ },
97
+ "validate": {
98
+ "mean_ms": 162.63,
99
+ "median_ms": 162.509,
100
+ "min_ms": 160.825,
101
+ "max_ms": 165.228
102
+ }
103
+ },
104
+ "targets": {
105
+ "scaffold_mean_lt_ms": 200.0,
106
+ "validate_mean_lt_ms": 200.0
107
+ },
108
+ "meets_targets": {
109
+ "scaffold": true,
110
+ "validate": true
111
+ }
112
+ },
113
+ "gate_modes": {
114
+ "hard_mode": {
115
+ "ms": 174.339,
116
+ "exit_code": 1,
117
+ "overall": "FAIL",
118
+ "kernel_valid": false,
119
+ "missing_fields": [
120
+ "constraints",
121
+ "success_criteria"
122
+ ]
123
+ },
124
+ "soft_mode": {
125
+ "ms": 173.082,
126
+ "exit_code": 0,
127
+ "overall": "PASS",
128
+ "kernel_valid": false
129
+ },
130
+ "legacy_compatibility": {
131
+ "ms": 172.431,
132
+ "exit_code": 0,
133
+ "overall": "PASS",
134
+ "has_kernel_validation": false
135
+ },
136
+ "meets_expectations": {
137
+ "hard_blocks_invalid_artifact": true,
138
+ "soft_allows_invalid_artifact_with_advisory": true,
139
+ "legacy_structure_kernel_remains_compatible": true
140
+ }
141
+ }
142
+ },
143
+ "claims": [
144
+ {
145
+ "id": "starter_kernel_bootstrap",
146
+ "claim": "orp init seeds a valid starter kernel artifact and a passing default structure_kernel gate.",
147
+ "status": "pass",
148
+ "evidence": [
149
+ "benchmarks.init_starter_kernel",
150
+ "cli/orp.py",
151
+ "tests/test_orp_init.py"
152
+ ]
153
+ },
154
+ {
155
+ "id": "typed_artifact_roundtrip",
156
+ "claim": "All seven v0.1 artifact classes can be scaffolded and validated through the CLI.",
157
+ "status": "pass",
158
+ "evidence": [
159
+ "benchmarks.artifact_roundtrip",
160
+ "spec/v1/kernel.schema.json",
161
+ "tests/test_orp_kernel.py"
162
+ ]
163
+ },
164
+ {
165
+ "id": "promotion_enforcement_modes",
166
+ "claim": "Hard mode blocks invalid promotable artifacts, while soft mode records advisory issues without blocking.",
167
+ "status": "pass",
168
+ "evidence": [
169
+ "benchmarks.gate_modes",
170
+ "tests/test_orp_kernel.py"
171
+ ]
172
+ },
173
+ {
174
+ "id": "legacy_structure_kernel_compatibility",
175
+ "claim": "Existing structure_kernel gates without explicit kernel config remain compatible.",
176
+ "status": "pass",
177
+ "evidence": [
178
+ "benchmarks.gate_modes",
179
+ "cli/orp.py"
180
+ ]
181
+ },
182
+ {
183
+ "id": "local_cli_kernel_ergonomics",
184
+ "claim": "One-shot kernel CLI operations remain within human-scale local ergonomics targets on the reference machine.",
185
+ "status": "pass",
186
+ "evidence": [
187
+ "benchmarks.init_starter_kernel",
188
+ "benchmarks.artifact_roundtrip"
189
+ ]
190
+ }
191
+ ],
192
+ "summary": {
193
+ "all_claims_pass": true,
194
+ "artifact_classes_total": 7,
195
+ "all_performance_targets_met": true
196
+ }
197
+ }
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "open-research-protocol",
3
- "version": "0.4.6",
3
+ "version": "0.4.7",
4
4
  "description": "ORP CLI (Open Research Protocol): agent-friendly research workflows, runtime, reports, and pack tooling.",
5
5
  "license": "MIT",
6
6
  "repository": {
@@ -0,0 +1,452 @@
1
+ #!/usr/bin/env python3
2
+ from __future__ import annotations
3
+
4
+ import argparse
5
+ import json
6
+ from pathlib import Path
7
+ import platform
8
+ import statistics
9
+ import subprocess
10
+ import sys
11
+ import tempfile
12
+ import time
13
+ from typing import Any
14
+
15
+
16
+ REPO_ROOT = Path(__file__).resolve().parents[1]
17
+ CLI = ["node", "bin/orp.js"]
+ ARTIFACT_CLASSES = [
+     "task",
+     "decision",
+     "hypothesis",
+     "experiment",
+     "checkpoint",
+     "policy",
+     "result",
+ ]
+
+
+ def _run(
+     args: list[str],
+     *,
+     cwd: Path = REPO_ROOT,
+     check: bool = True,
+ ) -> subprocess.CompletedProcess[str]:
+     """Run a command with captured text output; raise on failure when check is set."""
+     proc = subprocess.run(
+         args,
+         cwd=str(cwd),
+         capture_output=True,
+         text=True,
+     )
+     if check and proc.returncode != 0:
+         raise RuntimeError(
+             f"command failed: {' '.join(args)}\nstdout:\n{proc.stdout}\nstderr:\n{proc.stderr}"
+         )
+     return proc
+
+
+ def _run_orp(repo_root: Path, *args: str, check: bool = True) -> subprocess.CompletedProcess[str]:
+     """Invoke the ORP CLI through node against the given repository root."""
+     return _run([*CLI, "--repo-root", str(repo_root), *args], check=check)
+
+
+ def _timed_orp(repo_root: Path, *args: str, check: bool = True) -> tuple[float, subprocess.CompletedProcess[str]]:
+     """Run an ORP command and return (elapsed milliseconds, completed process)."""
+     started = time.perf_counter()
+     proc = _run_orp(repo_root, *args, check=check)
+     return (time.perf_counter() - started) * 1000.0, proc
+
+
+ def _write_json(path: Path, payload: dict[str, Any]) -> None:
+     """Write payload as indented JSON, creating parent directories as needed."""
+     path.parent.mkdir(parents=True, exist_ok=True)
+     path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
+
+
+ def _stats(values: list[float]) -> dict[str, float]:
+     """Summarize timing samples (milliseconds), rounded to three decimals."""
+     return {
+         "mean_ms": round(statistics.mean(values), 3),
+         "median_ms": round(statistics.median(values), 3),
+         "min_ms": round(min(values), 3),
+         "max_ms": round(max(values), 3),
+     }
70
+
71
+
+ def _benchmark_init_starter(iterations: int) -> dict[str, Any]:
+     """Time `orp init`, starter validation, and the default gate run in fresh temp repos."""
+     init_times: list[float] = []
+     validate_times: list[float] = []
+     gate_times: list[float] = []
+     run_records: list[str] = []
+
+     for _ in range(iterations):
+         with tempfile.TemporaryDirectory(prefix="orp-kernel-bench-init.") as td:
+             root = Path(td)
+             _run(["git", "init", str(root)])
+             init_ms, init_proc = _timed_orp(root, "init", "--json")
+             init_payload = json.loads(init_proc.stdout)
+             validate_ms, validate_proc = _timed_orp(
+                 root, "kernel", "validate", "analysis/orp.kernel.task.yml", "--json"
+             )
+             validate_payload = json.loads(validate_proc.stdout)
+             gate_ms, gate_proc = _timed_orp(root, "gate", "run", "--profile", "default", "--json")
+             gate_payload = json.loads(gate_proc.stdout)
+
+             # Fail fast: timings from a broken bootstrap are not meaningful.
+             if not init_payload.get("ok"):
+                 raise RuntimeError("orp init benchmark did not report ok=true")
+             if not validate_payload.get("ok"):
+                 raise RuntimeError("starter kernel validate benchmark did not report ok=true")
+             if gate_payload.get("overall") != "PASS":
+                 raise RuntimeError("starter kernel gate benchmark did not pass")
+
+             init_times.append(init_ms)
+             validate_times.append(validate_ms)
+             gate_times.append(gate_ms)
+             run_records.append(gate_payload["run_record"])
+
+     targets = {
+         "init_mean_lt_ms": 350.0,
+         "validate_mean_lt_ms": 200.0,
+         "gate_mean_lt_ms": 300.0,
+     }
+     observed = {
+         "init": _stats(init_times),
+         "validate": _stats(validate_times),
+         "gate_run": _stats(gate_times),
+     }
+     return {
+         "iterations": iterations,
+         "observed": observed,
+         "targets": targets,
+         "meets_targets": {
+             "init": observed["init"]["mean_ms"] < targets["init_mean_lt_ms"],
+             "validate": observed["validate"]["mean_ms"] < targets["validate_mean_lt_ms"],
+             "gate_run": observed["gate_run"]["mean_ms"] < targets["gate_mean_lt_ms"],
+         },
+         "sample_run_records": run_records[:2],
+     }
124
+
125
+
+ def _benchmark_artifact_roundtrip() -> dict[str, Any]:
+     """Scaffold and validate one artifact per class, timing each step once."""
+     rows: list[dict[str, Any]] = []
+     scaffold_times: list[float] = []
+     validate_times: list[float] = []
+
+     for artifact_class in ARTIFACT_CLASSES:
+         with tempfile.TemporaryDirectory(prefix=f"orp-kernel-bench-{artifact_class}.") as td:
+             root = Path(td)
+             path = f"analysis/{artifact_class}.kernel.yml"
+             scaffold_ms, scaffold_proc = _timed_orp(
+                 root,
+                 "kernel",
+                 "scaffold",
+                 "--artifact-class",
+                 artifact_class,
+                 "--out",
+                 path,
+                 "--name",
+                 f"{artifact_class} benchmark",
+                 "--json",
+             )
+             validate_ms, validate_proc = _timed_orp(root, "kernel", "validate", path, "--json")
+             scaffold_payload = json.loads(scaffold_proc.stdout)
+             validate_payload = json.loads(validate_proc.stdout)
+             if not scaffold_payload.get("ok") or not validate_payload.get("ok"):
+                 raise RuntimeError(f"roundtrip benchmark failed for artifact_class={artifact_class}")
+             scaffold_times.append(scaffold_ms)
+             validate_times.append(validate_ms)
+             rows.append(
+                 {
+                     "artifact_class": artifact_class,
+                     "scaffold_ms": round(scaffold_ms, 3),
+                     "validate_ms": round(validate_ms, 3),
+                 }
+             )
+
+     observed = {
+         "scaffold": _stats(scaffold_times),
+         "validate": _stats(validate_times),
+     }
+     targets = {
+         "scaffold_mean_lt_ms": 200.0,
+         "validate_mean_lt_ms": 200.0,
+     }
+     return {
+         "artifact_classes_total": len(rows),
+         "rows": rows,
+         "observed": observed,
+         "targets": targets,
+         "meets_targets": {
+             "scaffold": observed["scaffold"]["mean_ms"] < targets["scaffold_mean_lt_ms"],
+             "validate": observed["validate"]["mean_ms"] < targets["validate_mean_lt_ms"],
+         },
+     }
180
+
181
+
+ def _benchmark_gate_modes() -> dict[str, Any]:
+     """Run hard, soft, and legacy structure_kernel gates against an invalid task artifact."""
+     with tempfile.TemporaryDirectory(prefix="orp-kernel-bench-gates.") as td:
+         root = Path(td)
+         # Deliberately incomplete task: no constraints or success_criteria.
+         _write_json(
+             root / "analysis" / "invalid-task.kernel.json",
+             {
+                 "schema_version": "1.0.0",
+                 "artifact_class": "task",
+                 "object": "terminal trace widget",
+                 "goal": "surface lane state and drift",
+                 "boundary": "terminal-first workflow",
+             },
+         )
+         _write_json(
+             root / "orp.kernel.bench.json",
+             {
+                 "profiles": {
+                     "hard": {
+                         "description": "hard kernel gate",
+                         "mode": "test",
+                         "packet_kind": "problem_scope",
+                         "gate_ids": ["kernel_hard"],
+                     },
+                     "soft": {
+                         "description": "soft kernel gate",
+                         "mode": "test",
+                         "packet_kind": "problem_scope",
+                         "gate_ids": ["kernel_soft"],
+                     },
+                     "legacy": {
+                         "description": "legacy structure kernel gate",
+                         "mode": "test",
+                         "packet_kind": "problem_scope",
+                         "gate_ids": ["kernel_legacy"],
+                     },
+                 },
+                 "gates": [
+                     {
+                         "id": "kernel_hard",
+                         "phase": "structure_kernel",
+                         "command": "true",
+                         "pass": {"exit_codes": [0]},
+                         "kernel": {
+                             "mode": "hard",
+                             "artifacts": [
+                                 {
+                                     "path": "analysis/invalid-task.kernel.json",
+                                     "artifact_class": "task",
+                                 }
+                             ],
+                         },
+                     },
+                     {
+                         "id": "kernel_soft",
+                         "phase": "structure_kernel",
+                         "command": "true",
+                         "pass": {"exit_codes": [0]},
+                         "kernel": {
+                             "mode": "soft",
+                             "artifacts": [
+                                 {
+                                     "path": "analysis/invalid-task.kernel.json",
+                                     "artifact_class": "task",
+                                 }
+                             ],
+                         },
+                     },
+                     {
+                         "id": "kernel_legacy",
+                         "phase": "structure_kernel",
+                         "command": "true",
+                         "pass": {"exit_codes": [0]},
+                     },
+                 ],
+             },
+         )
+
+         # The hard gate is expected to exit non-zero, so do not raise on failure.
+         hard_ms, hard_proc = _timed_orp(
+             root,
+             "--config",
+             "orp.kernel.bench.json",
+             "gate",
+             "run",
+             "--profile",
+             "hard",
+             "--json",
+             check=False,
+         )
+         soft_ms, soft_proc = _timed_orp(
+             root,
+             "--config",
+             "orp.kernel.bench.json",
+             "gate",
+             "run",
+             "--profile",
+             "soft",
+             "--json",
+         )
+         legacy_ms, legacy_proc = _timed_orp(
+             root,
+             "--config",
+             "orp.kernel.bench.json",
+             "gate",
+             "run",
+             "--profile",
+             "legacy",
+             "--json",
+         )
+
+         hard_payload = json.loads(hard_proc.stdout)
+         soft_payload = json.loads(soft_proc.stdout)
+         legacy_payload = json.loads(legacy_proc.stdout)
+
+         # Read the first gate result from each run record while the temp repo still exists.
+         hard_result = json.loads((root / hard_payload["run_record"]).read_text(encoding="utf-8"))["results"][0]
+         soft_result = json.loads((root / soft_payload["run_record"]).read_text(encoding="utf-8"))["results"][0]
+         legacy_result = json.loads((root / legacy_payload["run_record"]).read_text(encoding="utf-8"))["results"][0]
+
+         return {
+             "hard_mode": {
+                 "ms": round(hard_ms, 3),
+                 "exit_code": hard_proc.returncode,
+                 "overall": hard_payload["overall"],
+                 "kernel_valid": hard_result["kernel_validation"]["valid"],
+                 "missing_fields": hard_result["kernel_validation"]["artifacts"][0]["missing_fields"],
+             },
+             "soft_mode": {
+                 "ms": round(soft_ms, 3),
+                 "exit_code": soft_proc.returncode,
+                 "overall": soft_payload["overall"],
+                 "kernel_valid": soft_result["kernel_validation"]["valid"],
+             },
+             "legacy_compatibility": {
+                 "ms": round(legacy_ms, 3),
+                 "exit_code": legacy_proc.returncode,
+                 "overall": legacy_payload["overall"],
+                 "has_kernel_validation": "kernel_validation" in legacy_result,
+             },
+             "meets_expectations": {
+                 "hard_blocks_invalid_artifact": hard_proc.returncode == 1
+                 and hard_payload["overall"] == "FAIL"
+                 and hard_result["kernel_validation"]["valid"] is False,
+                 "soft_allows_invalid_artifact_with_advisory": soft_proc.returncode == 0
+                 and soft_payload["overall"] == "PASS"
+                 and soft_result["kernel_validation"]["valid"] is False,
+                 "legacy_structure_kernel_remains_compatible": legacy_proc.returncode == 0
+                 and legacy_payload["overall"] == "PASS"
+                 and "kernel_validation" not in legacy_result,
+             },
+         }
331
+
332
+
+ def _gather_metadata() -> dict[str, Any]:
+     """Collect the environment details recorded in the report's metadata block."""
+     package_version = json.loads((REPO_ROOT / "package.json").read_text(encoding="utf-8"))["version"]
+     commit = _run(["git", "rev-parse", "HEAD"]).stdout.strip()
+     branch = _run(["git", "rev-parse", "--abbrev-ref", "HEAD"]).stdout.strip()
+     node_version = _run(["node", "--version"]).stdout.strip()
+     return {
+         "generated_at_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
+         "repo_commit": commit,
+         "repo_branch": branch,
+         "package_version": package_version,
+         "python_version": sys.version.split()[0],
+         "node_version": node_version,
+         "platform": platform.platform(),
+     }
347
+
348
+
+ def build_report(iterations: int) -> dict[str, Any]:
+     """Run all three benchmarks and derive claim statuses from their results."""
+     init_benchmark = _benchmark_init_starter(iterations)
+     roundtrip_benchmark = _benchmark_artifact_roundtrip()
+     gate_mode_benchmark = _benchmark_gate_modes()
+
+     claims = [
+         {
+             "id": "starter_kernel_bootstrap",
+             "claim": "orp init seeds a valid starter kernel artifact and a passing default structure_kernel gate.",
+             # Always "pass" here: _benchmark_init_starter raises if bootstrap fails.
+             "status": "pass",
+             "evidence": [
+                 "benchmarks.init_starter_kernel",
+                 "cli/orp.py",
+                 "tests/test_orp_init.py",
+             ],
+         },
+         {
+             "id": "typed_artifact_roundtrip",
+             "claim": "All seven v0.1 artifact classes can be scaffolded and validated through the CLI.",
+             "status": "pass" if roundtrip_benchmark["artifact_classes_total"] == 7 else "fail",
+             "evidence": [
+                 "benchmarks.artifact_roundtrip",
+                 "spec/v1/kernel.schema.json",
+                 "tests/test_orp_kernel.py",
+             ],
+         },
+         {
+             "id": "promotion_enforcement_modes",
+             "claim": "Hard mode blocks invalid promotable artifacts, while soft mode records advisory issues without blocking.",
+             "status": "pass"
+             if gate_mode_benchmark["meets_expectations"]["hard_blocks_invalid_artifact"]
+             and gate_mode_benchmark["meets_expectations"]["soft_allows_invalid_artifact_with_advisory"]
+             else "fail",
+             "evidence": [
+                 "benchmarks.gate_modes",
+                 "tests/test_orp_kernel.py",
+             ],
+         },
+         {
+             "id": "legacy_structure_kernel_compatibility",
+             "claim": "Existing structure_kernel gates without explicit kernel config remain compatible.",
+             "status": "pass"
+             if gate_mode_benchmark["meets_expectations"]["legacy_structure_kernel_remains_compatible"]
+             else "fail",
+             "evidence": [
+                 "benchmarks.gate_modes",
+                 "cli/orp.py",
+             ],
+         },
+         {
+             "id": "local_cli_kernel_ergonomics",
+             "claim": "One-shot kernel CLI operations remain within human-scale local ergonomics targets on the reference machine.",
+             "status": "pass"
+             if all(init_benchmark["meets_targets"].values())
+             and all(roundtrip_benchmark["meets_targets"].values())
+             else "fail",
+             "evidence": [
+                 "benchmarks.init_starter_kernel",
+                 "benchmarks.artifact_roundtrip",
+             ],
+         },
+     ]
+
+     return {
+         "schema_version": "1.0.0",
+         "kind": "orp_reasoning_kernel_validation_report",
+         "metadata": _gather_metadata(),
+         "benchmarks": {
+             "init_starter_kernel": init_benchmark,
+             "artifact_roundtrip": roundtrip_benchmark,
+             "gate_modes": gate_mode_benchmark,
+         },
+         "claims": claims,
+         "summary": {
+             "all_claims_pass": all(row["status"] == "pass" for row in claims),
+             "artifact_classes_total": roundtrip_benchmark["artifact_classes_total"],
+             "all_performance_targets_met": all(init_benchmark["meets_targets"].values())
+             and all(roundtrip_benchmark["meets_targets"].values()),
+         },
+     }
429
+
430
+
+ def main() -> int:
+     """Build the report, optionally write it to --out, print it, and return an exit code."""
+     parser = argparse.ArgumentParser(description="Benchmark and validate ORP Reasoning Kernel v0.1")
+     parser.add_argument("--out", default="", help="Optional JSON output path")
+     parser.add_argument("--iterations", type=int, default=5, help="Iterations for bootstrap benchmark")
+     parser.add_argument("--quick", action="store_true", help="Use a single bootstrap iteration for fast checks")
+     args = parser.parse_args()
+
+     iterations = 1 if args.quick else max(1, args.iterations)
+     report = build_report(iterations)
+     payload = json.dumps(report, indent=2) + "\n"
+     if args.out:
+         out_path = Path(args.out)
+         # Relative output paths are resolved against the repository root.
+         if not out_path.is_absolute():
+             out_path = REPO_ROOT / out_path
+         out_path.parent.mkdir(parents=True, exist_ok=True)
+         out_path.write_text(payload, encoding="utf-8")
+     print(payload, end="")
+     return 0 if report["summary"]["all_claims_pass"] else 1
+
+
+ if __name__ == "__main__":
+     raise SystemExit(main())