open-research-protocol 0.4.6 → 0.4.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. package/README.md +10 -0
  2. package/cli/orp.py +668 -43
  3. package/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md +125 -0
  4. package/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md +97 -0
  5. package/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md +100 -0
  6. package/docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md +116 -0
  7. package/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md +86 -0
  8. package/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md +261 -0
  9. package/docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md +131 -0
  10. package/docs/ORP_REASONING_KERNEL_EVOLUTION.md +123 -0
  11. package/docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md +107 -0
  12. package/docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md +471 -0
  13. package/docs/ORP_REASONING_KERNEL_V0_1.md +15 -0
  14. package/docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json +796 -0
  15. package/docs/benchmarks/orp_reasoning_kernel_agent_replication_task_smoke.json +487 -0
  16. package/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_1.json +1927 -0
  17. package/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json +10217 -0
  18. package/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_task_smoke.json +174 -0
  19. package/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json +598 -0
  20. package/docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json +688 -0
  21. package/docs/benchmarks/orp_reasoning_kernel_continuation_task_smoke.json +150 -0
  22. package/docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json +448 -0
  23. package/docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json +594 -0
  24. package/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json +925 -0
  25. package/examples/README.md +2 -0
  26. package/examples/kernel/comparison/comparison-corpus.json +337 -0
  27. package/examples/kernel/comparison/next-task-continuation.json +55 -0
  28. package/examples/kernel/corpus/operations/habanero-routing.checkpoint.kernel.yml +12 -0
  29. package/examples/kernel/corpus/operations/runner-routing.policy.kernel.yml +9 -0
  30. package/examples/kernel/corpus/product/project-home.decision.kernel.yml +11 -0
  31. package/examples/kernel/corpus/research/kernel-handoff.experiment.kernel.yml +16 -0
  32. package/examples/kernel/corpus/research/lane-drift.hypothesis.kernel.yml +11 -0
  33. package/examples/kernel/corpus/software/trace-widget.task.kernel.yml +13 -0
  34. package/examples/kernel/corpus/writing/kernel-launch.result.kernel.yml +12 -0
  35. package/package.json +4 -1
  36. package/scripts/orp-kernel-agent-pilot.py +673 -0
  37. package/scripts/orp-kernel-agent-replication.py +307 -0
  38. package/scripts/orp-kernel-benchmark.py +921 -0
  39. package/scripts/orp-kernel-canonical-continuation.py +381 -0
  40. package/scripts/orp-kernel-ci-check.py +138 -0
  41. package/scripts/orp-kernel-comparison.py +592 -0
  42. package/scripts/orp-kernel-continuation-pilot.py +384 -0
  43. package/scripts/orp-kernel-pickup.py +401 -0
  44. package/spec/v1/kernel-extension.schema.json +96 -0
  45. package/spec/v1/kernel-proposal.schema.json +115 -0
  46. package/spec/v1/kernel.schema.json +2 -1
package/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md
@@ -0,0 +1,261 @@
+ # ORP Reasoning Kernel Evaluation Plan
+
+ This document turns the remaining kernel evidence gaps into concrete next
+ experiments.
+
+ The goal is to upgrade the kernel from:
+
+ - technically valid and operationally useful
+
+ to:
+
+ - comparatively validated against real alternatives and real project outcomes
+
+ Supporting references:
+
+ - [docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md)
+ - [docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md)
+ - [docs/ORP_REASONING_KERNEL_AGENT_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md)
+ - [docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md)
+ - [docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md)
+ - [docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md)
+ - [docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md)
+ - [docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md)
+
+ ## Evaluation Principles
+
+ Every kernel evaluation should be:
+
+ - comparative, not just descriptive
+ - cross-domain where possible
+ - judged by downstream usefulness, not only schema validity
+ - reproducible and artifact-backed
+
+ The main alternatives to compare against are:
+
+ 1. Free-form artifact writing
+ 2. Generic checklist artifact writing
+ 3. ORP typed kernel artifact writing
+
+ The current internal package now includes:
+
+ - deterministic structural comparison
+ - deterministic pickup proxy
+ - live fresh-agent recoverability simulation
+ - live fresh-agent replication
+ - softer downstream continuation
+ - harder canonical downstream continuation
+ - CI threshold checks over the committed benchmark package
+
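The CI threshold checks mentioned above can be pictured as a small guard over a committed benchmark JSON. This is a minimal sketch, not the real `scripts/orp-kernel-ci-check.py`; the metric names (`mean_score`, `invention_rate`) and threshold values are hypothetical.

```python
import json

# Hypothetical thresholds guarding a committed benchmark package.
THRESHOLDS = {
    "mean_score": ("min", 0.80),     # mean structural score must not regress
    "invention_rate": ("max", 0.0),  # invented fields must stay at zero
}

def check_benchmark(raw: str) -> list:
    """Return a list of human-readable threshold violations."""
    results = json.loads(raw)
    failures = []
    for field, (kind, limit) in THRESHOLDS.items():
        value = results[field]
        if kind == "min" and value < limit:
            failures.append(f"{field}={value} fell below {limit}")
        if kind == "max" and value > limit:
            failures.append(f"{field}={value} exceeded {limit}")
    return failures

# A passing package and a regressing one.
ok = json.dumps({"mean_score": 0.91, "invention_rate": 0.0})
bad = json.dumps({"mean_score": 0.74, "invention_rate": 0.1})
```

A CI job built this way fails loudly when a committed benchmark regresses, instead of letting evidence drift silently.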
+ ## Experiment 1: Artifact Clarity And Completeness
+
+ Status:
+
+ - a first deterministic internal comparison harness now exists
+ - what remains is blinded human scoring and a larger prompt set
+
+ ### Question
+
+ Does the ORP kernel produce more complete and legible promotable artifacts than
+ free-form writing or a generic checklist?
+
+ ### Setup
+
+ - Select 20 prompts spread across:
+   - software
+   - research
+   - product/design
+   - operations
+   - writing/knowledge work
+ - For each prompt, produce three artifact versions:
+   - free-form
+   - generic checklist
+   - ORP kernel
+
+ ### Scoring
+
+ Blind-review each artifact against:
+
+ - artifact type clarity
+ - boundary clarity
+ - constraint clarity
+ - evaluation clarity
+ - handoff readiness
+ - ambiguity remaining
+
+ ### Primary metric
+
+ - mean reviewer score by condition
+
+ ### Success criterion
+
+ - kernel condition beats free-form and generic checklist on at least four of
+   six scoring dimensions
+
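The primary metric above reduces to a simple aggregation over blinded scores. A sketch, assuming scores are collected as (condition, dimension, score) rows; the condition names mirror the three alternatives listed earlier, and the score values are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (condition, scoring dimension, blinded reviewer score 0-5) rows.
rows = [
    ("free_form", "boundary clarity", 2),
    ("free_form", "handoff readiness", 3),
    ("checklist", "boundary clarity", 3),
    ("checklist", "handoff readiness", 3),
    ("kernel", "boundary clarity", 5),
    ("kernel", "handoff readiness", 4),
]

def mean_score_by_condition(rows):
    """Aggregate blinded reviewer scores into one mean per condition."""
    by_condition = defaultdict(list)
    for condition, _dimension, score in rows:
        by_condition[condition].append(score)
    return {c: mean(scores) for c, scores in by_condition.items()}
```

The per-dimension breakdown (needed for the "four of six dimensions" criterion) is the same aggregation keyed on (condition, dimension) instead of condition alone.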
+ ## Experiment 2: Handoff Pickup Study
+
+ Status:
+
+ - a first deterministic pickup proxy now exists
+ - a first live Codex pickup simulation now exists
+ - a `10`-repeat live Codex replication layer now exists with per-field stability tables
+ - what remains is live human pickup measurement and broader cross-model replication
+
+ ### Question
+
+ Does the ORP kernel improve pickup quality for another human or agent?
+
+ ### Setup
+
+ - Use matched artifacts from Experiment 1
+ - Give a second operator one artifact at a time and ask them to:
+   - explain the task
+   - state the constraints
+   - identify success criteria
+   - identify next action
+
+ ### Scoring
+
+ - time to correct interpretation
+ - interpretation accuracy
+ - number of clarifying questions required
+
+ ### Primary metric
+
+ - successful pickup rate without clarification
+
+ ### Success criterion
+
+ - kernel artifacts reduce clarifying questions and increase correct pickup rate
+
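The primary metric above can be computed directly from per-trial records. A hedged sketch with invented example data; the record shape (`condition`, `correct`, `questions`) is an assumption, not the harness's actual schema.

```python
# Each trial: did the second operator pick up correctly, and how many
# clarifying questions did they need? Values are illustrative only.
trials = [
    {"condition": "kernel", "correct": True, "questions": 0},
    {"condition": "kernel", "correct": True, "questions": 1},
    {"condition": "free_form", "correct": True, "questions": 3},
    {"condition": "free_form", "correct": False, "questions": 2},
]

def clean_pickup_rate(trials, condition):
    """Share of trials picked up correctly with zero clarifying questions."""
    matching = [t for t in trials if t["condition"] == condition]
    clean = [t for t in matching if t["correct"] and t["questions"] == 0]
    return len(clean) / len(matching)
```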
+ ## Experiment 3: Downstream Execution Quality
+
+ Status:
+
+ - a softer live continuation benchmark exists
+ - a harder canonical continuation benchmark now exists
+ - what remains is repeated canonical continuation, cross-model continuation,
+   and true downstream execution quality beyond artifact production
+
+ ### Question
+
+ Does kernel-structured promotion improve downstream execution or review
+ success?
+
+ ### Setup
+
+ - Choose a fixed set of implementation or research tasks
+ - Feed matched task artifacts to agents or operators
+ - Compare execution using:
+   - free-form task definition
+   - generic checklist task definition
+   - kernel task artifact
+
+ ### Scoring
+
+ - completion rate
+ - rework rate
+ - reviewer acceptance
+ - alignment with stated constraints
+ - mismatch between claimed and delivered outcome
+
+ ### Primary metric
+
+ - accepted completion rate with minimal rework
+
+ ### Success criterion
+
+ - kernel condition improves acceptance or reduces rework materially
+
+ ## Experiment 4: Cross-Domain Corpus Fit
+
+ ### Question
+
+ Do the current kernel artifact classes fit real project work across domains?
+
+ ### Setup
+
+ - Collect 50 to 100 real artifacts across:
+   - software
+   - research
+   - design/product
+   - ops/reliability
+   - writing/editorial
+ - Map each artifact into the current kernel classes
+
+ ### Scoring
+
+ - clean fit
+ - awkward fit
+ - no fit
+ - missing field pressure
+ - repeated need for a new field or class
+
+ ### Primary metric
+
+ - percent of artifacts that map cleanly without schema strain
+
+ ### Success criterion
+
+ - at least 80 percent clean fit across the corpus
+
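The clean-fit percentage and the 80 percent bar above are a one-line computation once each artifact's mapping outcome is recorded. A sketch with invented counts; the outcome labels follow the scoring list above.

```python
from collections import Counter

# Mapping outcome per real artifact: "clean", "awkward", or "none".
# The 43/5/2 split is invented for illustration (50 artifacts total).
fits = ["clean"] * 43 + ["awkward"] * 5 + ["none"] * 2

def clean_fit_percent(fits):
    """Percent of artifacts that mapped cleanly without schema strain."""
    counts = Counter(fits)
    return 100 * counts["clean"] / len(fits)

meets_bar = clean_fit_percent(fits) >= 80  # success criterion from above
```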
+ ## Experiment 5: Operational Warning Value
+
+ ### Question
+
+ Do soft-mode kernel warnings predict later rework or low-quality promotion?
+
+ ### Setup
+
+ - instrument real ORP repos using soft-mode kernel gates
+ - log:
+   - warning presence
+   - later edits to the same artifact
+   - eventual hard-mode pass/fail
+   - downstream rework indicators
+
+ ### Primary metric
+
+ - correlation between early warnings and later rework
+
+ ### Success criterion
+
+ - warnings show predictive value strong enough to justify continued soft-mode
+   emphasis
+
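The warning-vs-rework correlation above is a Pearson coefficient between a binary warning indicator and a rework count (a point-biserial correlation). A sketch over invented observations, not collected data.

```python
from statistics import mean

# Per-artifact observations: soft-mode warning present (0/1) and later
# rework events. Values are illustrative, not measured.
warnings = [1, 1, 0, 0, 1, 0, 0, 1]
rework   = [3, 2, 0, 1, 2, 0, 1, 3]

def pearson(xs, ys):
    """Plain Pearson correlation; with a 0/1 variable this is the
    point-biserial correlation."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(warnings, rework)
```

A strongly positive `r` on real repo data would be the evidence the success criterion asks for.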
+ ## Suggested Order
+
+ The best order is:
+
+ 1. Artifact clarity and completeness
+ 2. Handoff pickup
+ 3. Cross-domain corpus fit
+ 4. Downstream execution quality
+ 5. Operational warning value
+
+ That sequence gives the fastest evidence on whether the kernel is genuinely
+ useful before investing in longer operational studies.
+
+ ## Minimal Evidence Package To Upgrade v0.1 Claims
+
+ If we want to move from “strong implementation” to “serious comparative
+ validation,” the smallest next package is:
+
+ 1. a 20-prompt comparative artifact study
+ 2. a pickup study on those same artifacts
+ 3. a cross-domain corpus fit table
+
+ That would be enough to substantially strengthen the kernel’s public claims.
+
+ ## Bottom Line
+
+ The kernel is already technically real and operationally validated.
+
+ What remains is comparative evidence:
+
+ - better than free-form?
+ - better than a generic checklist?
+ - good across real domains?
+ - useful in real handoffs?
+
+ This plan defines how to answer those questions cleanly.
package/docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md
@@ -0,0 +1,131 @@
+ # ORP Reasoning Kernel Evidence Matrix
+
+ This document separates what the ORP Reasoning Kernel currently proves from
+ what it only suggests.
+
+ Its purpose is to prevent the kernel from being over-claimed. The kernel is
+ stronger when we can say, precisely:
+
+ - what is already validated
+ - what is only partially supported
+ - what is still unproven
+ - what experiment would close the gap
+
+ Supporting references:
+
+ - [docs/ORP_REASONING_KERNEL_V0_1.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_V0_1.md)
+ - [docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md)
+ - [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json)
+ - [docs/ORP_REASONING_KERNEL_AGENT_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md)
+ - [docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md)
+ - [docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md)
+ - [docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md)
+
+ ## Evidence Grades
+
+ - `A`
+   Directly supported by shipped implementation, tests, and repeatable benchmark
+   evidence in this repo.
+ - `B`
+   Strongly supported by implementation behavior and design logic, but still
+   missing comparative or external validation.
+ - `C`
+   Plausible and well-motivated, but not yet measured directly.
+ - `D`
+   Strategic aspiration only. No meaningful validation yet.
+
+ ## What Is Sealed For v0.1
+
+ These claims are strong enough to treat as validated implementation truths for
+ the current kernel release.
+
+ | Claim | Grade | Current Evidence | Why It Matters |
+ | --- | --- | --- | --- |
+ | ORP has a real typed kernel artifact surface. | A | [spec/v1/kernel.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel.schema.json), [cli/orp.py](/Volumes/Code_2TB/code/orp/cli/orp.py) | The kernel is not just prose. It is an enforceable CLI surface. |
+ | `orp init` seeds a valid starter kernel artifact and validates it in the default flow. | A | [tests/test_orp_init.py](/Volumes/Code_2TB/code/orp/tests/test_orp_init.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | New repos get the kernel by default instead of needing manual adoption. |
+ | All seven v0.1 artifact classes can scaffold and validate successfully. | A | [tests/test_orp_kernel.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel is broad enough for multiple project artifact types. |
+ | Hard mode blocks invalid promotable artifacts. | A | [tests/test_orp_kernel.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | ORP can enforce structural promotion standards rather than only advising. |
+ | Soft mode records invalidity without blocking work. | A | [tests/test_orp_kernel.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | ORP can stay fluid at intake while still surfacing missing structure. |
+ | Existing `structure_kernel` gates remain compatible when no explicit kernel config is present. | A | [tests/test_orp_kernel.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel does not silently break earlier ORP configurations. |
+ | One-shot local kernel CLI operations are within human-scale latency on the reference machine. | A | [scripts/orp-kernel-benchmark.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-benchmark.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel is operationally lightweight enough to use during normal work. |
+ | A small cross-domain reference corpus fits the current class set cleanly. | A | [examples/kernel/corpus](/Volumes/Code_2TB/code/orp/examples/kernel/corpus), [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel now has explicit cross-domain fit evidence, not only rationale. |
+ | Each artifact class rejects a candidate when a required field is removed. | A | [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | Class-specific enforcement is directly proven instead of inferred from a subset of cases. |
+ | The CLI validator stays aligned with the published kernel schema. | A | [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel no longer relies on an undocumented validator rule set drifting away from the schema. |
+ | Equivalent YAML and JSON artifacts validate to the same semantic result. | A | [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The protocol is representation-stable rather than format-sensitive. |
+ | The validator rejects adversarial near-miss artifacts. | A | [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel is stronger against malformed or gameable inputs than before. |
+ | On a matched internal comparison corpus, kernel artifacts outperform both free-form and generic checklist artifacts on structural scoring. | A | [docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json), [scripts/orp-kernel-comparison.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-comparison.py) | ORP now has direct comparative evidence for structural artifact quality on a matched internal corpus, not only rationale. |
+ | On a matched internal pickup proxy, kernel artifacts preserve more explicit handoff-critical information than both free-form and generic checklist artifacts. | A | [docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json), [scripts/orp-kernel-pickup.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-pickup.py) | ORP now has a second comparative signal showing that kernel structure turns into more explicit pickup value, not just fuller-looking artifacts. |
+ | On a matched live Codex recoverability simulation, kernel artifacts preserve full required-field recoverability, outperform free-form artifacts on all matched cases, and outperform generic checklist artifacts on average without per-case losses. | A | [docs/ORP_REASONING_KERNEL_AGENT_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json), [scripts/orp-kernel-agent-pilot.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-agent-pilot.py) | ORP now has direct in-environment agent evidence that the kernel’s structural advantage survives contact with a real fresh downstream Codex session. |
+ | On a `10`-repeat full-corpus live Codex replication pilot, the kernel’s recoverability advantage stays stable across fresh-session reruns, with zero invention, no run-level losses, and perfect per-field stability on required kernel fields. | A | [docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md), [docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json), [scripts/orp-kernel-agent-replication.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-agent-replication.py) | ORP now has stronger repeatability evidence that the live agent result is not just a single-run artifact and that the structural advantage survives at field level, not only in aggregate means. |
+ | On a matched full-corpus live continuation pilot, kernel artifacts support the strongest continuation score, never underperform the generic checklist baseline, and keep invention at zero. | A | [docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json), [scripts/orp-kernel-continuation-pilot.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-continuation-pilot.py) | ORP now has direct agent-first evidence that kernel artifacts are not only recoverable, but also a safe and effective base for downstream continuation. |
+ | On a harder matched full-corpus canonical continuation pilot, kernel artifacts beat free-form on every case, beat checklist on average, and keep the lowest invention rate while revealing checklist as a real competitive baseline on some cases. | A | [docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json), [scripts/orp-kernel-canonical-continuation.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-canonical-continuation.py) | ORP now has a stricter downstream-agent benchmark where the task is not merely “continue safely,” but “produce the next canonical artifact” without inventing unsupported structure. |
+
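The required-field ablation claim in the table above (removing any required field must make validation fail) can be illustrated with a toy validator. This is a hedged sketch only: the class names and field sets here are illustrative stand-ins, not the actual requirements in `spec/v1/kernel.schema.json`.

```python
# Toy required-field sets per artifact class (illustrative, not the real
# kernel schema requirements).
REQUIRED = {
    "task": {"object", "constraints", "success_criteria"},
    "decision": {"question", "chosen_path", "consequences"},
}

def validate(kind, artifact):
    """An artifact is valid when every required field is present."""
    return REQUIRED[kind] <= artifact.keys()

def ablation_holds(kind, artifact):
    """Removing any single required field must make validation fail."""
    if not validate(kind, artifact):
        return False
    for field in REQUIRED[kind]:
        ablated = {k: v for k, v in artifact.items() if k != field}
        if validate(kind, ablated):
            return False
    return True

task = {"object": "ship widget", "constraints": ["no API change"],
        "success_criteria": "tests pass"}
```

The corpus tests referenced in the table run this kind of loop over every class, which is why class-specific enforcement counts as directly proven rather than inferred.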
+ ## What Is Strong But Not Fully Sealed
+
+ These claims are directionally convincing, but still need comparative or
+ broader validation before they should be presented as fully proven.
+
+ | Claim | Grade | Current Evidence | Missing Evidence | Best Next Experiment |
+ | --- | --- | --- | --- | --- |
+ | The seven chosen artifact classes are a good universal first set. | B | Cross-domain rationale, passing roundtrip coverage, a small five-domain reference corpus, and full required-field enforcement across all classes | Larger real-world corpus coverage across multiple project types | Cross-domain corpus fit study at real-project scale |
+ | The kernel reduces accidental chat-to-truth drift. | B | ORP’s promotion model and hard/soft gate behavior | Before/after repo evidence showing fewer ambiguous promotable artifacts | Comparative artifact hygiene study |
+ | The kernel is better than a generic checklist. | B | Typed classes give stronger semantic requirements than one generic field list, the internal comparison pilot shows a structural scoring advantage, the live Codex pilot shows a higher mean recoverability score with no per-case losses, and the harder canonical continuation pilot still favors the kernel on aggregate while showing checklist is genuinely competitive | Blinded human review and downstream task outcomes | Structured baseline comparison |
+ | The kernel is better than free-form artifact writing alone. | B | Implementation logic is strong, promotion semantics exist, the internal comparison and pickup pilots favor the kernel, and the live Codex pilot shows a `7/7` case advantage on recoverability | Outcome comparison on real artifacts and reviewers | Free-form vs kernel artifact review study |
+ | The kernel can scale across agents and handoffs. | B | Machine-checkable artifacts, visible run traces, a matched internal pickup proxy, a live Codex recoverability pilot showing stronger downstream field recovery, and first replication/continuation smokes that preserve the same general pattern | Multi-operator pickup data across models and humans | Handoff pickup experiment |
+
+ ## What Is Still Unproven
+
+ These are important claims, but they should not be stated as established fact
+ yet.
+
+ | Claim | Grade | Why It Is Not Yet Proven | What Would Prove It |
+ | --- | --- | --- | --- |
+ | Kernel use improves downstream project outcomes. | D | No controlled comparison yet between kernel and non-kernel workflows | Comparative study over real tasks with outcome scoring |
+ | Kernel-aware agents produce better work than agents without kernel structure. | D | No A/B benchmark on identical prompts and acceptance criteria | Agent benchmark across matched tasks |
+ | Kernel warnings correlate with later rework or quality failures. | D | No operational data collection yet | Longitudinal study on soft-mode warnings vs later edits |
+ | The current class set is close to optimal. | D | No competing class-model comparison or pruning study | Corpus analysis plus ablation tests |
+ | The kernel is equally suitable across software, research, ops, and writing domains. | C | Small cross-domain corpus fit is good, but no broad real-project or reviewer evidence yet | Cross-domain corpus and reviewer study |
+
+ ## Research Questions That Matter Most
+
+ If we want to move from “technically valid” to “research-grade validated,” the
+ highest-value questions are:
+
+ 1. Does the kernel reduce ambiguity relative to free-form project artifacts?
+ 2. Does the kernel improve handoff pickup quality for humans and agents?
+ 3. Does the kernel improve downstream implementation or review success?
+ 4. Is the typed class model actually better than a simpler generic checklist?
+ 5. Where does the current class set fail across real project domains?
+
+ ## Recommended Public Wording Right Now
+
+ These are safe claims now:
+
+ - ORP ships a typed reasoning-kernel artifact layer.
+ - ORP can scaffold, validate, and gate promotable kernel artifacts.
+ - ORP supports hard and soft promotion semantics for kernel validation.
+ - ORP includes repeatable benchmark evidence for the current implementation.
+
+ These should still be framed as goals or hypotheses:
+
+ - the kernel improves project outcomes
+ - the kernel is superior to other artifact-structuring approaches
+ - the current class set is the right final ontology
+
+ ## Bottom Line
+
+ The kernel is sealed for `v0.1` as an implementation and protocol surface.
+
+ Its internal validity is now much stronger than a simple happy-path release:
+
+ - schema and validator rules align directly
+ - every required field is exercised through ablation
+ - equivalent YAML and JSON artifacts behave the same
+ - adversarial near-miss artifacts are rejected
+ - a small cross-domain corpus fits the current class set
+ - a live fresh-agent Codex pilot shows the kernel’s recoverability advantage survives beyond deterministic local scoring
+ - a `10`-repeat full-corpus replication pilot and two levels of continuation pilot suggest that advantage is reasonably stable and carries into downstream agent continuation behavior
+
+ It is not yet sealed as a universally outcome-superior project methodology.
+
+ That is a good place to be if we stay explicit about the difference.
package/docs/ORP_REASONING_KERNEL_EVOLUTION.md
@@ -0,0 +1,123 @@
+ # ORP Reasoning Kernel Evolution
+
+ This document defines how the ORP kernel should get stronger over time without
+ becoming slippery.
+
+ ## Core Rule
+
+ The kernel should be:
+
+ - stable as a contract
+ - adaptive through governed evolution
+ - never silently rewritten by agents on the fly
+
+ Short form:
+
+ - stable core
+ - observable pressure
+ - explicit evolution
+
+ ## What Should Stay Stable
+
+ The current core kernel remains the canonical source of truth for:
+
+ - artifact classes
+ - required-field rules
+ - hard vs soft promotion behavior
+ - machine-readable kernel validation in `RUN.json`
+
+ Those semantics live in:
+
+ - [spec/v1/kernel.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel.schema.json)
+ - [cli/orp.py](/Volumes/Code_2TB/code/orp/cli/orp.py)
+
+ The kernel should not self-mutate from a single chat, a single agent guess, or
+ one repo’s habits.
+
+ ## What Should Evolve
+
+ The kernel should evolve from evidence about real use:
+
+ - repeated missing fields
+ - repeated field invention
+ - repeated continuation failures
+ - recurring requests for the same extra structure
+ - extension fields that become broadly useful
+
+ That evidence should shape proposals and migrations, not rewrite the core
+ implicitly.
+
+ ## CLI Surfaces
+
+ ORP now exposes three explicit kernel-evolution surfaces:
+
+ ### `orp kernel stats`
+
+ Observe real kernel-validation pressure from `RUN.json` artifacts.
+
+ Use it to answer questions like:
+
+ - which fields are repeatedly missing?
+ - which artifact classes fail most often?
+ - where is the current kernel strained in live repo usage?
+
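Conceptually, the pressure that `orp kernel stats` surfaces is a tally over kernel-validation records. A hedged sketch of that idea only; the record shape (`artifact_class`, `missing_fields`) is a hypothetical stand-in, not the actual `RUN.json` layout.

```python
from collections import Counter

# Hypothetical kernel-validation records gathered from RUN.json artifacts.
records = [
    {"artifact_class": "task", "missing_fields": ["success_criteria"]},
    {"artifact_class": "task", "missing_fields": ["success_criteria", "constraints"]},
    {"artifact_class": "decision", "missing_fields": []},
    {"artifact_class": "checkpoint", "missing_fields": ["risks"]},
]

def field_pressure(records):
    """Tally which fields are repeatedly missing and which classes fail most."""
    missing = Counter()
    failing_classes = Counter()
    for rec in records:
        missing.update(rec["missing_fields"])
        if rec["missing_fields"]:
            failing_classes[rec["artifact_class"]] += 1
    return missing, failing_classes
```

Repeated entries in these tallies are exactly the kind of observable pressure that should drive a proposal rather than an ad-hoc schema change.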
+ ### `orp kernel propose`
+
+ Scaffold a governed kernel-evolution proposal artifact.
+
+ Use it for changes like:
+
+ - adding a field
+ - introducing a new artifact class
+ - changing a requirement
+ - deprecating an old field
+
+ Proposal shape is governed by:
+
+ - [spec/v1/kernel-proposal.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel-proposal.schema.json)
+
+ ### `orp kernel migrate`
+
+ Rewrite an artifact into the current canonical field order and schema version.
+
+ Use it to:
+
+ - normalize older artifacts
+ - apply explicit schema-version upgrades
+ - preserve stable truth while the kernel evolves
+
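The normalization that `orp kernel migrate` performs can be pictured as reordering fields into a canonical sequence while dropping nothing. A sketch under stated assumptions: the canonical order and field names below are invented for illustration, and the real behavior is defined by `cli/orp.py` and `spec/v1/kernel.schema.json`.

```python
# Hypothetical canonical field order for one artifact class; the real
# order comes from the published kernel schema.
CANONICAL_ORDER = ["schema_version", "class", "object", "constraints",
                   "success_criteria"]

def migrate(artifact, target_version="v1"):
    """Rewrite an artifact into canonical field order and schema version,
    keeping unknown fields at the end so no recorded truth is dropped."""
    migrated = {"schema_version": target_version}
    for field in CANONICAL_ORDER:
        if field in artifact and field != "schema_version":
            migrated[field] = artifact[field]
    for field, value in artifact.items():
        if field not in migrated:
            migrated[field] = value
    return migrated

old = {"constraints": ["no new deps"], "object": "trace widget",
       "class": "task", "schema_version": "v0"}
```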
+ ## Extensions
+
+ The best place for new pressure to land first is usually not the core kernel.
+ It should begin as an extension or proposal before becoming universal.
+
+ Extension shape is defined in:
+
+ - [spec/v1/kernel-extension.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel-extension.schema.json)
+
+ That gives ORP a place to trial domain-specific structure without forcing it
+ into every project prematurely.
+
+ ## Recommended Kernel Evolution Loop
+
+ 1. Observe pressure with `orp kernel stats`
+ 2. Write an explicit proposal with `orp kernel propose`
+ 3. Test the proposal against the benchmark corpus and live agent pickup/continuation
+ 4. If accepted, version the schema deliberately
+ 5. Normalize older artifacts with `orp kernel migrate`
+ 6. Protect the committed evidence package with CI threshold checks
+
+ ## Non-Goal
+
+ The kernel should not become a hidden adaptive prompt system that silently
+ changes what truth means.
+
+ The repository should always be able to answer:
+
+ - which kernel version is in effect?
+ - why was it changed?
+ - what evidence justified the change?
+ - how do older artifacts migrate safely?
+
+ That is the ORP standard for a living protocol: dynamic in evidence, explicit
+ in truth.
# ORP Reasoning Kernel Pickup Pilot

This document records the first in-repo pickup and handoff proxy for the ORP
Reasoning Kernel.

Supporting artifact:

- [docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json](docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json)

Supporting corpus and harness:

- [examples/kernel/comparison/comparison-corpus.json](examples/kernel/comparison/comparison-corpus.json)
- [scripts/orp-kernel-pickup.py](scripts/orp-kernel-pickup.py)

## What This Pilot Measures

This pilot measures **explicit pickup readiness** on a matched internal corpus.

For each artifact class, it asks whether a downstream operator could recover
the class-specific pickup targets directly from the artifact itself.

Example pickup targets per class:

- `task`
  - objective
  - constraints
  - success criteria
- `decision`
  - question
  - chosen path
  - consequences
- `checkpoint`
  - current state
  - risks
  - next handoff target

The pilot compares:

1. free-form artifact writing
2. generic checklist artifact writing
3. ORP typed kernel artifact writing
+
43
+ ## What This Is And Is Not
44
+
45
+ This is a **pickup proxy**, not a full live handoff study.
46
+
47
+ It does **not** prove:
48
+
49
+ - time-to-understanding with real operators
50
+ - clarification count under live review
51
+ - downstream execution quality
52
+
53
+ It **does** prove something narrower and still valuable:
54
+
55
+ how much pickup-critical information remains explicitly recoverable in the
56
+ artifact itself.
57
+
58
+ ## Current Result
59
+
60
+ On the matched internal corpus, the current report shows:
61
+
62
+ - kernel mean pickup score: `1.000`
63
+ - generic checklist mean pickup score: `0.743`
64
+ - free-form mean pickup score: `0.452`
65
+
66
+ Pairwise result:
67
+
68
+ - kernel beats generic checklist on `7/7` cases
69
+ - kernel beats free-form on `7/7` cases
70
+ - generic checklist beats free-form on `7/7` cases
71
+
72
+ Additional result:
73
+
74
+ - kernel keeps all pickup targets explicitly answerable on the matched corpus
75
+
76
+ ## Why This Matters
77
+
78
+ This pilot strengthens the kernel story in a way the pure structure benchmark
79
+ could not.
80
+
81
+ The earlier comparison pilot showed that kernel artifacts are structurally
82
+ fuller than the simpler alternatives.
83
+
84
+ This pickup pilot shows that the added structure is not decorative. It turns
85
+ into directly recoverable handoff value.
86
+
87
+ That is important because ORP is not trying to optimize for pretty artifacts.
88
+ It is trying to optimize for artifacts that another human or agent can pick up
89
+ and continue without confusion.
90
+
91
+ ## Honest Caveat
92
+
93
+ This pilot still does not seal the kernel as a universally outcome-superior
94
+ methodology.
95
+
96
+ It is stronger evidence than a rationale-only claim, but it remains an
97
+ internal, deterministic proxy. The next step after this is still a live
98
+ human/agent pickup study as described in
99
+ [docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md).
100
+
101
+ ## Bottom Line
102
+
103
+ The pickup pilot makes the kernel harder to dismiss.
104
+
105
+ We now have evidence that on a matched internal corpus, kernel artifacts do
106
+ not just score as more structured. They also preserve more explicit handoff
107
+ value than free-form and generic checklist alternatives.