open-research-protocol 0.4.6 → 0.4.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +10 -0
- package/cli/orp.py +668 -43
- package/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md +125 -0
- package/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md +97 -0
- package/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md +100 -0
- package/docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md +116 -0
- package/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md +86 -0
- package/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md +261 -0
- package/docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md +131 -0
- package/docs/ORP_REASONING_KERNEL_EVOLUTION.md +123 -0
- package/docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md +107 -0
- package/docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md +471 -0
- package/docs/ORP_REASONING_KERNEL_V0_1.md +15 -0
- package/docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json +796 -0
- package/docs/benchmarks/orp_reasoning_kernel_agent_replication_task_smoke.json +487 -0
- package/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_1.json +1927 -0
- package/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json +10217 -0
- package/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_task_smoke.json +174 -0
- package/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json +598 -0
- package/docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json +688 -0
- package/docs/benchmarks/orp_reasoning_kernel_continuation_task_smoke.json +150 -0
- package/docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json +448 -0
- package/docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json +594 -0
- package/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json +925 -0
- package/examples/README.md +2 -0
- package/examples/kernel/comparison/comparison-corpus.json +337 -0
- package/examples/kernel/comparison/next-task-continuation.json +55 -0
- package/examples/kernel/corpus/operations/habanero-routing.checkpoint.kernel.yml +12 -0
- package/examples/kernel/corpus/operations/runner-routing.policy.kernel.yml +9 -0
- package/examples/kernel/corpus/product/project-home.decision.kernel.yml +11 -0
- package/examples/kernel/corpus/research/kernel-handoff.experiment.kernel.yml +16 -0
- package/examples/kernel/corpus/research/lane-drift.hypothesis.kernel.yml +11 -0
- package/examples/kernel/corpus/software/trace-widget.task.kernel.yml +13 -0
- package/examples/kernel/corpus/writing/kernel-launch.result.kernel.yml +12 -0
- package/package.json +4 -1
- package/scripts/orp-kernel-agent-pilot.py +673 -0
- package/scripts/orp-kernel-agent-replication.py +307 -0
- package/scripts/orp-kernel-benchmark.py +921 -0
- package/scripts/orp-kernel-canonical-continuation.py +381 -0
- package/scripts/orp-kernel-ci-check.py +138 -0
- package/scripts/orp-kernel-comparison.py +592 -0
- package/scripts/orp-kernel-continuation-pilot.py +384 -0
- package/scripts/orp-kernel-pickup.py +401 -0
- package/spec/v1/kernel-extension.schema.json +96 -0
- package/spec/v1/kernel-proposal.schema.json +115 -0
- package/spec/v1/kernel.schema.json +2 -1
package/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md
@@ -0,0 +1,261 @@

# ORP Reasoning Kernel Evaluation Plan

This document turns the remaining kernel evidence gaps into concrete next
experiments.

The goal is to upgrade the kernel from:

- technically valid and operationally useful

to:

- comparatively validated against real alternatives and real project outcomes

Supporting references:

- [docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md)
- [docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md)
- [docs/ORP_REASONING_KERNEL_AGENT_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md)
- [docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md)
- [docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md)
- [docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md)
- [docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md)
- [docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md)

## Evaluation Principles

Every kernel evaluation should be:

- comparative, not just descriptive
- cross-domain where possible
- judged by downstream usefulness, not only schema validity
- reproducible and artifact-backed

The main alternatives to compare against are:

1. Free-form artifact writing
2. Generic checklist artifact writing
3. ORP typed kernel artifact writing

The current internal package now includes:

- deterministic structural comparison
- deterministic pickup proxy
- live fresh-agent recoverability simulation
- live fresh-agent replication
- softer downstream continuation
- harder canonical downstream continuation
- CI threshold checks over the committed benchmark package

## Experiment 1: Artifact Clarity And Completeness

Status:

- a first deterministic internal comparison harness now exists
- what remains is blinded human scoring and a larger prompt set

### Question

Does the ORP kernel produce more complete and legible promotable artifacts than
free-form writing or a generic checklist?

### Setup

- Select 20 prompts spread across:
  - software
  - research
  - product/design
  - operations
  - writing/knowledge work
- For each prompt, produce three artifact versions:
  - free-form
  - generic checklist
  - ORP kernel

### Scoring

Blind-review each artifact against:

- artifact type clarity
- boundary clarity
- constraint clarity
- evaluation clarity
- handoff readiness
- ambiguity remaining

### Primary metric

- mean reviewer score by condition

### Success criterion

- kernel condition beats free-form and generic checklist on at least four of
  six scoring dimensions
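A minimal sketch of how this tally could be computed. The `(condition, dimension, score)` row shape and both helper names are assumptions for illustration, not part of the shipped harness:

```python
from collections import defaultdict

def mean_scores_by_condition(rows):
    """Average blinded reviewer scores per condition and dimension.

    rows: iterable of (condition, dimension, score) tuples.
    Returns {condition: {dimension: mean_score}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for condition, dimension, score in rows:
        buckets[condition][dimension].append(score)
    return {
        cond: {dim: sum(vals) / len(vals) for dim, vals in dims.items()}
        for cond, dims in buckets.items()
    }

def dimensions_won(means, candidate, baselines):
    """Count scoring dimensions where candidate strictly beats every baseline."""
    wins = 0
    for dim, score in means[candidate].items():
        if all(score > means[b].get(dim, float("inf")) for b in baselines):
            wins += 1
    return wins
```

The success criterion above would then read as `dimensions_won(means, "kernel", ["free_form", "checklist"]) >= 4`.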
## Experiment 2: Handoff Pickup Study

Status:

- a first deterministic pickup proxy now exists
- a first live Codex pickup simulation now exists
- a `10`-repeat live Codex replication layer now exists with per-field stability tables
- what remains is live human pickup measurement and broader cross-model replication

### Question

Does the ORP kernel improve pickup quality for another human or agent?

### Setup

- Use matched artifacts from Experiment 1
- Give a second operator one artifact at a time and ask them to:
  - explain the task
  - state the constraints
  - identify success criteria
  - identify next action

### Scoring

- time to correct interpretation
- interpretation accuracy
- number of clarifying questions required

### Primary metric

- successful pickup rate without clarification

### Success criterion

- kernel artifacts reduce clarifying questions and increase correct pickup rate
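The primary metric here reduces to one fraction. A sketch, assuming each trial is recorded with `correct` and `clarifying_questions` keys (an assumed shape, not the project's actual log format):

```python
def pickup_rate_without_clarification(trials):
    """Fraction of pickup trials interpreted correctly with zero
    clarifying questions asked.

    trials: iterable of dicts with 'correct' (bool) and
    'clarifying_questions' (int) keys.
    """
    trials = list(trials)
    if not trials:
        return 0.0
    clean = sum(
        1 for t in trials
        if t["correct"] and t["clarifying_questions"] == 0
    )
    return clean / len(trials)
```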
## Experiment 3: Downstream Execution Quality

Status:

- a softer live continuation benchmark exists
- a harder canonical continuation benchmark now exists
- what remains is repeated canonical continuation, cross-model continuation,
  and true downstream execution quality beyond artifact production

### Question

Does kernel-structured promotion improve downstream execution or review
success?

### Setup

- Choose a fixed set of implementation or research tasks
- Feed matched task artifacts to agents or operators
- Compare execution using:
  - free-form task definition
  - generic checklist task definition
  - kernel task artifact

### Scoring

- completion rate
- rework rate
- reviewer acceptance
- alignment with stated constraints
- mismatch between claimed and delivered outcome

### Primary metric

- accepted completion rate with minimal rework

### Success criterion

- kernel condition improves acceptance or reduces rework materially
## Experiment 4: Cross-Domain Corpus Fit

### Question

Do the current kernel artifact classes fit real project work across domains?

### Setup

- Collect 50 to 100 real artifacts across:
  - software
  - research
  - design/product
  - ops/reliability
  - writing/editorial
- Map each artifact into the current kernel classes

### Scoring

- clean fit
- awkward fit
- no fit
- missing field pressure
- repeated need for new field or class

### Primary metric

- percent of artifacts that map cleanly without schema strain

### Success criterion

- at least 80 percent clean fit across the corpus
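A small sketch of the fit tally and threshold check, assuming each mapped artifact is labeled with one of three fit categories (the label strings are illustrative, not a committed vocabulary):

```python
from collections import Counter

def corpus_fit_summary(fit_labels):
    """Tally fit categories and compute the clean-fit percentage.

    fit_labels: iterable of per-artifact labels, each one of
    'clean', 'awkward', or 'no_fit'.
    Returns (Counter of labels, clean-fit percentage).
    """
    counts = Counter(fit_labels)
    total = sum(counts.values())
    clean_pct = 100.0 * counts["clean"] / total if total else 0.0
    return counts, clean_pct
```

The success criterion above corresponds to `clean_pct >= 80.0`.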
## Experiment 5: Operational Warning Value

### Question

Do soft-mode kernel warnings predict later rework or low-quality promotion?

### Setup

- instrument real ORP repos using soft-mode kernel gates
- log:
  - warning presence
  - later edits to the same artifact
  - eventual hard-mode pass/fail
  - downstream rework indicators

### Primary metric

- correlation between early warnings and later rework

### Success criterion

- warnings show predictive value strong enough to justify continued soft-mode
  emphasis
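With a binary warning flag per artifact and a later rework count, the metric above is a plain Pearson correlation (point-biserial in this binary case). A self-contained sketch:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    With xs as 0/1 warning-presence flags and ys as later rework
    counts, this is the point-biserial correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    if sd_x == 0 or sd_y == 0:
        return 0.0  # no variance: correlation undefined, report 0
    return cov / (sd_x * sd_y)
```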
## Suggested Order

The best order is:

1. Artifact clarity and completeness
2. Handoff pickup
3. Cross-domain corpus fit
4. Downstream execution quality
5. Operational warning value

That sequence gives the fastest evidence on whether the kernel is genuinely
useful before investing in longer operational studies.

## Minimal Evidence Package To Upgrade v0.1 Claims

If we want to move from “strong implementation” to “serious comparative
validation,” the smallest next package is:

1. a 20-prompt comparative artifact study
2. a pickup study on those same artifacts
3. a cross-domain corpus fit table

That would be enough to substantially strengthen the kernel’s public claims.

## Bottom Line

The kernel is already technically real and operationally validated.

What remains is comparative evidence:

- better than free-form?
- better than a generic checklist?
- good across real domains?
- useful in real handoffs?

This plan defines how to answer those questions cleanly.
package/docs/ORP_REASONING_KERNEL_EVIDENCE_MATRIX.md
@@ -0,0 +1,131 @@

# ORP Reasoning Kernel Evidence Matrix

This document separates what the ORP Reasoning Kernel currently proves from
what it only suggests.

Its purpose is to prevent the kernel from being over-claimed. The kernel is
stronger when we can say, precisely:

- what is already validated
- what is only partially supported
- what is still unproven
- what experiment would close the gap

Supporting references:

- [docs/ORP_REASONING_KERNEL_V0_1.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_V0_1.md)
- [docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_TECHNICAL_VALIDATION.md)
- [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json)
- [docs/ORP_REASONING_KERNEL_AGENT_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md)
- [docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md)
- [docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md)
- [docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md)

## Evidence Grades

- `A`
  Directly supported by shipped implementation, tests, and repeatable benchmark
  evidence in this repo.
- `B`
  Strongly supported by implementation behavior and design logic, but still
  missing comparative or external validation.
- `C`
  Plausible and well-motivated, but not yet measured directly.
- `D`
  Strategic aspiration only. No meaningful validation yet.

## What Is Sealed For v0.1

These claims are strong enough to treat as validated implementation truths for
the current kernel release.

| Claim | Grade | Current Evidence | Why It Matters |
| --- | --- | --- | --- |
| ORP has a real typed kernel artifact surface. | A | [spec/v1/kernel.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel.schema.json), [cli/orp.py](/Volumes/Code_2TB/code/orp/cli/orp.py) | The kernel is not just prose. It is an enforceable CLI surface. |
| `orp init` seeds a valid starter kernel artifact and validates it in the default flow. | A | [tests/test_orp_init.py](/Volumes/Code_2TB/code/orp/tests/test_orp_init.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | New repos get the kernel by default instead of needing manual adoption. |
| All seven v0.1 artifact classes can scaffold and validate successfully. | A | [tests/test_orp_kernel.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel is broad enough for multiple project artifact types. |
| Hard mode blocks invalid promotable artifacts. | A | [tests/test_orp_kernel.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | ORP can enforce structural promotion standards rather than only advising. |
| Soft mode records invalidity without blocking work. | A | [tests/test_orp_kernel.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | ORP can stay fluid at intake while still surfacing missing structure. |
| Existing `structure_kernel` gates remain compatible when no explicit kernel config is present. | A | [tests/test_orp_kernel.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel does not silently break earlier ORP configurations. |
| One-shot local kernel CLI operations are within human-scale latency on the reference machine. | A | [scripts/orp-kernel-benchmark.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-benchmark.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel is operationally lightweight enough to use during normal work. |
| A small cross-domain reference corpus fits the current class set cleanly. | A | [examples/kernel/corpus](/Volumes/Code_2TB/code/orp/examples/kernel/corpus), [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel now has explicit cross-domain fit evidence, not only rationale. |
| Each artifact class rejects a candidate when a required field is removed. | A | [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | Class-specific enforcement is directly proven instead of inferred from a subset of cases. |
| The CLI validator stays aligned with the published kernel schema. | A | [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel no longer relies on an undocumented validator rule set drifting away from the schema. |
| Equivalent YAML and JSON artifacts validate to the same semantic result. | A | [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The protocol is representation-stable rather than format-sensitive. |
| The validator rejects adversarial near-miss artifacts. | A | [tests/test_orp_kernel_corpus.py](/Volumes/Code_2TB/code/orp/tests/test_orp_kernel_corpus.py), [docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_v0_1_validation.json) | The kernel is stronger against malformed or gameable inputs than before. |
| On a matched internal comparison corpus, kernel artifacts outperform both free-form and generic checklist artifacts on structural scoring. | A | [docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_COMPARISON_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_comparison_v0_1.json), [scripts/orp-kernel-comparison.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-comparison.py) | ORP now has direct comparative evidence for structural artifact quality on a matched internal corpus, not only rationale. |
| On a matched internal pickup proxy, kernel artifacts preserve more explicit handoff-critical information than both free-form and generic checklist artifacts. | A | [docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_PICKUP_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json), [scripts/orp-kernel-pickup.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-pickup.py) | ORP now has a second comparative signal showing that kernel structure turns into more explicit pickup value, not just fuller-looking artifacts. |
| On a matched live Codex recoverability simulation, kernel artifacts preserve full required-field recoverability, outperform free-form artifacts on all matched cases, and outperform generic checklist artifacts on average without per-case losses. | A | [docs/ORP_REASONING_KERNEL_AGENT_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_agent_pilot_v0_1.json), [scripts/orp-kernel-agent-pilot.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-agent-pilot.py) | ORP now has direct in-environment agent evidence that the kernel’s structural advantage survives contact with a real fresh downstream Codex session. |
| On a `10`-repeat full-corpus live Codex replication pilot, the kernel’s recoverability advantage stays stable across fresh-session reruns, with zero invention, no run-level losses, and perfect per-field stability on required kernel fields. | A | [docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_AGENT_REPLICATION.md), [docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_agent_replication_v0_2.json), [scripts/orp-kernel-agent-replication.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-agent-replication.py) | ORP now has stronger repeatability evidence that the live agent result is not just a single-run artifact and that the structural advantage survives at field level, not only in aggregate means. |
| On a matched full-corpus live continuation pilot, kernel artifacts support the strongest continuation score, never underperform the generic checklist baseline, and keep invention at zero. | A | [docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CONTINUATION_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_continuation_v0_1.json), [scripts/orp-kernel-continuation-pilot.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-continuation-pilot.py) | ORP now has direct agent-first evidence that kernel artifacts are not only recoverable, but also a safe and effective base for downstream continuation. |
| On a harder matched full-corpus canonical continuation pilot, kernel artifacts beat free-form on every case, beat checklist on average, and keep the lowest invention rate while revealing checklist as a real competitive baseline on some cases. | A | [docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_CANONICAL_CONTINUATION_PILOT.md), [docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_canonical_continuation_v0_1.json), [scripts/orp-kernel-canonical-continuation.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-canonical-continuation.py) | ORP now has a stricter downstream-agent benchmark where the task is not merely “continue safely,” but “produce the next canonical artifact” without inventing unsupported structure. |

## What Is Strong But Not Fully Sealed

These claims are directionally convincing, but still need comparative or
broader validation before they should be presented as fully proven.

| Claim | Grade | Current Evidence | Missing Evidence | Best Next Experiment |
| --- | --- | --- | --- | --- |
| The seven chosen artifact classes are a good universal first set. | B | Cross-domain rationale, passing roundtrip coverage, a small five-domain reference corpus, and full required-field enforcement across all classes | Larger real-world corpus coverage across multiple project types | Cross-domain corpus fit study at real-project scale |
| The kernel reduces accidental chat-to-truth drift. | B | ORP’s promotion model and hard/soft gate behavior | Before/after repo evidence showing fewer ambiguous promotable artifacts | Comparative artifact hygiene study |
| The kernel is better than a generic checklist. | B | Typed classes give stronger semantic requirements than one generic field list, the internal comparison pilot shows a structural scoring advantage, the live Codex pilot shows a higher mean recoverability score with no per-case losses, and the harder canonical continuation pilot still favors the kernel on aggregate while showing checklist is genuinely competitive | Blinded human review and downstream task outcomes | Structured baseline comparison |
| The kernel is better than free-form artifact writing alone. | B | Implementation logic is strong, promotion semantics exist, the internal comparison and pickup pilots favor the kernel, and the live Codex pilot shows a `7/7` case advantage on recoverability | Outcome comparison on real artifacts and reviewers | Free-form vs kernel artifact review study |
| The kernel can scale across agents and handoffs. | B | Machine-checkable artifacts, visible run traces, a matched internal pickup proxy, a live Codex recoverability pilot showing stronger downstream field recovery, and first replication/continuation smokes that preserve the same general pattern | Multi-operator pickup data across models and humans | Handoff pickup experiment |

## What Is Still Unproven

These are important claims, but they should not be stated as established fact
yet.

| Claim | Grade | Why It Is Not Yet Proven | What Would Prove It |
| --- | --- | --- | --- |
| Kernel use improves downstream project outcomes. | D | No controlled comparison yet between kernel and non-kernel workflows | Comparative study over real tasks with outcome scoring |
| Kernel-aware agents produce better work than agents without kernel structure. | D | No A/B benchmark on identical prompts and acceptance criteria | Agent benchmark across matched tasks |
| Kernel warnings correlate with later rework or quality failures. | D | No operational data collection yet | Longitudinal study on soft-mode warnings vs later edits |
| The current class set is close to optimal. | D | No competing class-model comparison or pruning study | Corpus analysis plus ablation tests |
| The kernel is equally suitable across software, research, ops, and writing domains. | C | Small cross-domain corpus fit is good, but no broad real-project or reviewer evidence yet | Cross-domain corpus and reviewer study |

## Research Questions That Matter Most

If we want to move from “technically valid” to “research-grade validated,” the
highest-value questions are:

1. Does the kernel reduce ambiguity relative to free-form project artifacts?
2. Does the kernel improve handoff pickup quality for humans and agents?
3. Does the kernel improve downstream implementation or review success?
4. Is the typed class model actually better than a simpler generic checklist?
5. Where does the current class set fail across real project domains?

## Recommended Public Wording Right Now

These are safe claims now:

- ORP ships a typed reasoning-kernel artifact layer.
- ORP can scaffold, validate, and gate promotable kernel artifacts.
- ORP supports hard and soft promotion semantics for kernel validation.
- ORP includes repeatable benchmark evidence for the current implementation.

These should still be framed as goals or hypotheses:

- the kernel improves project outcomes
- the kernel is superior to other artifact-structuring approaches
- the current class set is the right final ontology

## Bottom Line

The kernel is sealed for `v0.1` as an implementation and protocol surface.

Its internal validity is now much stronger than a simple happy-path release:

- schema and validator rules align directly
- every required field is exercised through ablation
- equivalent YAML and JSON artifacts behave the same
- adversarial near-miss artifacts are rejected
- a small cross-domain corpus fits the current class set
- a live fresh-agent Codex pilot shows the kernel’s recoverability advantage survives beyond deterministic local scoring
- a `10`-repeat full-corpus replication pilot and two levels of continuation pilot suggest that advantage is reasonably stable and carries into downstream agent continuation behavior

It is not yet sealed as a universally outcome-superior project methodology.

That is a good place to be if we stay explicit about the difference.
@@ -0,0 +1,123 @@

# ORP Reasoning Kernel Evolution

This document defines how the ORP kernel should get stronger over time without
becoming slippery.

## Core Rule

The kernel should be:

- stable as a contract
- adaptive through governed evolution
- never silently rewritten by agents on the fly

Short form:

- stable core
- observable pressure
- explicit evolution

## What Should Stay Stable

The current core kernel remains the canonical source of truth for:

- artifact classes
- required-field rules
- hard vs soft promotion behavior
- machine-readable kernel validation in `RUN.json`

Those semantics live in:

- [spec/v1/kernel.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel.schema.json)
- [cli/orp.py](/Volumes/Code_2TB/code/orp/cli/orp.py)

The kernel should not self-mutate from a single chat, a single agent guess, or
one repo’s habits.

## What Should Evolve

The kernel should evolve from evidence about real use:

- repeated missing fields
- repeated field invention
- repeated continuation failures
- recurring requests for the same extra structure
- extension fields that become broadly useful

That evidence should shape proposals and migrations, not rewrite the core
implicitly.

## CLI Surfaces

ORP now exposes three explicit kernel-evolution surfaces:

### `orp kernel stats`

Observe real kernel-validation pressure from `RUN.json` artifacts.

Use it to answer questions like:

- which fields are repeatedly missing?
- which artifact classes fail most often?
- where is the current kernel strained in live repo usage?
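
The kind of aggregation `orp kernel stats` performs can be sketched as a small tally over `RUN.json` files. This is an illustration only; the `kernel_validation`, `passed`, `artifact_class`, and `missing_fields` names below are assumptions about the artifact layout, not the actual `RUN.json` contract:

```python
import json
from collections import Counter
from pathlib import Path

def kernel_pressure(run_files):
    """Tally missing fields and failing classes across RUN.json artifacts."""
    missing = Counter()
    failing_classes = Counter()
    for path in run_files:
        run = json.loads(Path(path).read_text())
        # Assumed shape: a list of per-artifact validation results.
        for check in run.get("kernel_validation", []):
            if not check.get("passed", True):
                failing_classes[check.get("artifact_class", "unknown")] += 1
                missing.update(check.get("missing_fields", []))
    return missing, failing_classes
```

The point of the sketch is the shape of the answer: two counters that directly answer "which fields?" and "which classes?" without any interpretation layer in between.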

### `orp kernel propose`

Scaffold a governed kernel-evolution proposal artifact.

Use it for changes like:

- adding a field
- introducing a new artifact class
- changing a requirement
- deprecating an old field

Proposal shape is governed by:

- [spec/v1/kernel-proposal.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel-proposal.schema.json)
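
For illustration only, a proposal artifact under that schema might look roughly like the sketch below. Every field name here is an assumption; the authoritative shape is `spec/v1/kernel-proposal.schema.json`:

```json
{
  "proposal_id": "kp-2025-001",
  "kind": "add_field",
  "target_class": "checkpoint",
  "motivation": "repeated missing risk context observed in live pickup",
  "evidence": ["docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json"],
  "migration_note": "older artifacts default the new field to null",
  "status": "draft"
}
```

Whatever the real field names are, the essential property is the same: the change, its evidence, and its migration story are written down before the kernel moves.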

### `orp kernel migrate`

Rewrite an artifact into the current canonical field order and schema version.

Use it to:

- normalize older artifacts
- apply explicit schema-version upgrades
- preserve stable truth while the kernel evolves
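
The normalization `orp kernel migrate` performs can be sketched as a key-reordering pass that never drops data. The canonical order shown is a placeholder assumption, not the real schema order:

```python
# Assumed canonical order; the real order comes from spec/v1/kernel.schema.json.
CANONICAL_ORDER = ["schema_version", "artifact_class", "object",
                   "constraints", "success_criteria"]

def migrate(artifact, target_version="v1"):
    """Rewrite an artifact dict into canonical field order and bump its version.

    Unknown fields are preserved after the canonical ones so no truth is lost.
    """
    out = {"schema_version": target_version}
    for key in CANONICAL_ORDER[1:]:
        if key in artifact:
            out[key] = artifact[key]
    for key, value in artifact.items():
        if key not in out and key != "schema_version":
            out[key] = value
    return out
```

The design choice worth noting is the second loop: migration reorders and re-versions, but it never deletes, which is what "preserve stable truth" means operationally.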

## Extensions

The best place for new pressure to land first is usually not the core kernel.
It should begin as an extension or proposal before becoming universal.

Extension shape is defined in:

- [spec/v1/kernel-extension.schema.json](/Volumes/Code_2TB/code/orp/spec/v1/kernel-extension.schema.json)

That gives ORP a place to trial domain-specific structure without forcing it
into every project prematurely.

## Recommended Kernel Evolution Loop

1. Observe pressure with `orp kernel stats`
2. Write an explicit proposal with `orp kernel propose`
3. Test the proposal against the benchmark corpus and live agent pickup/continuation
4. If accepted, version the schema deliberately
5. Normalize older artifacts with `orp kernel migrate`
6. Protect the committed evidence package with CI threshold checks
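
Step 6 can be as small as a CI guard that fails when a committed evidence metric regresses. A minimal sketch, with a hypothetical metric name and threshold (the real metric names live in the benchmark JSON files under `docs/benchmarks/`):

```python
import json

# Hypothetical floors; real values belong with the committed evidence package.
THRESHOLDS = {"kernel_mean_pickup_score": 0.95}

def check_evidence(report_path):
    """Return a nonzero exit code if any tracked metric drops below its floor."""
    report = json.load(open(report_path))
    failures = [
        f"{metric}: {report.get(metric)} < {floor}"
        for metric, floor in THRESHOLDS.items()
        if report.get(metric, 0.0) < floor
    ]
    for line in failures:
        print("threshold regression:", line)
    return 1 if failures else 0
```

Wired into CI as the final gate, this turns "protect the evidence" from a norm into a mechanical check.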

## Non-Goal

The kernel should not become a hidden adaptive prompt system that silently
changes what truth means.

The repository should always be able to answer:

- which kernel version is in effect?
- why was it changed?
- what evidence justified the change?
- how do older artifacts migrate safely?

That is the ORP standard for a living protocol: dynamic in evidence, explicit
in truth.

@@ -0,0 +1,107 @@

# ORP Reasoning Kernel Pickup Pilot

This document records the first in-repo pickup and handoff proxy for the ORP
Reasoning Kernel.

Supporting artifact:

- [docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json](/Volumes/Code_2TB/code/orp/docs/benchmarks/orp_reasoning_kernel_pickup_v0_1.json)

Supporting corpus and harness:

- [examples/kernel/comparison/comparison-corpus.json](/Volumes/Code_2TB/code/orp/examples/kernel/comparison/comparison-corpus.json)
- [scripts/orp-kernel-pickup.py](/Volumes/Code_2TB/code/orp/scripts/orp-kernel-pickup.py)

## What This Pilot Measures

This pilot measures **explicit pickup readiness** on a matched internal corpus.

For each artifact class, it asks whether a downstream operator could recover
the class-specific pickup targets directly from the artifact itself.

Examples:

- `task`
  - object
  - constraints
  - success criteria
- `decision`
  - question
  - chosen path
  - consequences
- `checkpoint`
  - current state
  - risks
  - next handoff target

The pilot compares:

1. free-form artifact writing
2. generic checklist artifact writing
3. ORP typed kernel artifact writing
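
The deterministic check behind these pickup targets can be sketched as a target-coverage scorer. This is a simplified illustration, not the actual logic in `scripts/orp-kernel-pickup.py`, and the field names are assumptions:

```python
# Class-specific pickup targets, mirroring the examples above.
PICKUP_TARGETS = {
    "task": ["object", "constraints", "success_criteria"],
    "decision": ["question", "chosen_path", "consequences"],
    "checkpoint": ["current_state", "risks", "next_handoff_target"],
}

def pickup_score(artifact):
    """Fraction of class-specific targets explicitly answerable from the artifact."""
    targets = PICKUP_TARGETS[artifact["class"]]
    answered = sum(1 for t in targets if artifact.get(t) not in (None, "", []))
    return answered / len(targets)
```

Because the score only counts what is explicitly present, a free-form artifact that "implies" its constraints still scores low, which is exactly the property the pilot is probing.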

## What This Is And Is Not

This is a **pickup proxy**, not a full live handoff study.

It does **not** prove:

- time-to-understanding with real operators
- clarification count under live review
- downstream execution quality

It **does** prove something narrower and still valuable: how much
pickup-critical information remains explicitly recoverable in the artifact
itself.

## Current Result

On the matched internal corpus, the current report shows:

- kernel mean pickup score: `1.000`
- generic checklist mean pickup score: `0.743`
- free-form mean pickup score: `0.452`

Pairwise result:

- kernel beats generic checklist on `7/7` cases
- kernel beats free-form on `7/7` cases
- generic checklist beats free-form on `7/7` cases

Additional result:

- kernel keeps all pickup targets explicitly answerable on the matched corpus

## Why This Matters

This pilot strengthens the kernel story in a way the pure structure benchmark
could not.

The earlier comparison pilot showed that kernel artifacts are structurally
fuller than the simpler alternatives.

This pickup pilot shows that the added structure is not decorative. It turns
into directly recoverable handoff value.

That is important because ORP is not trying to optimize for pretty artifacts.
It is trying to optimize for artifacts that another human or agent can pick up
and continue without confusion.

## Honest Caveat

This pilot still does not seal the kernel as a universally outcome-superior
methodology.

It is stronger evidence than a rationale-only claim, but it remains an
internal, deterministic proxy. The next step after this is still a live
human/agent pickup study as described in
[docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md](/Volumes/Code_2TB/code/orp/docs/ORP_REASONING_KERNEL_EVALUATION_PLAN.md).

## Bottom Line

The pickup pilot makes the kernel harder to dismiss.

We now have evidence that on a matched internal corpus, kernel artifacts do
not just score as more structured. They also preserve more explicit handoff
value than free-form and generic checklist alternatives.